Cloud LLMs are often the best option for product quality, but they are difficult to use safely when prompts may contain personally identifiable information (PII). A support summary, patient note, onboarding workflow, or internal ticket can easily include names, emails, phone numbers, account identifiers, or other sensitive values that should not be sent to a third-party model provider.
A local AI proxy gives teams a practical middle layer. Instead of choosing between a powerful cloud model and a weaker fully local model, you can inspect prompts inside your own environment, replace sensitive values with placeholders, send only the sanitized prompt to the cloud LLM, and restore the original values before returning the response.
In this tutorial, we’ll build that pattern in Node.js. The proxy will use a local small language model (SLM) for PII detection, forward the sanitized prompt to Gemini, and rehydrate the response before sending it back to the user. You’ll also add a deterministic leak-check test suite so CI fails when known PII stops being detected.
By the end, you’ll have:
- An Express proxy endpoint that scrubs prompts locally before every Gemini call
- A detection pipeline that combines GLiNER with regex patterns for structured values
- A request-scoped entity map that rehydrates responses and is destroyed after each request
- A deterministic leak-check test suite that fails CI when known PII stops being detected
- A Dockerfile for shipping the proxy together with the local model
Cloud-only AI workflows usually fail in one of three ways: teams send sensitive user data to a third-party model, avoid cloud models entirely and accept lower-quality local output, or rely on brittle sanitization logic that misses context-dependent PII.
The first failure is the most obvious. If raw prompts include customer names, employee records, patient details, or account identifiers, sending those prompts to a cloud LLM may create privacy, compliance, and contractual risk. This is the same architectural tension behind many local-first AI systems: the model should be powerful enough to be useful, but the data boundary should remain explicit. Even when the provider has strong security controls, many organizations still need to minimize what leaves their network.
The second failure is more subtle. Some teams respond by banning cloud LLMs outright and moving every task to local models. That can work for narrow use cases, but it often hurts product quality when the task requires stronger reasoning, better instruction following, or broader language coverage.
The third failure is overconfidence in simple redaction. Regex can catch predictable strings like jane.doe@example.com, but it cannot reliably identify Sarah Chen as a person, distinguish Lincoln the person from Lincoln Memorial, or catch organization-specific identifiers without custom rules.
The AI proxy pattern gives you another option: keep sensitive values local while still using a stronger cloud LLM for the actual generation step.
A local AI proxy is middleware that sits between your application and an LLM provider. Every outbound prompt passes through the proxy before it reaches the cloud model.
For PII redaction, the proxy does four things:
- Detects sensitive spans in the raw prompt with a local model and regex patterns
- Replaces each detected value with a placeholder token
- Sends only the sanitized prompt to the cloud LLM
- Restores the original values in the model’s response before returning it to the user
The key component is a local SLM. In this tutorial, we’ll use GLiNER, a lightweight NER model that can recognize entity types you provide at inference time. Instead of asking a general-purpose model to generate a redaction plan, GLiNER returns entity spans with labels and offsets. That is a better fit for security-sensitive preprocessing because the output is structured and easier to validate.
We’ll also keep regex in the pipeline. GLiNER is useful for contextual entities like people, places, and organizations, while regex is still the right tool for highly structured values like email addresses and many phone number formats.
The local proxy pattern is not the only way to reduce PII exposure. It is most useful when you need cloud-model quality but cannot send raw prompts directly to a model provider.
| Approach | How it works | Where it helps | Main limitation |
|---|---|---|---|
| Cloud-only LLM | Send raw prompts directly to a hosted model | Fastest path to high-quality output | Sensitive values may leave your environment |
| Local-only LLM | Run the full generation model locally | Strongest data locality | Lower output quality or higher infrastructure cost for complex tasks |
| Regex-only redaction | Replace known structured patterns before sending prompts | Emails, phone numbers, IDs with predictable formats | Misses context-dependent PII like names |
| Local AI proxy | Detect and mask PII locally, then call a cloud LLM with placeholders | Balances privacy and output quality | Adds latency, state management, and test requirements |
Use the proxy pattern when cloud LLM quality matters, but raw user data should not leave your infrastructure. Avoid it when the task requires the model to reason over the exact sensitive value itself, or when fully local generation already meets the product requirement.
Before writing code, it helps to understand the request lifecycle. Each user prompt moves through four stages:
1. Detection: the local SLM and regex patterns identify sensitive spans in the raw prompt.
2. Masking: each detected value is replaced with a typed placeholder such as <PERSON_0> or <EMAIL_0>, and the proxy records the mapping in a short-lived request context. Duplicate values receive the same token.
3. Cloud call: Gemini receives only the sanitized prompt, for example "Please copy <PERSON_0> on that reply", and generates a response using those placeholders.
4. Rehydration: the proxy swaps the placeholders back to the original values before returning the response to the user.
The critical privacy boundary is the model request. Gemini never sees the original values, only the placeholder tokens. The originals stay in process memory for the duration of the request.
Classic NER models are usually trained around a fixed set of labels, such as PER, ORG, and LOC. GLiNER is more flexible: you provide labels at inference time, such as person, email, credit_card, or account, and the model predicts spans matching those labels.
That makes GLiNER useful for PII detection because you can expand or narrow the entity list without retraining the model. It also avoids the fragility of prompting a general-purpose SLM to return redaction output. A general model might produce inconsistent JSON, include explanations, or hallucinate an entity that was not in the prompt. GLiNER’s job is narrower: return labeled spans.
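To make that concrete, here is the rough shape of the spans this tutorial relies on (label, start, end). The sentence and offsets below are illustrative, and the library may return additional fields:

```js
// Input: "Email Sarah Chen at sarah.chen@example.com today."
// Illustrative spans for labels supplied at inference time:
const spans = [
  { label: "person", start: 6, end: 16 }, // "Sarah Chen"
  { label: "email", start: 20, end: 42 }, // "sarah.chen@example.com"
];
```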
That said, GLiNER does not replace deterministic checks. For obvious structured patterns, regex remains faster and easier to reason about. The strongest pipeline uses both: NER for context-sensitive values and regex for predictable identifiers.
We’ll use the following stack:
- gliner for running the local GLiNER model from Node.js
- @google/genai, Google’s current Gen AI JavaScript SDK, for Gemini calls
- The node:test runner for leak-check tests
The examples use Node.js’s built-in --env-file flag to load environment variables from .env, so you do not need dotenv. The flag was introduced in Node.js v20.6.0 and is available in Node.js v22.
Create the project directory and move into it:
mkdir pii-proxy
cd pii-proxy
Initialize the Node.js project:
npm init -y
Open package.json and add "type": "module" so the project can use ES module syntax. While you’re there, replace the placeholder test script with scripts for development, production, and tests:
{
"name": "pii-proxy",
"type": "module",
"version": "1.0.0",
"description": "Local AI proxy for PII redaction before cloud LLM calls",
"main": "server.js",
"scripts": {
"dev": "node --env-file=.env --watch server.js",
"start": "node --env-file=.env server.js",
"test": "node --test test/leak-check.js"
},
"keywords": [],
"author": "",
"license": "ISC"
}
Install the required dependencies:
npm install express cors gliner @google/genai
Here’s what each dependency does:
express provides the HTTP server frameworkcors enables cross-origin requests, which is useful if a frontend runs on a different portgliner runs the GLiNER model from Node.js@google/genai provides the current Google Gen AI JavaScript SDK for GeminiCreate a .env file for your Gemini API key:
echo "GEMINI_API_KEY=your_actual_api_key_here" > .env
Replace your_actual_api_key_here with a valid key from Google AI Studio. Never commit .env to source control.
The GLiNER ONNX model must be available before the server can run. Visit the onnx-community/gliner_medium-v2.1 model page on Hugging Face, open the onnx/ directory, and download the ONNX model file. For example, you can use model_int8.onnx.
Rename the downloaded file to gliner_medium-v2.1.onnx, then create a model directory in the project root and place the renamed file there:
mkdir model
# Move the downloaded file into model/gliner_medium-v2.1.onnx
The tokenizer files can still be downloaded and cached automatically from Hugging Face on first use, but the larger ONNX model weights will load from the local file you just added.
The NER pipeline is the local detection layer. It reads raw text, identifies sensitive spans, and returns the offsets needed for masking.
Create a new file named ner.js. Start by importing the Gliner class and declaring a module-level variable for the model instance:
import { Gliner } from "gliner/node";
let glinerInstance = null;
The glinerInstance variable starts as null. It is populated the first time the model is needed, which avoids loading the model during module import.
Next, add an asynchronous helper that lazily initializes and returns the GLiNER instance:
async function getGliner() {
if (!glinerInstance) {
glinerInstance = new Gliner({
tokenizerPath: "onnx-community/gliner_medium-v2.1",
onnxSettings: {
modelPath: "model/gliner_medium-v2.1.onnx",
},
});
await glinerInstance.initialize();
}
return glinerInstance;
}
The tokenizerPath points to the Hugging Face repository that contains the tokenizer files. The onnxSettings.modelPath value points to the local ONNX model file.
Now define the entity labels and regex patterns the pipeline should detect:
const ENTITY_TYPES = [
"person",
"email",
"phone",
"address",
"city",
"state",
"country",
"zipcode",
"ip_address",
"national_id",
"user_id",
"credit_card",
"account",
"token",
];
const REGEX_PATTERNS = [
{
type: "EMAIL",
pattern: /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g,
},
{
type: "PHONE",
pattern: /\b(?:\+?\d{1,3}[\s.\-]?)?(?:\(?\d{2,4}\)?[\s.\-]?)?\d{3,4}[\s.\-]?\d{3,4}\b/g,
},
];
You can extend or reduce ENTITY_TYPES depending on your product’s privacy requirements. For example, an internal support tool may need custom labels or regex patterns for employee IDs, ticket IDs, or account numbers.
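As a sketch, a hypothetical employee-ID format such as EMP-482913 could be covered with one extra regex entry:

```js
// Hypothetical pattern for internal employee IDs like "EMP-482913".
REGEX_PATTERNS.push({
  type: "EMPLOYEE_ID",
  pattern: /\bEMP-\d{6}\b/g,
});
```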
Add a helper function to validate email strings and filter out obvious false positives:
function isValidEmail(email) {
if (/\.{2,}/.test(email)) return false;
if (/^\.|\.$/.test(email)) return false;
const atIndex = email.indexOf("@");
if (atIndex === -1) return false;
const local = email.slice(0, atIndex);
const domain = email.slice(atIndex + 1);
if (local.length === 0 || domain.length === 0) return false;
if (!domain.includes(".")) return false;
if (/[^a-zA-Z0-9._%+\-]/.test(local)) return false;
if (/[^a-zA-Z0-9.\-]/.test(domain)) return false;
return true;
}
This function rejects invalid email candidates before they can be masked. It applies to both model-detected email spans and regex matches.
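A few illustrative calls show what the helper accepts and rejects:

```js
isValidEmail("jane.doe@example.com");  // true
isValidEmail("release@v2");            // false: the domain has no dot
isValidEmail("foo..bar@example.com");  // false: consecutive dots
```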
Now write the main detectEntities function:
export async function detectEntities(text) {
const gliner = await getGliner();
const options = {
flatNer: true,
threshold: 0.1,
multiLabel: false,
};
const results = await gliner.inference({
texts: [text],
entities: ENTITY_TYPES,
...options,
});
const glinerSpans = results[0] || [];
const glinerEntities = glinerSpans
.map((span) => ({
type: span.label.toUpperCase(),
start: span.start,
end: span.end,
text: text.slice(span.start, span.end),
}))
.filter((entity) => {
if (entity.type === "EMAIL") {
return isValidEmail(entity.text);
}
return true;
})
.map(({ type, start, end }) => ({ type, start, end }));
const regexEntities = [];
for (const { type, pattern } of REGEX_PATTERNS) {
for (const match of text.matchAll(pattern)) {
const candidate = match[0];
if (type === "EMAIL" && !isValidEmail(candidate)) {
continue;
}
regexEntities.push({
type,
start: match.index,
end: match.index + candidate.length,
});
}
}
return mergeSpans([...glinerEntities, ...regexEntities]);
}
The threshold value is intentionally low because false negatives are more dangerous than false positives in a privacy boundary. In production, tune this threshold against your own fixture set rather than treating 0.1 as a universal default.
Finally, add mergeSpans to remove overlapping detections:
function mergeSpans(entities) {
const sorted = [...entities].sort(
(a, b) => a.start - b.start || b.end - a.end
);
const merged = [];
for (const entity of sorted) {
const last = merged[merged.length - 1];
if (last && entity.start < last.end) continue;
merged.push(entity);
}
return merged;
}
The sort keeps earlier spans first. If two spans begin at the same index, the longer span wins. That prevents overlapping replacements from corrupting the masked prompt.
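Calling the helper directly shows the effect; the spans below are hypothetical:

```js
mergeSpans([
  { type: "PERSON", start: 6, end: 16 }, // "Sarah Chen" from GLiNER
  { type: "PERSON", start: 6, end: 11 }, // "Sarah", an overlapping duplicate
  { type: "EMAIL", start: 20, end: 42 },
]);
// => [ { type: "PERSON", start: 6, end: 16 }, { type: "EMAIL", start: 20, end: 42 } ]
```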
The entity map links placeholder tokens to original PII. It should only exist for the duration of one request.
Create a file named context.js and add the following RequestContext class:
export class RequestContext {
constructor() {
this.entityMap = {};
}
store(token, original) {
this.entityMap[token] = original;
}
restore(text) {
let result = text;
for (const [token, original] of Object.entries(this.entityMap)) {
result = result.replaceAll(token, original);
}
return result;
}
destroy() {
for (const key of Object.keys(this.entityMap)) {
delete this.entityMap[key];
}
}
}
The store method records a placeholder-to-original mapping. The restore method replaces placeholder tokens in the LLM response. The destroy method explicitly clears the map after the request completes.
In a simple Express handler, the context object would be eligible for garbage collection after the response returns. Clearing it manually adds a defensive layer in case another reference, such as a logger or debugging tool, accidentally persists longer than expected.
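A quick round trip shows how the three methods fit together:

```js
const context = new RequestContext();
context.store("<PERSON_0>", "Sarah Chen");
context.restore("Thanks for the update, <PERSON_0>!");
// => "Thanks for the update, Sarah Chen!"
context.destroy(); // entityMap is now empty
```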
Create a file named privacy.js. This file handles masking and rehydration.
Start by importing the detection function and defining scrub:
import { detectEntities } from "./ner.js";
export async function scrub(text, context) {
const entities = await detectEntities(text);
if (!entities.length) {
return text;
}
const textToToken = new Map();
const typeCounts = {};
for (const entity of entities) {
const original = text.slice(entity.start, entity.end);
if (!textToToken.has(original)) {
const count = typeCounts[entity.type] ?? 0;
typeCounts[entity.type] = count + 1;
const token = `<${entity.type}_${count}>`;
textToToken.set(original, token);
context.store(token, original);
}
}
const sorted = [...entities].sort((a, b) => b.start - a.start);
let result = text;
for (const entity of sorted) {
const original = text.slice(entity.start, entity.end);
const token = textToToken.get(original);
if (token) {
result = result.slice(0, entity.start) + token + result.slice(entity.end);
}
}
return result;
}
This function deduplicates identical PII strings. If the same email appears twice in a prompt, both occurrences receive the same placeholder token.
The replacement loop runs from right to left. That preserves the original character offsets because replacing a later span cannot shift the position of an earlier span.
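As an illustration, assuming the detector flags both occurrences of the email and nothing else, the same prompt masks like this (inside an async function):

```js
const context = new RequestContext();
const masked = await scrub(
  "Email jane.roe@example.com today and cc jane.roe@example.com on the invoice.",
  context
);
// masked === "Email <EMAIL_0> today and cc <EMAIL_0> on the invoice."
// context.entityMap === { "<EMAIL_0>": "jane.roe@example.com" }
```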
Now add the rehydrate helper:
export function rehydrate(text, context) {
return context.restore(text);
}
This small wrapper keeps the privacy layer responsible for both directions of the transformation: scrubbing outbound text and rehydrating inbound text.
Create a file named server.js. This file sets up the Express server, calls the privacy layer, forwards sanitized prompts to Gemini, and returns the rehydrated response.
Start with the imports and middleware configuration:
import express from "express";
import cors from "cors";
import { GoogleGenAI } from "@google/genai";
import { scrub, rehydrate } from "./privacy.js";
import { RequestContext } from "./context.js";
if (!process.env.GEMINI_API_KEY) {
throw new Error("GEMINI_API_KEY is required");
}
const app = express();
app.use(express.json());
app.use(cors());
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
The explicit API key check fails fast if the environment is misconfigured. That is preferable to starting a server that only fails once the first prompt arrives.
Add the main /prompt endpoint:
app.post("/prompt", async (req, res) => {
const { message } = req.body;
if (!message || typeof message !== "string") {
return res.status(400).json({ error: "message must be a non-empty string" });
}
const context = new RequestContext();
try {
const cleanText = await scrub(message, context);
const response = await ai.models.generateContent({
model: "gemini-2.5-flash",
contents: cleanText,
});
const finalReply = rehydrate(response.text, context);
return res.json({
reply: finalReply,
debugInfo:
process.env.NODE_ENV !== "production"
? { cleanText, entityMap: context.entityMap }
: undefined,
});
} catch (error) {
console.error("Prompt proxy failed", error);
return res.status(500).json({ error: "Failed to process prompt" });
} finally {
context.destroy();
}
});
The endpoint validates the incoming payload, creates a request-scoped RequestContext, scrubs the prompt, sends only the scrubbed prompt to Gemini, and rehydrates the response.
The debugInfo field is available only outside production. Treat it carefully: entityMap contains the original sensitive values and should never be exposed in production responses or logs.
Finally, start the server on port 3001:
app.listen(3001, () => {
console.log("Proxy running on http://localhost:3001");
});
Run the server in development mode:
npm run dev
The dev script uses node --env-file=.env --watch server.js. The --env-file flag loads GEMINI_API_KEY before the app starts, and --watch restarts the server when imported files change.
Test the proxy with curl:
curl -s -X POST http://localhost:3001/prompt \
-H "Content-Type: application/json" \
-d '{"message": "Draft a meeting invite for Sarah Chen at [email protected], next Tuesday at 2pm."}' \
| jq .
In a non-production response, inspect debugInfo.cleanText. You should see placeholders rather than raw PII:
{
"reply": "Subject: Meeting Invite – Tuesday at 2:00 PM\n\nHi Sarah,\n\nI'd like to schedule a meeting for next Tuesday at 2:00 PM. Please find the calendar invite attached.\n\nLooking forward to connecting.\n\nBest regards",
"debugInfo": {
"cleanText": "Draft a meeting invite for <PERSON_0> at <EMAIL_0>, next Tuesday at 2pm.",
"entityMap": {
"<PERSON_0>": "Sarah Chen",
"<EMAIL_0>": "[email protected]"
}
}
}
The cloud model receives cleanText, not the original prompt. The user receives the rehydrated response.
The proxy can work locally and still regress later. A threshold change, regex edit, dependency update, or model swap can reintroduce leaks. A small deterministic test suite gives you a CI gate for known PII cases.
Create a test directory:
mkdir test
Inside test, create pii-fixtures.js:
export const fixtures = [
{ text: "Please forward this to Alexandra Kovacs.", type: "PERSON", shouldRedact: true },
{ text: "The package is addressed to James O'Brien.", type: "PERSON", shouldRedact: true },
{ text: "CC: Dr. Yuki Tanaka on all replies.", type: "PERSON", shouldRedact: true },
{ text: "Ask John in accounting to approve it.", type: "PERSON", shouldRedact: true },
{ text: "Send the invoice to [email protected].", type: "EMAIL", shouldRedact: true },
{ text: "My work email is [email protected].", type: "EMAIL", shouldRedact: true },
{ text: "Call me at +1 (415) 555-0192.", type: "PHONE", shouldRedact: true },
{ text: "Fax: 020 7946 0988.", type: "PHONE", shouldRedact: true },
{ text: "The conference is held in Berlin.", type: "LOCATION", shouldRedact: false },
{ text: "The CEO signed off on Thursday.", type: "PERSON", shouldRedact: false },
{ text: "Our office is on Lincoln Avenue.", type: "LOCATION", shouldRedact: false },
];
The shouldRedact: false cases are as important as the positive cases. They catch over-scrubbing, such as mistaking a job title for a person.
Now create test/leak-check.js:
import { describe, it, before } from "node:test";
import assert from "node:assert/strict";
import { detectEntities } from "../ner.js";
import { fixtures } from "./pii-fixtures.js";
describe("PII leak-check", () => {
before(async () => {
await detectEntities("warm-up");
});
for (const fixture of fixtures) {
it(`${fixture.shouldRedact ? "detects" : "ignores"}: "${fixture.text.slice(0, 55)}"`, async () => {
const entities = await detectEntities(fixture.text);
const detectedTypes = entities.map((entity) => entity.type);
const wasDetected = detectedTypes.includes(fixture.type);
if (fixture.shouldRedact) {
assert.ok(
wasDetected,
`LEAK: "${fixture.text}": expected ${fixture.type} to be detected but it was not.`
);
} else {
assert.ok(
!detectedTypes.includes("PERSON"),
`OVER-SCRUB: "${fixture.text}": PERSON was detected but this sentence contains no name.`
);
}
});
}
});
The before hook warms up the model before the individual test cases run. This keeps model initialization from making the first fixture look unusually slow.
Run the tests:
npm test
If the model incorrectly labels a safe sentence as a person, the test output will identify the failing fixture:
✖ ignores: "The CEO signed off on Thursday." (482ms)
  AssertionError [ERR_ASSERTION]: OVER-SCRUB: "The CEO signed off on Thursday.": PERSON was detected but this sentence contains no name.
If you see over-scrubbing, tune the threshold value in ner.js or add post-filtering rules for known false positives. If you see leaks, add fixtures that reproduce them, then adjust the entity labels, regex patterns, or model choice until the test passes.
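One way to post-filter is a small denylist applied before detectEntities returns; the denylist contents here are examples, not a complete list of the titles your data will produce:

```js
// Drop PERSON spans whose text is actually a common role or department name.
const PERSON_DENYLIST = new Set(["ceo", "cto", "hr", "accounting"]);

function dropFalsePersons(entities, text) {
  return entities.filter((entity) => {
    if (entity.type !== "PERSON") return true;
    const span = text.slice(entity.start, entity.end).trim().toLowerCase();
    return !PERSON_DENYLIST.has(span);
  });
}

// In detectEntities, wrap the return value:
// return dropFalsePersons(mergeSpans([...glinerEntities, ...regexEntities]), text);
```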
For production, containerize the proxy and copy the local GLiNER model into the image. Create a Dockerfile in the project root:
FROM node:22-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3001
CMD ["node", "server.js"]
This image installs only production dependencies, copies the application code, and includes the model/ directory containing gliner_medium-v2.1.onnx.
Create a .dockerignore file to keep the image smaller and avoid copying local-only files:
node_modules
npm-debug.log
.env
.git
.idea
.vscode
Build the image:
docker build -t pii-proxy .
Run the container and pass the Gemini API key at runtime:
docker run -p 3001:3001 -e GEMINI_API_KEY=$GEMINI_API_KEY pii-proxy
The API key is supplied as an environment variable and is not baked into the image.
The AI proxy pattern reduces PII exposure, but it introduces its own operational constraints.
First-run latency: The model has to initialize before the first detection call. Depending on hardware and model size, this can add a few seconds to the first request. Warm the model at startup or during readiness checks if cold-start latency matters.
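One way to absorb that cost is to run a throwaway detection when the server starts, mirroring the warm-up in the test suite. This sketch assumes server.js also imports detectEntities from ./ner.js:

```js
app.listen(3001, async () => {
  console.log("Proxy running on http://localhost:3001");
  await detectEntities("warm-up"); // loads GLiNER before the first real prompt arrives
});
```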
Runtime performance: The JavaScript GLiNER path uses ONNX-based inference. It is practical for short prompts, but long documents, transcripts, and multi-page contracts require more careful throughput testing. For high-volume workloads, consider worker pools, native ONNX Runtime bindings, or a separate redaction service.
Detection gaps: No automated detector catches every privacy risk. GLiNER and regex can identify common PII, but domain-specific identifiers, rare formats, and implied PII may still slip through. Add custom regex patterns, expand fixtures, and review logs using sanitized samples rather than raw production prompts.
Entity map state: The entity map must be available to the same process that rehydrates the model response. In a simple synchronous Express request, that is straightforward. In a distributed system with queues, retries, or load-balanced workers, store the map in a short-lived encrypted store, such as Redis with a TTL, keyed by request ID. Delete it immediately after rehydration.
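A minimal sketch of that idea with the node-redis client might look like the following; the requestId key and five-minute TTL are assumptions, and encrypting the stored values is left out of the sketch:

```js
import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// After scrubbing, persist the entity map for this request with a short TTL.
await redis.set(`pii:${requestId}`, JSON.stringify(context.entityMap), { EX: 300 });

// In the worker that rehydrates the response:
const stored = await redis.get(`pii:${requestId}`);
const entityMap = stored ? JSON.parse(stored) : {};
// ...restore placeholders using entityMap...
await redis.del(`pii:${requestId}`); // delete immediately after rehydration
```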
Debugging risk: Debug output is useful during development, but entityMap contains real sensitive values. Keep debug data out of production responses, logs, analytics tools, and error trackers.
Model behavior: Placeholder tokens work well when the cloud model preserves them. For more complex workflows, add tests that verify the model response still includes the expected tokens before rehydration. This matters especially as teams add routing, tool calls, or other production LLM orchestration around the proxy. You may also add system instructions that explicitly tell the model to preserve placeholders exactly.
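A sketch of that nudge with the @google/genai SDK follows; the instruction wording is an assumption, not a tested prompt:

```js
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: cleanText,
  config: {
    systemInstruction:
      "Tokens that look like <TYPE_N> are placeholders for redacted values. " +
      "Reproduce them exactly as written and never invent new ones.",
  },
});
```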
Using cloud LLMs does not have to mean sending raw sensitive data to a model provider. A local AI proxy gives you a practical privacy buffer: detect PII locally, replace it with stable placeholders, send only sanitized text to the cloud model, and restore the original values before the user sees the response.
The Node.js proxy in this tutorial is intentionally small, but it covers the production shape of the pattern. GLiNER handles context-sensitive entity detection, regex catches structured identifiers, the request context keeps original values local, and the leak-check test suite turns privacy expectations into CI-enforced behavior.
From here, harden the proxy around your own data. Add fixtures from realistic prompts, tune the detection threshold, define domain-specific regex patterns, and decide whether your deployment needs a distributed entity map. The goal is not perfect redaction in the abstract; it is a privacy boundary you can test, monitor, and improve before sensitive data reaches a cloud LLM.
