Cloud LLMs are often the best option for product quality, but they are difficult to use safely when prompts may contain personally identifiable information (PII). A support summary, patient note, onboarding workflow, or internal ticket can easily include names, emails, phone numbers, account identifiers, or other sensitive values that should not be sent to a third-party model provider.
A local AI proxy gives teams a practical middle layer. Instead of choosing between a powerful cloud model and a weaker fully local model, you can inspect prompts inside your own environment, replace sensitive values with placeholders, send only the sanitized prompt to the cloud LLM, and restore the original values before returning the response.
In this tutorial, we’ll build that pattern in Node.js. The proxy will use a local small language model (SLM) for PII detection, forward the sanitized prompt to Gemini, and rehydrate the response before sending it back to the user. You’ll also add a deterministic leak-check test suite so CI fails when known PII stops being detected.
By the end, you’ll have:
- An Express proxy endpoint that scrubs prompts locally before every Gemini call
- A detection pipeline that combines GLiNER with regex patterns for structured values
- A request-scoped entity map that rehydrates responses and is destroyed after each request
- A deterministic leak-check test suite that fails CI when known PII stops being detected
- A Dockerfile for shipping the proxy together with the local model
Cloud-only AI workflows usually fail in one of three ways: teams send sensitive user data to a third-party model, avoid cloud models entirely and accept lower-quality local output, or rely on brittle sanitization logic that misses context-dependent PII.
The first failure is the most obvious. If raw prompts include customer names, employee records, patient details, or account identifiers, sending those prompts to a cloud LLM may create privacy, compliance, and contractual risk. This is the same architectural tension behind many local-first AI systems: the model should be powerful enough to be useful, but the data boundary should remain explicit. Even when the provider has strong security controls, many organizations still need to minimize what leaves their network.
The second failure is more subtle. Some teams respond by banning cloud LLMs outright and moving every task to local models. That can work for narrow use cases, but it often hurts product quality when the task requires stronger reasoning, better instruction following, or broader language coverage.
The third failure is overconfidence in simple redaction. Regex can catch predictable strings like jane.doe@example.com, but it cannot reliably identify Sarah Chen as a person, distinguish Lincoln the person from Lincoln Memorial, or catch organization-specific identifiers without custom rules.
The AI proxy pattern gives you another option: keep sensitive values local while still using a stronger cloud LLM for the actual generation step.
A local AI proxy is middleware that sits between your application and an LLM provider. Every outbound prompt passes through the proxy before it reaches the cloud model.
For PII redaction, the proxy does four things:
- Detects sensitive spans in the raw prompt with a local model and regex patterns
- Replaces each detected value with a placeholder token
- Sends only the sanitized prompt to the cloud LLM
- Restores the original values in the model’s response before returning it to the user
The key component is a local SLM. In this tutorial, we’ll use GLiNER, a lightweight NER model that can recognize entity types you provide at inference time. Instead of asking a general-purpose model to generate a redaction plan, GLiNER returns entity spans with labels and offsets. That is a better fit for security-sensitive preprocessing because the output is structured and easier to validate.
We’ll also keep regex in the pipeline. GLiNER is useful for contextual entities like people, places, and organizations, while regex is still the right tool for highly structured values like email addresses and many phone number formats.
The local proxy pattern is not the only way to reduce PII exposure. It is most useful when you need cloud-model quality but cannot send raw prompts directly to a model provider.
| Approach | How it works | Where it helps | Main limitation |
|---|---|---|---|
| Cloud-only LLM | Send raw prompts directly to a hosted model | Fastest path to high-quality output | Sensitive values may leave your environment |
| Local-only LLM | Run the full generation model locally | Strongest data locality | Lower output quality or higher infrastructure cost for complex tasks |
| Regex-only redaction | Replace known structured patterns before sending prompts | Emails, phone numbers, IDs with predictable formats | Misses context-dependent PII like names |
| Local AI proxy | Detect and mask PII locally, then call a cloud LLM with placeholders | Balances privacy and output quality | Adds latency, state management, and test requirements |
Use the proxy pattern when cloud LLM quality matters, but raw user data should not leave your infrastructure. Avoid it when the task requires the model to reason over the exact sensitive value itself, or when fully local generation already meets the product requirement.
Before writing code, it helps to understand the request lifecycle. Each user prompt moves through four stages:
1. Detection: the local SLM and regex patterns identify sensitive spans in the raw prompt.
2. Masking: each detected value is replaced with a typed placeholder such as <PERSON_0> or <EMAIL_0>, and the proxy records the mapping in a short-lived request context. Duplicate values receive the same token.
3. Cloud call: Gemini receives only the sanitized prompt, for example "Please copy <PERSON_0> on that reply", and generates a response using those placeholders.
4. Rehydration: the proxy swaps the placeholders back to the original values before returning the response to the user.
The critical privacy boundary is the model request. Gemini never sees the original values, only the placeholder tokens. The originals stay in process memory for the duration of the request.
Classic NER models are usually trained around a fixed set of labels, such as PER, ORG, and LOC. GLiNER is more flexible: you provide labels at inference time, such as person, email, credit_card, or account, and the model predicts spans matching those labels.
That makes GLiNER useful for PII detection because you can expand or narrow the entity list without retraining the model. It also avoids the fragility of prompting a general-purpose SLM to return redaction output. A general model might produce inconsistent JSON, include explanations, or hallucinate an entity that was not in the prompt. GLiNER’s job is narrower: return labeled spans.
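To make that concrete, here is the rough shape of the spans this tutorial relies on (label, start, end). The sentence and offsets below are illustrative, and the library may return additional fields:

```js
// Input: "Email Sarah Chen at sarah.chen@example.com today."
// Illustrative spans for labels supplied at inference time:
const spans = [
  { label: "person", start: 6, end: 16 }, // "Sarah Chen"
  { label: "email", start: 20, end: 42 }, // "sarah.chen@example.com"
];
```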
That said, GLiNER does not replace deterministic checks. For obvious structured patterns, regex remains faster and easier to reason about. The strongest pipeline uses both: NER for context-sensitive values and regex for predictable identifiers.
We’ll use the following stack:
- gliner for running the local GLiNER model from Node.js
- @google/genai, Google’s current Gen AI JavaScript SDK, for Gemini calls
- The node:test runner for leak-check tests
The examples use Node.js’s built-in --env-file flag to load environment variables from .env, so you do not need dotenv. The flag was introduced in Node.js v20.6.0 and is available in Node.js v22.
Create the project directory and move into it:
mkdir pii-proxy
cd pii-proxy
Initialize the Node.js project:
npm init -y
Open package.json and add "type": "module" so the project can use ES module syntax. While you’re there, replace the placeholder test script with scripts for development, production, and tests:
{
"name": "pii-proxy",
"type": "module",
"version": "1.0.0",
"description": "Local AI proxy for PII redaction before cloud LLM calls",
"main": "server.js",
"scripts": {
"dev": "node --env-file=.env --watch server.js",
"start": "node --env-file=.env server.js",
"test": "node --test test/leak-check.js"
},
"keywords": [],
"author": "",
"license": "ISC"
}
Install the required dependencies:
npm install express cors gliner @google/genai
Here’s what each dependency does:
express provides the HTTP server frameworkcors enables cross-origin requests, which is useful if a frontend runs on a different portgliner runs the GLiNER model from Node.js@google/genai provides the current Google Gen AI JavaScript SDK for GeminiCreate a .env file for your Gemini API key:
echo "GEMINI_API_KEY=your_actual_api_key_here" > .env
Replace your_actual_api_key_here with a valid key from Google AI Studio. Never commit .env to source control.
The GLiNER ONNX model must be available before the server can run. Visit the onnx-community/gliner_medium-v2.1 model page on Hugging Face, open the onnx/ directory, and download the ONNX model file. For example, you can use model_int8.onnx.
Rename the downloaded file to gliner_medium-v2.1.onnx, then create a model directory in the project root and place the renamed file there:
mkdir model
# Move the downloaded file into model/gliner_medium-v2.1.onnx
The tokenizer files can still be downloaded and cached automatically from Hugging Face on first use, but the larger ONNX model weights will load from the local file you just added.
The NER pipeline is the local detection layer. It reads raw text, identifies sensitive spans, and returns the offsets needed for masking.
Create a new file named ner.js. Start by importing the Gliner class and declaring a module-level variable for the model instance:
import { Gliner } from "gliner/node";
let glinerInstance = null;
The glinerInstance variable starts as null. It is populated the first time the model is needed, which avoids loading the model during module import.
Next, add an asynchronous helper that lazily initializes and returns the GLiNER instance:
async function getGliner() {
if (!glinerInstance) {
glinerInstance = new Gliner({
tokenizerPath: "onnx-community/gliner_medium-v2.1",
onnxSettings: {
modelPath: "model/gliner_medium-v2.1.onnx",
},
});
await glinerInstance.initialize();
}
return glinerInstance;
}
The tokenizerPath points to the Hugging Face repository that contains the tokenizer files. The onnxSettings.modelPath value points to the local ONNX model file.
Now define the entity labels and regex patterns the pipeline should detect:
const ENTITY_TYPES = [
"person",
"email",
"phone",
"address",
"city",
"state",
"country",
"zipcode",
"ip_address",
"national_id",
"user_id",
"credit_card",
"account",
"token",
];
const REGEX_PATTERNS = [
{
type: "EMAIL",
pattern: /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g,
},
{
type: "PHONE",
pattern: /\b(?:\+?\d{1,3}[\s.\-]?)?(?:\(?\d{2,4}\)?[\s.\-]?)?\d{3,4}[\s.\-]?\d{3,4}\b/g,
},
];
You can extend or reduce ENTITY_TYPES depending on your product’s privacy requirements. For example, an internal support tool may need custom labels or regex patterns for employee IDs, ticket IDs, or account numbers.
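As a sketch, a hypothetical employee-ID format such as EMP-482913 could be covered with one extra regex entry:

```js
// Hypothetical pattern for internal employee IDs like "EMP-482913".
REGEX_PATTERNS.push({
  type: "EMPLOYEE_ID",
  pattern: /\bEMP-\d{6}\b/g,
});
```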
Add a helper function to validate email strings and filter out obvious false positives:
function isValidEmail(email) {
if (/\.{2,}/.test(email)) return false;
if (/^\.|\.$/.test(email)) return false;
const atIndex = email.indexOf("@");
if (atIndex === -1) return false;
const local = email.slice(0, atIndex);
const domain = email.slice(atIndex + 1);
if (local.length === 0 || domain.length === 0) return false;
if (!domain.includes(".")) return false;
if (/[^a-zA-Z0-9._%+\-]/.test(local)) return false;
if (/[^a-zA-Z0-9.\-]/.test(domain)) return false;
return true;
}
This function rejects invalid email candidates before they can be masked. It applies to both model-detected email spans and regex matches.
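A few illustrative calls show what the helper accepts and rejects:

```js
isValidEmail("jane.doe@example.com");  // true
isValidEmail("release@v2");            // false: the domain has no dot
isValidEmail("foo..bar@example.com");  // false: consecutive dots
```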
Now write the main detectEntities function:
export async function detectEntities(text) {
const gliner = await getGliner();
const options = {
flatNer: true,
threshold: 0.1,
multiLabel: false,
};
const results = await gliner.inference({
texts: [text],
entities: ENTITY_TYPES,
...options,
});
const glinerSpans = results[0] || [];
const glinerEntities = glinerSpans
.map((span) => ({
type: span.label.toUpperCase(),
start: span.start,
end: span.end,
text: text.slice(span.start, span.end),
}))
.filter((entity) => {
if (entity.type === "EMAIL") {
return isValidEmail(entity.text);
}
return true;
})
.map(({ type, start, end }) => ({ type, start, end }));
const regexEntities = [];
for (const { type, pattern } of REGEX_PATTERNS) {
for (const match of text.matchAll(pattern)) {
const candidate = match[0];
if (type === "EMAIL" && !isValidEmail(candidate)) {
continue;
}
regexEntities.push({
type,
start: match.index,
end: match.index + candidate.length,
});
}
}
return mergeSpans([...glinerEntities, ...regexEntities]);
}
The threshold value is intentionally low because false negatives are more dangerous than false positives in a privacy boundary. In production, tune this threshold against your own fixture set rather than treating 0.1 as a universal default.
Finally, add mergeSpans to remove overlapping detections:
function mergeSpans(entities) {
const sorted = [...entities].sort(
(a, b) => a.start - b.start || b.end - a.end
);
const merged = [];
for (const entity of sorted) {
const last = merged[merged.length - 1];
if (last && entity.start < last.end) continue;
merged.push(entity);
}
return merged;
}
The sort keeps earlier spans first. If two spans begin at the same index, the longer span wins. That prevents overlapping replacements from corrupting the masked prompt.
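Calling the helper directly shows the effect; the spans below are hypothetical:

```js
mergeSpans([
  { type: "PERSON", start: 6, end: 16 }, // "Sarah Chen" from GLiNER
  { type: "PERSON", start: 6, end: 11 }, // "Sarah", an overlapping duplicate
  { type: "EMAIL", start: 20, end: 42 },
]);
// => [ { type: "PERSON", start: 6, end: 16 }, { type: "EMAIL", start: 20, end: 42 } ]
```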
The entity map links placeholder tokens to original PII. It should only exist for the duration of one request.
Create a file named context.js and add the following RequestContext class:
export class RequestContext {
constructor() {
this.entityMap = {};
}
store(token, original) {
this.entityMap[token] = original;
}
restore(text) {
let result = text;
for (const [token, original] of Object.entries(this.entityMap)) {
result = result.replaceAll(token, original);
}
return result;
}
destroy() {
for (const key of Object.keys(this.entityMap)) {
delete this.entityMap[key];
}
}
}
The store method records a placeholder-to-original mapping. The restore method replaces placeholder tokens in the LLM response. The destroy method explicitly clears the map after the request completes.
In a simple Express handler, the context object would be eligible for garbage collection after the response returns. Clearing it manually adds a defensive layer in case another reference, such as a logger or debugging tool, accidentally persists longer than expected.
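A quick round trip shows how the three methods fit together:

```js
const context = new RequestContext();
context.store("<PERSON_0>", "Sarah Chen");
context.restore("Thanks for the update, <PERSON_0>!");
// => "Thanks for the update, Sarah Chen!"
context.destroy(); // entityMap is now empty
```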
Create a file named privacy.js. This file handles masking and rehydration.
Start by importing the detection function and defining scrub:
import { detectEntities } from "./ner.js";
export async function scrub(text, context) {
const entities = await detectEntities(text);
if (!entities.length) {
return text;
}
const textToToken = new Map();
const typeCounts = {};
for (const entity of entities) {
const original = text.slice(entity.start, entity.end);
if (!textToToken.has(original)) {
const count = typeCounts[entity.type] ?? 0;
typeCounts[entity.type] = count + 1;
const token = `<${entity.type}_${count}>`;
textToToken.set(original, token);
context.store(token, original);
}
}
const sorted = [...entities].sort((a, b) => b.start - a.start);
let result = text;
for (const entity of sorted) {
const original = text.slice(entity.start, entity.end);
const token = textToToken.get(original);
if (token) {
result = result.slice(0, entity.start) + token + result.slice(entity.end);
}
}
return result;
}
This function deduplicates identical PII strings. If the same email appears twice in a prompt, both occurrences receive the same placeholder token.
The replacement loop runs from right to left. That preserves the original character offsets because replacing a later span cannot shift the position of an earlier span.
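As an illustration, assuming the detector flags both occurrences of the email and nothing else, the same prompt masks like this (inside an async function):

```js
const context = new RequestContext();
const masked = await scrub(
  "Email jane.roe@example.com today and cc jane.roe@example.com on the invoice.",
  context
);
// masked === "Email <EMAIL_0> today and cc <EMAIL_0> on the invoice."
// context.entityMap === { "<EMAIL_0>": "jane.roe@example.com" }
```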
Now add the rehydrate helper:
export function rehydrate(text, context) {
return context.restore(text);
}
This small wrapper keeps the privacy layer responsible for both directions of the transformation: scrubbing outbound text and rehydrating inbound text.
Create a file named server.js. This file sets up the Express server, calls the privacy layer, forwards sanitized prompts to Gemini, and returns the rehydrated response.
Start with the imports and middleware configuration:
import express from "express";
import cors from "cors";
import { GoogleGenAI } from "@google/genai";
import { scrub, rehydrate } from "./privacy.js";
import { RequestContext } from "./context.js";
if (!process.env.GEMINI_API_KEY) {
throw new Error("GEMINI_API_KEY is required");
}
const app = express();
app.use(express.json());
app.use(cors());
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
The explicit API key check fails fast if the environment is misconfigured. That is preferable to starting a server that only fails once the first prompt arrives.
Add the main /prompt endpoint:
app.post("/prompt", async (req, res) => {
const { message } = req.body;
if (!message || typeof message !== "string") {
return res.status(400).json({ error: "message must be a non-empty string" });
}
const context = new RequestContext();
try {
const cleanText = await scrub(message, context);
const response = await ai.models.generateContent({
model: "gemini-2.5-flash",
contents: cleanText,
});
const finalReply = rehydrate(response.text, context);
return res.json({
reply: finalReply,
debugInfo:
process.env.NODE_ENV !== "production"
? { cleanText, entityMap: context.entityMap }
: undefined,
});
} catch (error) {
console.error("Prompt proxy failed", error);
return res.status(500).json({ error: "Failed to process prompt" });
} finally {
context.destroy();
}
});
The endpoint validates the incoming payload, creates a request-scoped RequestContext, scrubs the prompt, sends only the scrubbed prompt to Gemini, and rehydrates the response.
The debugInfo field is available only outside production. Treat it carefully: entityMap contains the original sensitive values and should never be exposed in production responses or logs.
Finally, start the server on port 3001:
app.listen(3001, () => {
console.log("Proxy running on http://localhost:3001");
});
Run the server in development mode:
npm run dev
The dev script uses node --env-file=.env --watch server.js. The --env-file flag loads GEMINI_API_KEY before the app starts, and --watch restarts the server when imported files change.
Test the proxy with curl:
curl -s -X POST http://localhost:3001/prompt \
-H "Content-Type: application/json" \
-d '{"message": "Draft a meeting invite for Sarah Chen at [email protected], next Tuesday at 2pm."}' \
| jq .
In a non-production response, inspect debugInfo.cleanText. You should see placeholders rather than raw PII:
{
"reply": "Subject: Meeting Invite – Tuesday at 2:00 PM\n\nHi Sarah,\n\nI'd like to schedule a meeting for next Tuesday at 2:00 PM. Please find the calendar invite attached.\n\nLooking forward to connecting.\n\nBest regards",
"debugInfo": {
"cleanText": "Draft a meeting invite for <PERSON_0> at <EMAIL_0>, next Tuesday at 2pm.",
"entityMap": {
"<PERSON_0>": "Sarah Chen",
"<EMAIL_0>": "[email protected]"
}
}
}
The cloud model receives cleanText, not the original prompt. The user receives the rehydrated response.
The proxy can work locally and still regress later. A threshold change, regex edit, dependency update, or model swap can reintroduce leaks. A small deterministic test suite gives you a CI gate for known PII cases.
Create a test directory:
mkdir test
Inside test, create pii-fixtures.js:
export const fixtures = [
{ text: "Please forward this to Alexandra Kovacs.", type: "PERSON", shouldRedact: true },
{ text: "The package is addressed to James O'Brien.", type: "PERSON", shouldRedact: true },
{ text: "CC: Dr. Yuki Tanaka on all replies.", type: "PERSON", shouldRedact: true },
{ text: "Ask John in accounting to approve it.", type: "PERSON", shouldRedact: true },
{ text: "Send the invoice to [email protected].", type: "EMAIL", shouldRedact: true },
{ text: "My work email is [email protected].", type: "EMAIL", shouldRedact: true },
{ text: "Call me at +1 (415) 555-0192.", type: "PHONE", shouldRedact: true },
{ text: "Fax: 020 7946 0988.", type: "PHONE", shouldRedact: true },
{ text: "The conference is held in Berlin.", type: "LOCATION", shouldRedact: false },
{ text: "The CEO signed off on Thursday.", type: "PERSON", shouldRedact: false },
{ text: "Our office is on Lincoln Avenue.", type: "LOCATION", shouldRedact: false },
];
The shouldRedact: false cases are as important as the positive cases. They catch over-scrubbing, such as mistaking a job title for a person.
Now create test/leak-check.js:
import { describe, it, before } from "node:test";
import assert from "node:assert/strict";
import { detectEntities } from "../ner.js";
import { fixtures } from "./pii-fixtures.js";
describe("PII leak-check", () => {
before(async () => {
await detectEntities("warm-up");
});
for (const fixture of fixtures) {
it(`${fixture.shouldRedact ? "detects" : "ignores"}: "${fixture.text.slice(0, 55)}"`, async () => {
const entities = await detectEntities(fixture.text);
const detectedTypes = entities.map((entity) => entity.type);
const wasDetected = detectedTypes.includes(fixture.type);
if (fixture.shouldRedact) {
assert.ok(
wasDetected,
`LEAK: "${fixture.text}": expected ${fixture.type} to be detected but it was not.`
);
} else {
assert.ok(
!detectedTypes.includes("PERSON"),
`OVER-SCRUB: "${fixture.text}": PERSON was detected but this sentence contains no name.`
);
}
});
}
});
The before hook warms up the model before the individual test cases run. This keeps model initialization from making the first fixture look unusually slow.
Run the tests:
npm test
If the model incorrectly labels a safe sentence as a person, the test output will identify the failing fixture:
✖ ignores: "The CEO signed off on Thursday." (482ms)
  AssertionError [ERR_ASSERTION]: OVER-SCRUB: "The CEO signed off on Thursday.": PERSON was detected but this sentence contains no name.
If you see over-scrubbing, tune the threshold value in ner.js or add post-filtering rules for known false positives. If you see leaks, add fixtures that reproduce them, then adjust the entity labels, regex patterns, or model choice until the test passes.
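One way to post-filter is a small denylist applied before detectEntities returns; the denylist contents here are examples, not a complete list of the titles your data will produce:

```js
// Drop PERSON spans whose text is actually a common role or department name.
const PERSON_DENYLIST = new Set(["ceo", "cto", "hr", "accounting"]);

function dropFalsePersons(entities, text) {
  return entities.filter((entity) => {
    if (entity.type !== "PERSON") return true;
    const span = text.slice(entity.start, entity.end).trim().toLowerCase();
    return !PERSON_DENYLIST.has(span);
  });
}

// In detectEntities, wrap the return value:
// return dropFalsePersons(mergeSpans([...glinerEntities, ...regexEntities]), text);
```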
For production, containerize the proxy and copy the local GLiNER model into the image. Create a Dockerfile in the project root:
FROM node:22-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3001
CMD ["node", "server.js"]
This image installs only production dependencies, copies the application code, and includes the model/ directory containing gliner_medium-v2.1.onnx.
Create a .dockerignore file to keep the image smaller and avoid copying local-only files:
node_modules
npm-debug.log
.env
.git
.idea
.vscode
Build the image:
docker build -t pii-proxy .
Run the container and pass the Gemini API key at runtime:
docker run -p 3001:3001 -e GEMINI_API_KEY=$GEMINI_API_KEY pii-proxy
The API key is supplied as an environment variable and is not baked into the image.
The AI proxy pattern reduces PII exposure, but it introduces its own operational constraints.
First-run latency: The model has to initialize before the first detection call. Depending on hardware and model size, this can add a few seconds to the first request. Warm the model at startup or during readiness checks if cold-start latency matters.
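One way to absorb that cost is to run a throwaway detection when the server starts, mirroring the warm-up in the test suite. This sketch assumes server.js also imports detectEntities from ./ner.js:

```js
app.listen(3001, async () => {
  console.log("Proxy running on http://localhost:3001");
  await detectEntities("warm-up"); // loads GLiNER before the first real prompt arrives
});
```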
Runtime performance: The JavaScript GLiNER path uses ONNX-based inference. It is practical for short prompts, but long documents, transcripts, and multi-page contracts require more careful throughput testing. For high-volume workloads, consider worker pools, native ONNX Runtime bindings, or a separate redaction service.
Detection gaps: No automated detector catches every privacy risk. GLiNER and regex can identify common PII, but domain-specific identifiers, rare formats, and implied PII may still slip through. Add custom regex patterns, expand fixtures, and review logs using sanitized samples rather than raw production prompts.
Entity map state: The entity map must be available to the same process that rehydrates the model response. In a simple synchronous Express request, that is straightforward. In a distributed system with queues, retries, or load-balanced workers, store the map in a short-lived encrypted store, such as Redis with a TTL, keyed by request ID. Delete it immediately after rehydration.
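A minimal sketch of that idea with the node-redis client might look like the following; the requestId key and five-minute TTL are assumptions, and encrypting the stored values is left out of the sketch:

```js
import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// After scrubbing, persist the entity map for this request with a short TTL.
await redis.set(`pii:${requestId}`, JSON.stringify(context.entityMap), { EX: 300 });

// In the worker that rehydrates the response:
const stored = await redis.get(`pii:${requestId}`);
const entityMap = stored ? JSON.parse(stored) : {};
// ...restore placeholders using entityMap...
await redis.del(`pii:${requestId}`); // delete immediately after rehydration
```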
Debugging risk: Debug output is useful during development, but entityMap contains real sensitive values. Keep debug data out of production responses, logs, analytics tools, and error trackers.
Model behavior: Placeholder tokens work well when the cloud model preserves them. For more complex workflows, add tests that verify the model response still includes the expected tokens before rehydration. This matters especially as teams add routing, tool calls, or other production LLM orchestration around the proxy. You may also add system instructions that explicitly tell the model to preserve placeholders exactly.
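A sketch of that nudge with the @google/genai SDK follows; the instruction wording is an assumption, not a tested prompt:

```js
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: cleanText,
  config: {
    systemInstruction:
      "Tokens that look like <TYPE_N> are placeholders for redacted values. " +
      "Reproduce them exactly as written and never invent new ones.",
  },
});
```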
Using cloud LLMs does not have to mean sending raw sensitive data to a model provider. A local AI proxy gives you a practical privacy buffer: detect PII locally, replace it with stable placeholders, send only sanitized text to the cloud model, and restore the original values before the user sees the response.
The Node.js proxy in this tutorial is intentionally small, but it covers the production shape of the pattern. GLiNER handles context-sensitive entity detection, regex catches structured identifiers, the request context keeps original values local, and the leak-check test suite turns privacy expectations into CI-enforced behavior.
From here, harden the proxy around your own data. Add fixtures from realistic prompts, tune the detection threshold, define domain-specific regex patterns, and decide whether your deployment needs a distributed entity map. The goal is not perfect redaction in the abstract; it is a privacy boundary you can test, monitor, and improve before sensitive data reaches a cloud LLM.
