Droid Shield 2.0: learned secret detection

By Factory Research, Octavian Sima - July 1, 2026 - 9 minute read

Factory's Droids write, refactor, and commit code autonomously at a volume that no human reviewer can completely monitor. Keeping autonomous software engineering safe at enterprise scale means secret detection that is trustworthy but doesn't introduce unnecessary friction.

Droid Shield is an extra line of defense against exposing potential secrets during an autonomous commit and push. It scans each line of a commit for known secret-shaped patterns and for highly random strings that indicate sensitive material. On a suspicious match, it blocks the commit and returns an error to the user. Our customers, including large enterprises, use Shield to keep every phase of their autonomous software factories operating safely and securely.

Limitations

Because the scanner is deterministic, it currently suffers from two failure modes:

  1. False positives: fires on placeholders, examples, fixtures, docs, and non-secret identifiers. This creates friction and trains users to ignore the flag.
  2. False negatives: misses real secrets that do not match its fixed pattern set, reducing confidence for fully autonomous usage patterns.

We're introducing an improved version of Droid Shield with two new fine-tuned models, one for each of these cases. Adjusted to the models' best operating points, they are the strongest or tied-strongest classifiers on both tasks compared to the frontier, at a fraction of the cost and with a significant improvement in latency. We are releasing the model weights in the interest of compounding future work on software security and privacy.

Classification

Our models sit on opposite ends of the deterministic scanner — they don't review the same events and are optimized separately for distinct failure modes.

Risk: runs when the scanner did not fire, but the changed line still looks secret-bearing within the broader context. The catastrophic error is missing a real secret, so we've optimized this model for recall — we accept the tradeoff of introducing some extra warnings if that's what it takes to catch more true positives.

Downgrade: runs when the scanner did fire. It reads a similar context window, but every detected secret candidate is masked before the model sees it. The model must decide from context alone whether the scanner hit should stay blocked or be cleared as a false alarm. As with risk, the catastrophic error is missing a real secret, in this case by clearing one the scanner already caught. We therefore clear a scanner hit only when the model is confident it's a false alarm, trading some false-alarm reduction for a lower risk of letting a real secret through.

Each model emits a binary verdict plus a short user-facing reason. We score each output by the model's block probability, read from the token probabilities exposed at the verdict position, and compare that score against a threshold. The threshold is tunable, so we can reconfigure a gate without retraining.
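
The scoring step can be sketched as follows. This is a minimal illustration, assuming the serving stack returns log probabilities for the two verdict tokens (S and B) at the verdict position. The function names, the two-token renormalization, and the default threshold are assumptions made for the example, not the shipped implementation.

```python
import math


def block_probability(logprob_b: float, logprob_s: float) -> float:
    """Renormalize the two verdict-token log probabilities into P(block)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_b, logprob_s)
    exp_b = math.exp(logprob_b - m)
    exp_s = math.exp(logprob_s - m)
    return exp_b / (exp_b + exp_s)


def gate(logprob_b: float, logprob_s: float, threshold: float = 0.5) -> str:
    """Return 'B' when P(block) meets the threshold, otherwise 'S'.

    The threshold is a plain parameter (hypothetical default of 0.5), so
    moving the operating point never touches the model weights.
    """
    return "B" if block_probability(logprob_b, logprob_s) >= threshold else "S"
```

Lowering the threshold makes the gate block more often, which is how a recall-first operating point would be chosen without retraining.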

How the two models gate a commit

Each learned model sits on one side of the deterministic scanner.

git commit / push
  -> scanner (deterministic pattern set)
       pattern hit -> Downgrade model (window, scrubbed) -> Clear false alarm, or keep it blocked
       broader secret-looking line -> Risk model (window, raw) -> Warn: possible missed secret

Droid Shield 2.0 routes every commit through the deterministic scanner: a pattern hit goes to the downgrade model (scrubbed window), while an even broader filter selects secret-looking lines to go to the risk model (raw window).
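
The routing described above can be sketched in a few lines. The two regular expressions below are toy stand-ins chosen for the example (one key-shaped pattern for the scanner, a keyword list for the broader filter); the real pattern set and filter are not described in this post, and `route_line` is a hypothetical name.

```python
import re

# Toy stand-in for the deterministic scanner: a single key-shaped pattern.
SCANNER_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")
# Toy stand-in for the broader "secret-looking" filter: name keywords only.
BROAD_FILTER = re.compile(r"password|secret|token|api[_-]?key", re.IGNORECASE)


def route_line(line: str) -> str:
    """Decide which gate, if any, should review a changed line."""
    if SCANNER_PATTERN.search(line):
        # Scanner hit: the downgrade model sees a scrubbed window.
        return "downgrade"
    if BROAD_FILTER.search(line):
        # No scanner hit, but the line looks secret-bearing: the risk
        # model sees the raw window.
        return "risk"
    # Neither filter matched: no model is invoked.
    return "pass"
```

The scanner check runs first, so a line is never sent to both models.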

Data

We train and evaluate on data shaped from Samsung's CredData, a public benchmark of real source files and annotated spans. Each example is annotated with T (real secret), F (false positive), or X (unknown).

To improve the quality of the judging used in data generation, we hand-labeled a batch of these unknown samples and included them in both the training and evaluation sets. The hand-labeled samples then serve as grounded reasoning examples for further judging in the data pipeline, including on broader sets of unknown rows.

Risk and downgrade use opposite slices of the same source material:

  • Risk rows: lines the original scanner would not fire on. These keep the raw window, including the candidate value.
  • Downgrade rows: scanner hits. We mask every known credential span across the entire window, so the model learns to judge without ever seeing the secret value(s).
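
A minimal sketch of the masking step for downgrade rows, assuming each known credential span is given as a (line index, start column, end column) triple and is replaced by asterisks of the same length. The span format, the `mask_spans` name, and the length-preserving fill are assumptions for illustration.

```python
from typing import List, Tuple

# (line index, start column, end column); the end column is exclusive.
Span = Tuple[int, int, int]


def mask_spans(lines: List[str], spans: List[Span], fill: str = "*") -> List[str]:
    """Return a copy of the window with every credential span masked."""
    masked = list(lines)  # copy so the caller's window is left untouched
    for line_idx, start, end in spans:
        text = masked[line_idx]
        # Same-length fill keeps column offsets valid for later spans
        # that fall on the same line.
        masked[line_idx] = text[:start] + fill * (end - start) + text[end:]
    return masked
```

Risk rows would skip this step entirely and keep the raw window.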

Every example, in both tasks, has the same shape: an input the model sees and a label it learns to emit. The input is the reconstructed code window (lines), the file extension, and focus_line, the index of the line under judgment (zero-based in the examples below). The label is a binary verdict (S = safe, B = real secret) along with a short reason.

Risk keeps the candidate value in the clear (CredData positives are public and obfuscated); this is what the model has to judge after a scanner miss. In this example we see a TOTP secret that sits in what looks like test code, but the shape looks real enough that it should be flagged for review:

json
{
  "input": {
    "extension": "rb",
    "lines": [
      "  it 'meets basic functionality' do",
      "    otp = ROTP::TOTP.new('ILBLD5WABAUC7QAD')",
      "    uri = described_class.new(otp, account_name: 'alice@google.com')",
      "    expect(uri.to_s).to eq 'otpauth://totp/alice%40google.com?secret=GYKNE3FDCUPM3VEZ'",
      "  end",
      "",
      "  it 'includes issuer' do"
    ],
    "focus_line": 3
  },
  "label": {
    "verdict": "B",
    "reason": "Focused line includes an otpauth URI containing a visible 16-character base32-like TOTP secret. Although the surrounding lines are an RSpec test, the value is substantial and credential-shaped, so it warrants warning."
  }
}

Downgrade is the opposite: the scanner already fired, so every detected secret is masked before the model sees it. The model must decide from context alone whether the masked hit was a real secret or a false alarm:

json
{
  "input": {
    "extension": "<none>",
    "lines": [
      "DATABASE_URL=********************************************/postgres",
      ""
    ],
    "focus_line": 0
  },
  "label": {
    "verdict": "B",
    "reason": "Should remain blocked because the focused line assigns a masked DATABASE_URL value, which is a plausible database connection secret and lacks clear placeholder or non-production context."
  }
}

For risk, our training set contains 5,000 rows: 2,000 real secrets (40%), 2,000 false alarms (40%), and 1,000 unknowns (20%). The eval set contains 427 real secrets and 427 false alarms. For security reasons, we chose to keep this balanced rather than attempting to mine a production shape.

For downgrade, our training set contains 6,776 rows: 4,726 real secrets (69.7%), 1,940 false alarms (28.6%), and 110 unknowns (1.6%). The eval set contains 90 real secrets and 340 false alarms, about 21% real secrets, matching an aggregate production base rate we've sampled.

Training and evaluation mixes are chosen to match a file-type profile derived from aggregate session data, so the models see the same file-type skew (ts, py, md, tsx, and so on) they will meet in real diffs and learn the different language shapes that surround secrets. An LLM judge ensemble (GPT-5.5 as the primary judge, Opus 4.8 adjudicating) assigns the training verdict and reasoning. We ground this in the actual CredData source of truth to minimize the potential for judging errors and to help the fine-tune learn from a deterministic source.

Risk spans the entire CredData taxonomy, skewing toward passwords. This makes sense: passwords don't have any pre-defined entropy or format, so they are harder to deterministically catch:

Credential categories (risk set)

Category           Share of candidate spans
Password           43%
Key                20%
Token              8%
Secret             8%
Auth               7%
other (29 more)    14%

The downgrade set sees masked scanner hits, so its redacted credentials skew towards those with deterministic shapes. It leans heavily on keys and secrets:

Redacted credential categories (downgrade set)

Category           Share of masked spans
Key                76%
Secret             13%
Password           3%
URL Credentials    3%
Token              2%
other (14 more)    3%

On privacy: Real sessions are never training rows. The only production signals used are aggregate class priors, sampled across enough distinct sessions that they carry no identifying information. These are used to calibrate the curriculum and evaluation mix without ever exposing user content.

Training and evaluation

We chose Qwen 3.6 35B A3B as our base model due to its coding strength, reasoning capabilities, and improved cost/latency in comparison to the frontier. For future work, we intend to look into the performance of other even smaller LMs, including Nemotron 3 Nano.

Our final models are a rank-16 LoRA adapter for risk and a rank-64 adapter for downgrade.

  • We experimented with increased rank for both and saw the largest gain on downgrade, where rank 64 showed significant improvement over lower-rank adapters on our evaluation set:

Downgrade ranking quality by LoRA rank

LoRA rank    ROC-AUC (repo-level holdout)
rank 8       0.697
rank 32      0.748
rank 64      0.845

We run evaluations on repo-level holdouts, where whole repositories from the source data are held out of training.

  • We considered a file-level holdout that keeps those same repositories in training, but found that it carries too much cross-contamination between train and eval, so we do not report it here.
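
A repo-level holdout can be sketched as a group split keyed on the repository. The `repo` field and the function name are hypothetical; the point is only that every row from a held-out repository lands in evaluation, so no repository contributes to both sides.

```python
from typing import Dict, Iterable, List, Set, Tuple


def repo_level_split(
    rows: Iterable[Dict], holdout_repos: Set[str]
) -> Tuple[List[Dict], List[Dict]]:
    """Split rows so that held-out repositories never appear in training."""
    train: List[Dict] = []
    evalset: List[Dict] = []
    for row in rows:
        # Assign by repository, not by file or row, to avoid leakage.
        if row["repo"] in holdout_repos:
            evalset.append(row)
        else:
            train.append(row)
    return train, evalset
```

A file-level split would instead key on the file path, which lets files from the same repository appear on both sides.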

On our choice of risk evaluation split

As mentioned earlier, our risk eval's 427 / 427 split is deliberately balanced. We did not attempt to aggregate a real production shape, though we estimate it would match existing secret-detection and PII problem shapes, where true positives are typically a very small fraction of candidates.

We report the metrics that survive the balanced setup:

  • Recall: of the real secrets in the holdout, how many did the model catch?
  • FPR: of the non-secrets in the holdout, how many did it incorrectly warn on?
  • ROC-AUC: how well does the model rank real secrets above non-secrets?

Recall reads only the secrets and FPR reads only the non-secrets, and ROC-AUC is built from those two rates across thresholds, so changing how much of the other side we include doesn't move them. In other words, these metrics are prevalence-invariant: they carry over from the balanced split to whatever the true production rate would be.

Precision and PR-AUC are not as faithful. For example, at ~0.1% prevalence even a small FPR buries the real secrets in false positives: hold recall at 70% and FPR at 3%, and precision falls from ~96% on the balanced split to ~2% without the model changing at all. Unlike downgrade, whose eval already matches an estimated production base rate, we cannot claim a reliable precision number for risk from this eval.
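
The arithmetic behind that precision drop follows from Bayes' rule and can be checked directly. The function name is ours; the inputs are the recall, FPR, and prevalence figures quoted above.

```python
def precision_at_prevalence(recall: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed recall and FPR at a given base rate."""
    true_positives = recall * prevalence            # share of all rows
    false_positives = fpr * (1.0 - prevalence)      # share of all rows
    return true_positives / (true_positives + false_positives)


balanced = precision_at_prevalence(0.70, 0.03, 0.5)     # balanced eval split
rare = precision_at_prevalence(0.70, 0.03, 0.001)       # ~0.1% prevalence
print(f"balanced: {balanced:.3f}, rare: {rare:.3f}")
```

Recall and FPR are identical in both calls; only the base rate changes, which is why precision from a balanced eval does not transfer.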

Results

Fine-tuning improves the classification on our held-out evaluation for both use-cases:

Fine-tuning gains

ROC-AUC, base model vs. fine-tuned adapter

Model                    ROC-AUC
Risk, base               0.925
Risk, fine-tuned         0.975
Downgrade, base          0.707
Downgrade, fine-tuned    0.845

In evaluation against the frontier (GPT-5.5 and Opus 4.8 with default reasoning levels):

Risk ranking quality

ROC-AUC vs. the frontier

Model              ROC-AUC
Fine-tuned LoRA    0.975
Opus 4.8           0.961
GPT-5.5            0.948
Base Qwen 3.6      0.925

Downgrade ranking quality

ROC-AUC vs. the frontier

Model              ROC-AUC
Fine-tuned LoRA    0.845
GPT-5.5            0.819
Opus 4.8           0.800
Base Qwen 3.6      0.707

Both gates optimize for recall, but each uses a different tradeoff mechanism. For a fair comparison, and to account for implicit differences in how models report confidence, we place each evaluated model at its own best operating point for the chosen objective.

Risk is recall-first under a false-positive budget, enforced on the Wilson 95% upper bound of the FPR rather than on the point estimate. We report two false-positive budgets: a strict FPR <= 0.05, and the looser FPR <= 0.10 that the shipped risk gate currently runs at. For each model, we keep only the operating points whose upper bound stays within budget and take the best recall it reaches.
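
The Wilson score interval used for the budget check can be sketched as below. The formula is the standard one; the helper names are ours, and the counts in the usage lines (298 of 427 secrets caught, 12 of 427 non-secrets flagged) are back-derived from the reported 0.698 recall and 0.028 FPR, so treat them as an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def within_fpr_budget(false_positives: int, negatives: int, budget: float) -> bool:
    """True when the Wilson upper bound on FPR stays inside the budget."""
    _, upper = wilson_interval(false_positives, negatives)
    return upper <= budget


recall_low, recall_high = wilson_interval(298, 427)   # assumed counts
print(f"recall CI: [{recall_low:.3f}, {recall_high:.3f}]")
print(within_fpr_budget(12, 427, 0.05))               # assumed counts
```

Budgeting on the upper bound is stricter than budgeting on the point estimate, which is why the in-budget FPR values sit well below 0.05 and 0.10.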

At a strict 0.05 budget:

Risk gate: best recall at FPR ≤ 0.05

Wilson 95% CI, n = 427

Model              Recall    95% CI            FPR
Fine-tuned LoRA    0.698     [0.653, 0.740]    0.028
GPT-5.5            0.588     [0.541, 0.634]    0.021
Opus 4.8           0.574     [0.526, 0.620]    0.028
Base Qwen 3.6      0.563     [0.516, 0.610]    0.023

At a more lenient 0.10 budget:

Risk gate: best recall at FPR ≤ 0.10

Wilson 95% CI, n = 427

Model              Recall    95% CI            FPR
Fine-tuned LoRA    0.878     [0.844, 0.906]    0.070
Opus 4.8           0.852     [0.816, 0.883]    0.070
GPT-5.5            0.707     [0.662, 0.748]    0.054
Base Qwen 3.6      0.629     [0.582, 0.674]    0.066

The fine-tuned adapter reaches the highest in-budget recall at both budgets.

  • At the strict 0.05 budget, our adapter is ahead of the GPT-5.5 runner-up (non-overlapping 95% intervals, 0.698 vs 0.588).
  • At the 0.10 budget, our adapter is statistically tied with Opus 4.8 (their intervals overlap, 0.878 vs 0.852).

Downgrade shares the same recall priority, but the catastrophic error is clearing a real secret the scanner already caught. There's no natural false-positive ceiling to budget against because the scanner already blocked these lines; retaining a false alarm is the status quo, not a new cost. Instead, we score each threshold by net utility: the false alarms it clears minus lambda times the real secrets it wrongly clears.

lambda is an exchange rate: how many correctly cleared false alarms one missed secret is worth. At lambda = x, a missed secret costs as much as x good clears. Raising lambda pushes the selected threshold down and trades more retained false alarms for higher secret recall. We sweep lambda over {1, 2, 5, 10} and ship lambda = 5.
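
The net-utility selection can be sketched as a threshold sweep. This assumes a row is cleared when its block probability falls below the threshold; the function names, the candidate grid, and the toy scores in the usage lines are illustrative, not our evaluation data.

```python
from typing import Sequence


def net_utility(
    scores: Sequence[float], is_secret: Sequence[bool], threshold: float, lam: float
) -> float:
    """Cleared false alarms minus lam times wrongly cleared real secrets."""
    cleared = [(s, y) for s, y in zip(scores, is_secret) if s < threshold]
    cleared_false_alarms = sum(1 for _, y in cleared if not y)
    cleared_secrets = sum(1 for _, y in cleared if y)
    return cleared_false_alarms - lam * cleared_secrets


def best_threshold(
    scores: Sequence[float], is_secret: Sequence[bool], lam: float, grid: Sequence[float]
) -> float:
    """Pick the grid threshold with the highest net utility (first on ties)."""
    return max(grid, key=lambda t: net_utility(scores, is_secret, t, lam))


# Toy block probabilities and labels, for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9]
is_secret = [False, False, True, False, False, False, True]
grid = [0.25, 0.55, 0.95]
for lam in (1, 5):
    print(lam, best_threshold(scores, is_secret, lam, grid))
```

On this toy data the selected threshold drops from 0.55 at lambda = 1 to 0.25 at lambda = 5: the costlier a missed secret, the fewer rows the gate is willing to clear.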

At each model's lambda = 5 operating point on the unbalanced holdout:

Downgrade gate: λ = 5 operating point

Same repo-level holdout

Model              Recall    Precision    Clears false alarms
Fine-tuned LoRA    0.856     0.405        0.668
GPT-5.5            0.800     0.471        0.762
Opus 4.8           0.767     0.352        0.626
Base Qwen 3.6      0.589     0.327        0.679

At this cost setting, our adapter retains the most real secrets. GPT-5.5 clears more false alarms and has higher block precision, but at the cost of wrongly attempting to downgrade more true positives.

Because this gate is negative-heavy, with only about 21% real secrets, PR-AUC is also relevant: it captures how concentrated real secrets are among the items the gate would keep blocked:

Downgrade gate: PR-AUC

Negative-heavy holdout

Model              PR-AUC
Fine-tuned LoRA    0.543
GPT-5.5            0.541
Opus 4.8           0.528
Base Qwen 3.6      0.380

The frontier models remain competitive across the rest of the operating curve, including in the low false-positive region.

Open weights

Both fine-tunes are available on Hugging Face for the community to run locally, inspect verdicts and reasoning, and continue to build on.

Risk model

Downgrade model

Included are the PEFT LoRA adapter weights, tokenizer and config files, the exact system prompt used for training, and additional guidance for how to read and calibrate the log probabilities in the model outputs.

We're releasing the weights to make this work usable beyond Factory and to reaffirm our commitment to the open source AI community. We welcome any contribution or collaboration in improving these models for everyone.

Considerations and limitations

Risk dataset: We lean on ranking quality to account for limited positives; real secrets are rare and complicated to mine for, so the risk eval is deliberately balanced. We report only prevalence-invariant metrics (recall, FPR, ROC-AUC). The candidate pool is enriched: risk candidates are mined by secret-name keywords plus a thin random sample of non-firing edits, so the pool over-represents deterministic secret-shaped lines rather than semantic similarities.

Downgrade dataset: Because this model has to judge from context where every detected value is masked, there's a hard limit on the signal we can include in the data. Enriching the code window with non-sensitive value shape and entropy remains future work that requires additional production changes. On the negative-heavy holdout, PR-AUC sits at ~0.54 and is essentially tied with GPT-5.5.

Data, evaluation, and privacy: We do not store any scanner or secret detections in our analytics; class priors come from cached judge labels over historic hits, not a live stream. Our eval datasets are entirely public benchmark content re-weighted to this prior. Real private repositories may carry secret and code shapes that this benchmark fails to represent. Training verdicts come from an LLM judge ensemble anchored on a hand-labeled batch of otherwise-unknown samples; this keeps the labels grounded, but they remain susceptible to inheriting the judges' bias.

Deployment: Each gate is placed at a single threshold cut selected on our fixed evaluation holdout. A cut selected this way does not always hold up in real traffic, and we expect to further refine these operating points. The frontier comparison is coarser for closed models: GPT-5.5 and Opus 4.8 don't expose any token probabilities, so we have to tune them on a self-reported confidence rather than the calibrated score we read from our own models. This does introduce noise: we reran the samples with and without the confidence clause and observed matching verdicts on 95.4% of GPT risk rows, 91.3% of GPT downgrade rows, 97.9% of Opus risk rows, and 98.0% of Opus downgrade rows.


We present an improved version of Droid Shield: semantic secret detection from broader code context to augment deterministic scanning. Evaluated on a public benchmark representative of what we expect real traffic to look like, we show that our fine-tuned adapters are competitive with frontier LLMs while being smaller and faster.

As Droids take on more of an organization's commit volume, security is at the forefront of Factory's priorities. A strong protection layer is what keeps autonomous software engineering safe to scale.

Droid Shield 2.0 is currently in research preview. If you'd like access enabled for your organization, please reach out to our team.

If this work excites you, join us!
