Droid Shield 2.0: learned secret detection
By Factory Research, Octavian Sima - July 1, 2026 - 9 minute read
Factory's Droids write, refactor, and commit code autonomously at a volume that no human reviewer can completely monitor. Keeping autonomous software engineering safe at enterprise scale means secret detection that is trustworthy but doesn't introduce unnecessary friction.
Droid Shield is an extra line of defense against exposing potential secrets during an autonomous commit and push. It scans each line of a commit for known secret-shaped patterns and for degrees of randomness that indicate sensitive material. When a line looks suspicious, it blocks the commit and returns an error to the user. Our customers, including large enterprises, use Shield to keep every phase of their autonomous software factories operating safely and securely.
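Because the scanner is deterministic, it currently suffers from two failure modes: it misses real secrets that match no known pattern, and it raises false alarms on values that only look like secrets.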
We're introducing an improved version of Droid Shield with two new fine-tuned models, one for each of these cases. Adjusted to the models' best operating points, they are the strongest or tied-strongest classifiers on both tasks compared to the frontier, at a fraction of the cost and with a significant improvement in latency. We are releasing the model weights in the interest of compounding future work on software security and privacy.
Our models sit on opposite ends of the deterministic scanner — they don't review the same events and are optimized separately for distinct failure modes.
Risk: runs when the scanner did not fire, but the changed line still looks secret-bearing within the broader context. The catastrophic error is missing a real secret, so we've optimized this model for recall — we accept the tradeoff of introducing some extra warnings if that's what it takes to catch more true positives.
Downgrade: runs when the scanner did fire. It reads a similar context window, but every detected secret candidate is masked before the model sees it, so the model must decide from context alone whether the scanner hit should stay blocked or be cleared as a false alarm. As with risk, the catastrophic error is missing a real secret, in this case by clearing one the scanner already caught. We therefore clear a scanner hit only when the model is confident it is a false alarm, giving up some false-alarm reduction to lower the risk of letting a real secret through.
Each model emits a binary verdict plus a short user-facing reason. We score each example by the model's block probability, read from the token probabilities it exposes around the verdict, and compare that score against a threshold. The threshold is tunable on our side and can be reconfigured without retraining.
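A minimal sketch of how a block probability could be read from verdict-token log-probabilities and compared against a threshold. The function names, the renormalization over only the two verdict tokens, and the 0.5 default are assumptions for illustration; the calibration guidance shipped with the weights is the authority on how the scores should actually be read.

```python
import math


def block_probability(logprob_b: float, logprob_s: float) -> float:
    """Renormalize over the two verdict tokens (assumed here to be 'B' and 'S')."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_b, logprob_s)
    exp_b = math.exp(logprob_b - m)
    exp_s = math.exp(logprob_s - m)
    return exp_b / (exp_b + exp_s)


def gate(logprob_b: float, logprob_s: float, threshold: float = 0.5) -> str:
    """Return 'B' (block) when the score reaches the threshold, else 'S' (safe)."""
    return "B" if block_probability(logprob_b, logprob_s) >= threshold else "S"
```

Because the threshold is just an argument, moving the operating point is a configuration change rather than a training run.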
[Figure: Each learned model sits on one side of the deterministic scanner.]
Droid Shield 2.0 routes every commit through the deterministic scanner: a pattern hit goes to the downgrade model (scrubbed window), while an even broader filter selects secret-looking lines to go to the risk model (raw window).
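The routing described above could look roughly like the sketch below. Both regular expressions, the `Route` type, and the same-length asterisk masking are hypothetical stand-ins; the post does not describe the real scanner rules, the broader filter, or the exact masking scheme.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns, for illustration only.
SCANNER_PATTERN = re.compile(r"(?:AKIA[0-9A-Z]{16}|postgres://\S+)")
BROAD_FILTER = re.compile(r"(?i)(secret|token|passw|api[_-]?key)")


@dataclass
class Route:
    model: str          # "downgrade", "risk", or "none"
    lines: list[str]    # window handed to the model
    focus_line: int


def mask(line: str) -> str:
    """Replace every scanner match with asterisks of the same length."""
    return SCANNER_PATTERN.sub(lambda m: "*" * len(m.group(0)), line)


def route(lines: list[str], focus_line: int) -> Route:
    focus = lines[focus_line]
    if SCANNER_PATTERN.search(focus):
        # Scanner fired: the downgrade model only ever sees a scrubbed window.
        return Route("downgrade", [mask(line) for line in lines], focus_line)
    if BROAD_FILTER.search(focus):
        # Scanner missed, but the line looks secret-bearing: raw window.
        return Route("risk", list(lines), focus_line)
    return Route("none", [], focus_line)
```

The point of the sketch is the asymmetry: masking happens before the downgrade model is called, while the risk model receives the unmodified window.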
We train and evaluate on data shaped from Samsung's CredData, a public benchmark of real source files and annotated spans. Each example is annotated with T (real secret), F (false positive), or X (unknown).
To improve the quality of the judging used in our data generation, we hand-labeled a batch of these unknown (X) samples and included them in both the training and evaluation sets. They also serve as grounded reasoning examples for further judging in the data pipeline, including on broader sets of unknown rows.
Risk and downgrade use opposite slices of the same source material:
Every example, in both tasks, has the same shape: an input the model sees and a label it learns to emit. The input is the reconstructed code window (lines), the file extension, and focus_line, the index of the line under judgment. The label is a binary verdict (S = safe, B = real secret) along with a short reason.
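As a rough illustration of that shape, the sketch below models one row as a small validated record. The `Example` class and `from_row` helper are hypothetical names, and the zero-based reading of `focus_line` is an assumption (it is consistent with the examples in this post).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    extension: str        # e.g. "rb", or "<none>" when the file has no extension
    lines: list[str]      # reconstructed code window
    focus_line: int       # assumed zero-based index into `lines`
    verdict: str          # "S" (safe) or "B" (real secret)
    reason: str           # short user-facing explanation

    def __post_init__(self) -> None:
        if self.verdict not in ("S", "B"):
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        if not 0 <= self.focus_line < len(self.lines):
            raise ValueError("focus_line must index into lines")

    @classmethod
    def from_row(cls, row: dict) -> "Example":
        """Build an Example from the nested input/label layout."""
        inp, label = row["input"], row["label"]
        return cls(inp["extension"], inp["lines"], inp["focus_line"],
                   label["verdict"], label["reason"])
```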
Risk keeps the candidate value in the clear (CredData positives are public and obfuscated); this is what the model has to judge after a scanner miss. In this example we see a TOTP secret that sits in what looks like test code, but the shape looks real enough that it should be flagged for review:
{
"input": {
"extension": "rb",
"lines": [
" it 'meets basic functionality' do",
" otp = ROTP::TOTP.new('ILBLD5WABAUC7QAD')",
" uri = described_class.new(otp, account_name: 'alice@google.com')",
" expect(uri.to_s).to eq 'otpauth://totp/alice%40google.com?secret=GYKNE3FDCUPM3VEZ'",
" end",
"",
" it 'includes issuer' do"
],
"focus_line": 3
},
"label": {
"verdict": "B",
"reason": "Focused line includes an otpauth URI containing a visible 16-character base32-like TOTP secret. Although the surrounding lines are an RSpec test, the value is substantial and credential-shaped, so it warrants warning."
}
}
Downgrade is the opposite: the scanner already fired, so every detected secret is masked before the model sees it. The model must decide from context alone whether the masked hit was a real secret or a false alarm:
{
"input": {
"extension": "<none>",
"lines": [
"DATABASE_URL=********************************************/postgres",
""
],
"focus_line": 0
},
"label": {
"verdict": "B",
"reason": "Should remain blocked because the focused line assigns a masked DATABASE_URL value, which is a plausible database connection secret and lacks clear placeholder or non-production context."
}
}
For risk, our training set contains 5,000 rows: 2,000 real secrets (40%), 2,000 false alarms (40%), and 1,000 unknowns (20%). The eval set contains 427 real secrets and 427 false alarms. For security reasons, we chose to keep the eval set balanced rather than attempting to mine a production shape.
For downgrade, our training set contains 6,776 rows: 4,726 real secrets (69.7%), 1,940 false alarms (28.6%), and 110 unknowns (1.6%). The eval set contains 90 real secrets and 340 false alarms, about 21% real secrets, matching an aggregate production base rate we've sampled.
Training and evaluation mixes are chosen to match a file-type profile derived from aggregate session data, so the models see the same skew across file types (ts, py, md, tsx, etc.) that they will meet in real diffs and learn how secrets are surrounded in different languages. An LLM judge ensemble (GPT-5.5 as the primary judge, Opus 4.8 adjudicating) assigns the training verdict and reasoning. We ground this in the actual CredData source of truth to minimize the potential for judging errors and to help the fine-tune learn from a deterministic source.
Risk spans the entire CredData taxonomy, skewing toward passwords. This makes sense: passwords don't have any pre-defined entropy or format, so they are harder to deterministically catch:
[Figure: Share of candidate spans]
The downgrade set sees masked scanner hits, so its redacted credentials skew towards those with deterministic shapes. It leans heavily on keys and secrets:
[Figure: Share of masked spans]
On privacy: Real sessions are never training rows. The only production signals used are aggregate class priors, sampled across enough distinct sessions that they carry no identifying information. These are used to calibrate the curriculum and evaluation mix without ever exposing user content.
We chose Qwen 3.6 35B A3B as our base model due to its coding strength, reasoning capabilities, and improved cost/latency in comparison to the frontier. For future work, we intend to look into the performance of other even smaller LMs, including Nemotron 3 Nano.
Our final models are a rank-16 LoRA adapter for risk and a rank-64 adapter for downgrade.
[Figure: ROC-AUC, repo-level holdout]
We run evaluations on repo-level holdouts, where whole repositories from the source data are held out of training.
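A minimal sketch of a repo-level split, assuming each row carries a `repo` field. The field name, the hash-based assignment, and the 20% holdout fraction are all illustrative assumptions, not the split actually used for these models.

```python
import hashlib


def is_holdout(repo: str, holdout_fraction: float = 0.2) -> bool:
    """Deterministically assign a whole repository to the holdout set."""
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < holdout_fraction


def split_by_repo(rows: list[dict], holdout_fraction: float = 0.2):
    """Split rows so that no repository contributes to both train and eval."""
    train, holdout = [], []
    for row in rows:
        target = holdout if is_holdout(row["repo"], holdout_fraction) else train
        target.append(row)
    return train, holdout
```

Splitting on the repository rather than on the row keeps near-duplicate files and shared coding conventions from leaking between training and evaluation.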
As mentioned earlier, our risk eval's 427 / 427 split is deliberately balanced. We did not attempt to aggregate the real production shape, though we estimate it would resemble existing secret-detection and PII problems, where true positives are typically a very small fraction of candidates.
We report the metrics that survive the balanced setup: recall, FPR, and ROC-AUC.
Recall is computed only over the secrets and FPR only over the non-secrets, and ROC-AUC is built from those two rates, so changing how much of either side we include doesn't move them. In other words, these metrics are prevalence-invariant; they carry over from the balanced split to whatever the true production rate would be.
Precision and PR-AUC are not as faithful. At ~0.1% prevalence, for example, even a small FPR buries the real secrets in false positives: hold recall at 70% and FPR at 3%, and precision falls from ~96% on the balanced split to ~2%, without the model changing at all. Unlike downgrade, whose eval already matches an estimated production base rate, risk has no reliable precision number we can claim from this eval.
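The arithmetic behind that drop can be checked directly. The sketch below applies the standard identity precision = TP / (TP + FP) with a fixed recall and FPR; the function name is illustrative.

```python
def precision_at_prevalence(recall: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed (recall, FPR) pair at a given positive rate."""
    true_pos = recall * prevalence            # share of all items that are caught secrets
    false_pos = fpr * (1.0 - prevalence)      # share of all items that are false alarms
    return true_pos / (true_pos + false_pos)


balanced = precision_at_prevalence(0.70, 0.03, 0.5)    # about 0.959
rare = precision_at_prevalence(0.70, 0.03, 0.001)      # about 0.023
```

Recall and FPR are the same in both calls; only the mix of positives and negatives changes.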
Fine-tuning improves the classification on our held-out evaluation for both use-cases:
[Figure: ROC-AUC, base model vs. fine-tuned adapter]
In evaluation against the frontier (GPT-5.5 and Opus 4.8 with default reasoning levels):
[Figure: ROC-AUC vs. the frontier]
Both gates optimize for recall, but each uses a different tradeoff mechanism. For a fair comparison, we place each evaluated model at its own best operating point for the chosen objective, which also accounts for any implicit differences in how the models report confidence.
Risk is recall-first under a false-positive budget, enforced on a Wilson 95% upper bound. We report two false-positive budgets: a strict FPR <= 0.05, and the looser FPR <= 0.10 that the shipped risk gate currently runs at. For each model, we keep only the thresholds whose Wilson upper bound on FPR stays within the budget, then take the best recall among them.
At a strict 0.05 budget:
Wilson 95% CI, n = 427
| Model | Recall | 95% CI | FPR |
|---|---|---|---|
| Fine-tuned LoRA | 0.698 | [0.653, 0.740] | 0.028 |
| GPT-5.5 | 0.588 | [0.541, 0.634] | 0.021 |
| Opus 4.8 | 0.574 | [0.526, 0.620] | 0.028 |
| Base Qwen 3.6 | 0.563 | [0.516, 0.610] | 0.023 |
At a more lenient 0.10 budget:
Wilson 95% CI, n = 427
| Model | Recall | 95% CI | FPR |
|---|---|---|---|
| Fine-tuned LoRA | 0.878 | [0.844, 0.906] | 0.070 |
| Opus 4.8 | 0.852 | [0.816, 0.883] | 0.070 |
| GPT-5.5 | 0.707 | [0.662, 0.748] | 0.054 |
| Base Qwen 3.6 | 0.629 | [0.582, 0.674] | 0.066 |
The fine-tuned adapter reaches the highest in-budget recall at both budgets.
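A sketch of the budgeted selection, using the standard Wilson score interval. The helper names and the exhaustive threshold sweep are assumptions; only the interval formula itself is standard.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def best_recall_in_budget(scores, labels, budget: float):
    """Highest-recall threshold whose Wilson upper bound on FPR fits the budget.

    scores: block probabilities; labels: 1 for real secret, 0 for false alarm.
    Returns (threshold, recall, fpr), or None if no threshold fits.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        _, fpr_upper = wilson_interval(fp, n_neg)
        if fpr_upper <= budget and (best is None or tp / n_pos > best[1]):
            best = (t, tp / n_pos, fp / n_neg)
    return best
```

If the 0.698 recall in the strict-budget table corresponds to 298 of 427 secrets (an inference from the rounded figure, not a stated count), `wilson_interval(298, 427)` gives roughly (0.653, 0.740), which matches the reported interval.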
Downgrade shares the same recall priority, but the catastrophic error is clearing a real secret the scanner already caught. There's no natural false-positive ceiling to budget against because the scanner already blocked these lines; retaining a false alarm is the status quo, not a new cost. Instead, we score each threshold by net utility: the false alarms it clears minus lambda times the real secrets it wrongly clears.
lambda is an exchange rate: how many correctly cleared false alarms one missed secret is worth. At lambda = x, a missed secret costs as much as x good clears. Raising lambda pushes the chosen threshold down, trading more retained false alarms for higher secret recall. We sweep lambda over {1, 2, 5, 10} and ship lambda = 5.
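The net-utility rule can be sketched as a threshold sweep. This assumes a hit is cleared when its block probability falls below the threshold; the function names and the toy data in the usage lines are illustrative only.

```python
def net_utility(scores, labels, threshold: float, lam: float) -> float:
    """False alarms cleared minus lam times real secrets wrongly cleared.

    scores: block probabilities; labels: 1 for real secret, 0 for false alarm.
    A scanner hit is cleared when its score falls below the threshold.
    """
    pairs = list(zip(scores, labels))
    cleared_false_alarms = sum(1 for s, y in pairs if y == 0 and s < threshold)
    cleared_secrets = sum(1 for s, y in pairs if y == 1 and s < threshold)
    return cleared_false_alarms - lam * cleared_secrets


def best_threshold(scores, labels, lam: float) -> float:
    """Sweep candidate thresholds and keep the one with the highest net utility."""
    candidates = sorted(set(scores)) + [1.01]  # 1.01 clears every hit
    return max(candidates, key=lambda t: net_utility(scores, labels, t, lam))


# Toy data: four false alarms and two real secrets.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
toy_labels = [0, 0, 1, 0, 0, 1]
t_lam1 = best_threshold(toy_scores, toy_labels, lam=1)
t_lam5 = best_threshold(toy_scores, toy_labels, lam=5)
```

On the toy data, the selected threshold is lower at lambda = 5 than at lambda = 1, so fewer hits are cleared and more secrets stay blocked.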
At each model's lambda = 5 operating point on the unbalanced holdout:
Same repo-level holdout
| Model | Recall | Block precision | False alarms cleared |
|---|---|---|---|
| Fine-tuned LoRA | 0.856 | 0.405 | 0.668 |
| GPT-5.5 | 0.800 | 0.471 | 0.762 |
| Opus 4.8 | 0.767 | 0.352 | 0.626 |
| Base Qwen 3.6 | 0.589 | 0.327 | 0.679 |
At this cost setting, our adapter retains the most real secrets. GPT-5.5 clears more false alarms and has higher block precision, but it does so by wrongly clearing more real secrets.
Because this gate is negative-heavy, with only about 21% real secrets, PR-AUC is also relevant: it captures how concentrated real secrets are among the items the gate would keep blocked:
[Figure: PR-AUC on the negative-heavy holdout]
The frontier models remain competitive across the rest of the operating curve, including in the low false-positive region.
Both fine-tunes are available on Hugging Face for the community to run locally, inspect verdicts and reasoning, and continue to build on.
Included are the PEFT LoRA adapter weights, tokenizer and config files, the exact system prompt used for training, and additional guidance for how to read and calibrate the log probabilities in the model outputs.
We're releasing the weights to make this work usable beyond Factory and to reaffirm our commitment to the open source AI community. We welcome any contribution or collaboration in improving these models for everyone.
Risk dataset: We lean on ranking quality to account for limited positives; real secrets are rare and complicated to mine for, so the risk eval is deliberately balanced. We report only prevalence-invariant metrics (recall, FPR, ROC-AUC). The candidate pool is enriched: risk candidates are mined by secret-name keywords plus a thin random sample of non-firing edits, so the pool over-represents deterministic secret-shaped lines rather than semantic similarities.
Downgrade dataset: Because this model has to judge from context where every detected value is masked, there's a hard limit on the signal we can include in the data. Enriching the code window with non-sensitive value shape and entropy remains future work that requires additional production changes. On the negative-heavy holdout, PR-AUC sits at ~0.54 and is essentially tied with GPT-5.5.
Data, evaluation, and privacy: We do not store any scanner or secret detections in our analytics; class priors come from cached judge labels over historic hits, not a live stream. Our eval datasets are entirely public benchmark content re-weighted to this prior, so real private repositories may carry secret and code shapes that they fail to represent. Training verdicts come from an LLM judge ensemble anchored on a hand-labeled batch of otherwise-unknown samples; this keeps the labels grounded, but they remain susceptible to inheriting the judge's bias.
Deployment: Each gate is placed at a single threshold cut selected on our fixed evaluation holdout. A cut chosen this way does not always hold up in real traffic, and we expect to further refine these operating points. The frontier comparison is coarser for closed models: GPT-5.5 and Opus 4.8 don't expose any token probabilities, so we have to tune them on a self-reported confidence rather than the calibrated score we read from our own models. This does introduce noise: we reran the samples with and without the confidence clause and observed matching verdicts on 95.4% of GPT risk rows, 91.3% of GPT downgrade rows, 97.9% of Opus risk rows, and 98.0% of Opus downgrade rows.
We present an improved version of Droid Shield: semantic secret detection from broader code context to augment deterministic scanning. Evaluated on a public benchmark representative of what we expect real traffic to look like, we show that our fine-tuned adapters are competitive with frontier LLMs while being smaller and faster.
As Droids take on more of an organization's commit volume, security is at the forefront of Factory's priorities. A strong protection layer is what keeps autonomous software engineering safe to scale.
Droid Shield 2.0 is currently in research preview. If you'd like access enabled for your organization, please reach out to our team.
If this work excites you, join us!