Factory.ai

Deferred Context Engine

By Factory Research, Shashank Sharma - May 20, 2026 - 4 minute read

Droid keeps internal tools, MCP tools, skills, and plugins reachable without loading every schema on every turn. Production telemetry quantifies estimated input-token savings by catalog size.

Droid's context engine now loads internal tool schemas, MCP tool schemas, skills, and plugin instructions only when a task needs them. Factory's Deferred Context Engine reduces input context by keeping unused schemas and instructions out of the prompt until execution requires them.

Deferred context has been running in production. Over the past five days, among measured Droid turns in sessions that triggered MCP tools, it cut estimated input tokens by 15.1% on average and reached a 39.4% p90 reduction. Sessions with 100+ hidden deferred tools reached a 50.8% average reduction.

Droid still keeps the full context graph reachable: system instructions, repository rules, memories, internal tools, MCP tools, custom skills, plugins, subagents, prior messages, and live artifacts. It starts with compact discovery metadata. When a task needs a capability, Droid loads the relevant schema or instruction set; everything else stays out of the prompt.

The context bloat problem

MCP expands Droid's execution surface. A single session can connect to Sentry, Linear, GitHub, Figma, Playwright, Notion, Stripe, Vercel, Supabase, and a private internal registry. That enterprise stack is roughly 330 public MCP tools, or about 47K schema tokens, which places it well inside the 100+ hidden-tool range where the largest reductions were measured.
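As a rough check on those figures, dividing the schema tokens by the tool count gives an average schema size per tool. The sketch below uses only the two numbers quoted above; the per-tool average is a derived estimate, not a reported metric.

```python
# Figures quoted in the text above.
PUBLIC_MCP_TOOLS = 330
SCHEMA_TOKENS = 47_000  # "about 47K", so treat the result as approximate

# Derived value, not a reported metric: average schema tokens per tool.
avg_tokens_per_tool = SCHEMA_TOKENS / PUBLIC_MCP_TOOLS

print(f"~{avg_tokens_per_tool:.0f} schema tokens per tool")
```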

Skills and plugins add another layer. A skill packages a workflow: how to triage a Linear issue, run a QA flow, debug a Sentry alert, or implement a frontend surface using a team's design system. A plugin can bundle many skills, commands, MCP servers, hooks, and specialized Droids.

Droid keeps those capabilities reachable without loading all of them into every prompt. Naive loading turns the prompt into an unfiltered capability manifest: every tool schema, skill instruction, and plugin capability, plus an implicit request to ignore most of them.

The Deferred Context Engine

The Deferred Context Engine uses progressive disclosure to separate discovery from execution.

On each turn, Droid keeps a compact capability index: tool names, short descriptions, server names, and enough input hints to decide whether a capability is relevant. Full schemas and long-form instructions stay deferred until Droid calls a loader. The loader promotes the capability, which then stays available for the rest of the work.
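To make the idea of a compact index concrete, here is a minimal sketch of what one index entry could look like. The `IndexEntry` shape, the `render_index` helper, and the example tool names are all assumptions for illustration; the post does not describe Droid's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Compact discovery metadata for one capability (hypothetical shape)."""
    name: str
    server: str
    description: str
    input_hints: tuple[str, ...] = ()


def render_index(entries: list[IndexEntry]) -> str:
    """Render one short line per capability; full schemas are never included."""
    lines = []
    for e in entries:
        hints = f" ({', '.join(e.input_hints)})" if e.input_hints else ""
        lines.append(f"{e.server}/{e.name}: {e.description}{hints}")
    return "\n".join(lines)


# Illustrative entries only; these are not real tool definitions.
entries = [
    IndexEntry("create_issue", "linear", "Create a Linear issue", ("title", "team")),
    IndexEntry("get_alert", "sentry", "Fetch a Sentry alert", ("alert_id",)),
]
index_text = render_index(entries)
print(index_text)
```

Each capability costs one short line in this form, while a full JSON schema with parameter types and descriptions would cost far more.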

How it works

Deferred context uses the same pattern across internal tools, MCP tools, skills, and plugins:

  • Discover: Startup context carries compact metadata: names, sources, short descriptions, and enough hints to decide relevance.
  • Promote: When a task needs a hidden capability, Droid loads the full schema or instruction set through built-in context expansion tools.
  • Reuse: Loaded capabilities stay exposed for the rest of the work. Frequently used internal and MCP tools stay warm in the tool cache across sessions; long-tail capabilities remain deferred.
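The three steps above can be sketched as a small toy model. This is not Droid's implementation: `DeferredCatalog`, `fake_loader`, and the tool names are hypothetical, and the cross-session cache is left out.

```python
class DeferredCatalog:
    """Toy model of discover -> promote -> reuse. Not Droid's implementation."""

    def __init__(self, summaries, load_schema):
        self._summaries = dict(summaries)  # name -> short description
        self._load_schema = load_schema    # hypothetical loader callback
        self._promoted = {}                # name -> full schema, once loaded
        self.load_calls = 0

    def discover(self):
        """Return compact metadata only; no full schemas."""
        return dict(self._summaries)

    def promote(self, name):
        """Load the full schema on first use; reuse it afterwards."""
        if name not in self._summaries:
            raise KeyError(f"unknown capability: {name}")
        if name not in self._promoted:
            self._promoted[name] = self._load_schema(name)
            self.load_calls += 1
        return self._promoted[name]

    def exposed(self):
        """Names whose full schema has been loaded so far."""
        return sorted(self._promoted)


def fake_loader(name):
    # Stand-in for fetching a real schema from a tool server.
    return {"name": name, "parameters": {"type": "object", "properties": {}}}


catalog = DeferredCatalog(
    {"get_alert": "Fetch a Sentry alert", "create_issue": "Create a Linear issue"},
    fake_loader,
)
catalog.promote("get_alert")
catalog.promote("get_alert")  # second call reuses the loaded schema
print(catalog.exposed(), catalog.load_calls)
```

Only the promoted capability carries a full schema; `create_issue` remains visible through `discover()` but costs nothing beyond its one-line summary.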

Teams can install large tool catalogs, MCP servers, workflow skills, and plugin bundles without placing every schema and playbook in every prompt. The baseline is what Droid would have carried if all capabilities were loaded upfront. The net context is what Droid actually carries after deferral.
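One plausible way to turn baseline and net context into a reduction figure is a simple relative difference. The post does not spell out the exact formula, so treat the function below as an assumption; the token counts in the example are made up.

```python
def reduction_pct(baseline_tokens: int, net_tokens: int) -> float:
    """Percentage of estimated input tokens avoided by deferral."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - net_tokens) / baseline_tokens


# Hypothetical turn: 60,000 tokens if everything were loaded upfront,
# 45,000 tokens actually carried after deferral.
print(reduction_pct(60_000, 45_000))
```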

Production telemetry shows sparse MCP execution: 16.6% of telemetry sessions started MCP servers, but only 5.4% executed an MCP tool. MCP access still matters when users need it, but most sessions do not need every internal or MCP schema loaded upfront. The cache handles recurring cases; deferral handles the long tail.
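The split between a warm cache and a deferred long tail can be illustrated with a frequency-based selection. The post does not say how Droid decides which tools stay warm, so the top-k policy, the `pick_warm_tools` helper, and the usage log below are purely hypothetical.

```python
from collections import Counter


def pick_warm_tools(usage_log, max_warm):
    """Return the most frequently used tool names (hypothetical policy)."""
    counts = Counter(usage_log)
    return [name for name, _ in counts.most_common(max_warm)]


# Made-up usage history across sessions.
usage_log = [
    "get_alert", "create_issue", "get_alert",
    "run_query", "get_alert", "create_issue",
]
warm = pick_warm_tools(usage_log, max_warm=2)
deferred = sorted(set(usage_log) - set(warm))
print(warm, deferred)
```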

Production results

For the accompanying paper, Factory measured the Deferred Context Engine across production Droid telemetry from the past five days.

We report results for sessions that triggered MCP tools because those are the professional enterprise setups the Deferred Context Engine is optimized for. We bucket those sessions by hidden deferred tool count because catalog size drives estimated input-token savings. The count includes internal and MCP tools. Raw pre-load and post-load comparisons mix different cohorts: post-load turns come disproportionately from users with larger deferred tool and plugin catalogs. Bucketing by hidden tool count controls for that and shows the curve directly.
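The bucketing step can be sketched as follows. Only the 20-50 and 100+ buckets are named in this post; the other bucket edges, the function names, and the sample data are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Bucket edges are assumptions, except that the text reports
# a 20-50 bucket and a 100+ bucket.
BUCKETS = [(0, 19, "0-19"), (20, 50, "20-50"), (51, 99, "51-99"), (100, None, "100+")]


def bucket_label(hidden_tools: int) -> str:
    """Map a hidden deferred tool count to its bucket label."""
    for low, high, label in BUCKETS:
        if hidden_tools >= low and (high is None or hidden_tools <= high):
            return label
    raise ValueError("hidden_tools must be non-negative")


def mean_reduction_by_bucket(sessions):
    """sessions: iterable of (hidden_tool_count, reduction_pct) pairs."""
    grouped = defaultdict(list)
    for hidden_tools, reduction in sessions:
        grouped[bucket_label(hidden_tools)].append(reduction)
    return {label: mean(values) for label, values in grouped.items()}


# Made-up sample data, for illustration only; not production telemetry.
sample = [(5, 2.0), (30, 10.0), (40, 14.0), (150, 40.0)]
by_bucket = mean_reduction_by_bucket(sample)
print(by_bucket)
```

Averaging within each bucket keeps small and large catalogs from being blended into one figure, which is the cohort-mixing problem described above.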

Figure: Bar chart showing estimated input-token savings rising with hidden tool count.

The savings scale with catalog size. Small catalogs see little benefit; large catalogs avoid carrying thousands of unused schema tokens each turn. At 20-50 hidden tools, MCP-triggered sessions reduced estimated input tokens by 21.0% on average. At 100+ hidden tools, the average reduction reached 50.8%. Naive loading has to process more context as the graph grows, so demand-driven loading becomes more valuable as catalogs get larger.

The all-turns aggregate is intentionally not the headline: it includes small sessions, empty deferred catalogs, and reminder overhead. When bucketed by catalog size, the largest reductions came from sessions with larger internal tool, MCP, and plugin surfaces.

Fewer input tokens reduce selection noise

Fewer input tokens reduce fresh model processing, latency, and selection noise.

Prompt caching makes repeated context cheaper to process, but irrelevant tool definitions still occupy the model's working set. Even with caching, the model still has to process those tool definitions enough to determine they are unrelated to the task.

Large tool catalogs create three failure modes:

  1. Attention dilution: Relevant files, requirements, and user instructions compete with unused schemas.
  2. Tool-selection noise: Similar tool names and parameter shapes increase the chance of selecting the wrong capability.
  3. Earlier compression: Static context fills the window faster, so long tasks hit compression sooner.

Deferral addresses all three: the model sees fewer irrelevant schemas, tool choice becomes a smaller classification problem, and more of the context window remains available for the task itself: files, errors, decisions, and test results.

Compression is where long tasks pay for excess context. Every compression pass has to condense decisions, errors, and partial results into a smaller summary. Hit that boundary more often, and the agent has less raw evidence to work from.
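A toy linear model shows why static context brings compression forward. Everything here is hypothetical except the roughly 47K schema-token figure quoted earlier in this post: the window size, the per-turn growth, the base prompt size, and the size of the compact index are invented for the example.

```python
def turns_before_compression(window_tokens, static_tokens, tokens_per_turn):
    """Whole turns that fit before the window is full (toy linear model)."""
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    free = max(window_tokens - static_tokens, 0)
    return free // tokens_per_turn


# Hypothetical values, apart from the ~47K schema tokens from the text.
WINDOW = 200_000     # assumed context window
PER_TURN = 4_000     # assumed task context added per turn
BASE_PROMPT = 10_000 # assumed system instructions and rules
FULL_SCHEMAS = 47_000
COMPACT_INDEX = 2_000  # assumed size of discovery metadata

upfront_turns = turns_before_compression(WINDOW, BASE_PROMPT + FULL_SCHEMAS, PER_TURN)
deferred_turns = turns_before_compression(WINDOW, BASE_PROMPT + COMPACT_INDEX, PER_TURN)
print(upfront_turns, deferred_turns)
```

Under these assumed numbers, the deferred setup fits more turns of task context before the first compression pass. Real sessions grow unevenly, so this only shows the direction of the effect.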

What changes for users

  • Add internal and MCP tools without loading every schema on every turn.
  • Install specialized skills without injecting every playbook into every task.
  • Distribute plugins without making each user carry the full plugin catalog in the prompt.
  • Keep frequent internal and MCP tools warm in the cache and leave long-tail tools deferred.
  • Reduce input-token usage for MCP-triggered enterprise sessions by roughly 15% on average.

This works best when skills stay narrow and outcome-focused, and plugins package coherent workflows, not arbitrary instructions. Users should not have to manually enable and disable MCP servers just to conserve context. Droid defers unused MCP schemas automatically.

Deferred context keeps larger catalogs reachable without linear prompt growth: discovery metadata stays upfront, and full schemas load only when needed.

Try Droid.
