Encoding Enterprise Judgment into Agentic Coding

Mystro is an AI-native coding system designed to operate inside an enterprise production codebase. It executes coding tasks under a controlled orchestration loop, exposes a constrained capability surface aligned with repository standards, and enforces an explicit, evolving coding policy.

The primary goal of the system is to leverage prompt optimization to create a more reliable coding agent. To prove that this optimization loop functions under real-world conditions, the scaffolding required to run it had to be built and exercised against a real repository. Consequently, Mystro is a bootstrapped framework where, to date, over 95% of the codebase has been written agentically. Mystro has recently begun contributing to its own development. This was a structural necessity to reach the threshold where the system can run on itself, be evaluated mechanically, and treat prompt optimization as a deterministic engineering problem.

Why general-purpose coding agents struggle in enterprise environments

In greenfield or small repositories, agents are often allowed to control their own loop. They decide which files to inspect, which commands to run, when validation occurs, and how to recover from failure. The orchestration is prompt-driven and optimized for task completion.

Enterprise environments behave differently. Mature repositories already have deterministic systems embedded in them: builds, linters, test suites, security checks, and architectural boundaries. These systems exist and run independently of any agent. Changes must conform to these established patterns and conventions. These are enforced conditions, not suggestions embedded in prompts.

When general-purpose agents are applied directly to these environments, three failure modes typically surface:

The agent controls the full loop instead of operating inside one.
The agent uses tools that do not reflect enterprise structure.
The agent writes code according to its training data (priors), rather than enterprise policy.

Mystro addresses these issues by restructuring how the model participates in the system.

1. The agent is a component inside the loop

In many agent frameworks, the model proposes a plan, executes it, validates the result, and retries as needed. The loop is implicit inside the agent’s reasoning.

In an enterprise context, that structure introduces ambiguity. When the model owns the loop:

Validation depends on prompt interpretation.
Enforcement can be bypassed if instructions fall out of the context window.
Repository state can drift from what the agent believes it to be.

In an enterprise setting, this should be inverted. The loop already exists, and the model must be inserted into that loop as a component.

Mystro enforces this separation through a deterministic orchestrator. During execution, the agent is restricted to single-action turns. The loop requires the agent to output a body-only script. C# (CSX) was chosen for this action surface because its strong typing provides rigorous compile-time validation boundaries and reliable Abstract Syntax Tree (AST) parsing before execution. While the current architecture enforces a single-action constraint to establish strict observability baselines, the system will evolve to support multi-step scripting against the toolkit as it matures.

The orchestrator aggressively validates the script using the AST parser to ensure it contains exactly one action, no markdown fences, and no unapproved directives.

The orchestrator executes the action, captures the telemetry, and feeds the result back to the agent. The agent does not decide whether validation runs; the loop requires it. If an action fails, the orchestrator applies back pressure to handle the state rollback. It catches the error, generates a precise failure digest containing the exit code and output tail, and forces the agent to halt, absorb the deterministic feedback, and course-correct.

The model proposes. The loop enforces.

2. Tools must reflect enterprise structure

General-purpose agents typically operate with primitive tools: shell access, file reads, pattern searches, and generic execution commands. These tools are flexible but loosely structured.

Enterprise repositories are not loosely structured. They encode architectural patterns, directory conventions, ignore rules, and invariants that accumulate over time. If the tool surface does not reflect these constraints, the agent must rediscover them repeatedly through trial and error. Unrestricted shell access also introduces severe nondeterminism and context bloat.

Mystro denies arbitrary shell access. All actions execute inside an isolated Docker container and expose composite SDK operations rather than raw primitives. For example, file enumeration and search constraints are encoded explicitly in the SDK:

/// <summary>
/// Locates files by path substring or wildcard pattern.
/// </summary>
/// <tool_category>discovery</tool_category>
/// <use_instead_of>find</use_instead_of>
/// <result_visibility>Automatically emits full structured locate results.</result_visibility>
/// <param_rule name="pattern">Substring by default; supports '*' and '?' wildcard patterns.</param_rule>
public LocateFileResult LocateFile(string pattern, LocateFileOptions? options = null)
{
    return ExecuteAction(
        action: () => LocateFileInternal(pattern, options),
        returnSelector: result => result);
}

Rather than hand-writing tool specs, Mystro derives tool metadata from the SDK itself and injects it into the prompt, making the executable contract the source of truth.

The entire SDK tool surface is designed for strict observability. Every method automatically routes through an ExecuteAction wrapper that emits deterministic JSON telemetry. This abstraction controls execution, standardizes return payloads, protects the context window, and aligns the agent’s capabilities with repository standards.

3. Training data is not enterprise policy

Large models are trained on public repositories, which encode conventions that vary widely. Enterprises converge on local patterns shaped by team composition, risk tolerance, and historical decisions. If the agent writes code based solely on training priors, it will drift toward internet conventions.

Mystro addresses this through explicit, evolving coding policies called Practices. Practices are modular constraints describing how the enterprise expects code to be written (e.g., coding_style.md, external_io_abstractions.md). A routing agent selectively injects relevant practices into the execution loop.

Consider a task requiring the agent to ignore generated script files and temporary directories. Relying on its training data, the agent generates functional but stylistically incorrect code using chained equality checks:

--- a/src/Mystro.SDK/MystroSDK.cs
+++ b/src/Mystro.SDK/MystroSDK.cs
@@ -451,7 +451,8 @@ public sealed class MystroSDK
              foreach (var directory in directories)
              {
                  var name = Path.GetFileName(directory);
+                if (string.Equals(name, ".git", StringComparison.OrdinalIgnoreCase)
+                    || string.Equals(name, "tmp", StringComparison.OrdinalIgnoreCase))
                  {
                      continue;
                  }
@@ -471,7 +472,15 @@ public sealed class MystroSDK
 
              foreach (var file in files)
              {
-                yield return ToWorkspaceRelativePath(file);
+                var relativePath = ToWorkspaceRelativePath(file);
+                if (string.Equals(relativePath, "script_body.csx", StringComparison.OrdinalIgnoreCase)
+                    || string.Equals(relativePath, "script_result.json", StringComparison.OrdinalIgnoreCase)
+                    || relativePath.StartsWith("tmp/", StringComparison.OrdinalIgnoreCase))
+                {
+                    continue;
+                }
+
+                yield return relativePath;
              }
          }
      }

An experienced engineer recognizes this as an anti-pattern for set membership. Chained || operators increase cyclomatic complexity, scale poorly as requirements change, and conflate data with logic. The enterprise expectation is to model fixed membership lists as static readonly sets, preferring HashSet<T> for O(1) lookup performance and cleaner maintenance.

To enforce this, an explicit baseline policy is established:

---
Name: coding_style
TargetRole: Engineer
AreaOfConcern: Code Style
---

Never use chained `||` equality comparisons to represent membership in a fixed group. Model fixed membership lists as `static readonly HashSet<T>` and use `Contains(...)`. Preserve required case-insensitive behavior.

When Mystro evaluates the run against this policy, the evaluation engine detects the anti-pattern in the diff. It produces a structured contract attributing the failure directly to the coding_style practice:

{
  "pass": false,
  "score": 0.7,
  "practice_attribution": {
    "offending_practices": [
      "coding_style"
    ],
    "notes_by_practice": {
      "coding_style": [
        "Avoid chained '||' equality checks for membership. Use HashSet<T> with Contains(...)."
      ]
    }
  }
}

Because the failure is explicitly attributed, a prompt optimization pipeline reads this feedback and mutates the coding_style.md practice.

Deterministic evaluator scoring and passing the updated practice bundle

It refines the policy instruction until it is rigid enough to override the model’s baseline training. Upon replay with the optimized practice, the agent produces the correct enterprise pattern:

--- a/src/Mystro.SDK/MystroSDK.cs
+++ b/src/Mystro.SDK/MystroSDK.cs
@@ -7,6 +7,18 @@ namespace Mystro.SDK;
 
 public sealed class MystroSDK
 {
+    private static readonly HashSet<string> IgnoredDirectoryNames = new(StringComparer.OrdinalIgnoreCase)
+    {
+        ".git",
+        "tmp"
+    };
+
+    private static readonly HashSet<string> IgnoredFileNames = new(StringComparer.OrdinalIgnoreCase)
+    {
+        "script_body.csx",
+        "script_result.json"
+    };
+
     private readonly ISdkActionEventSink _eventSink;
 
@@ -451,7 +463,7 @@ public sealed class MystroSDK
              foreach (var directory in directories)
              {
                  var name = Path.GetFileName(directory);
-                if (string.Equals(name, ".git", StringComparison.OrdinalIgnoreCase))
+                if (IgnoredDirectoryNames.Contains(name))
                  {
                      continue;
                  }
@@ -471,6 +483,12 @@ public sealed class MystroSDK
 
              foreach (var file in files)
              {
+                var fileName = Path.GetFileName(file);
+                if (IgnoredFileNames.Contains(fileName))
+                {
+                    continue;
+                }
+
                 yield return ToWorkspaceRelativePath(file);
              }
          }

The coding-style violation is eliminated. The agent behavior shifts because the operational environment changed, not because the model was retrained.

What the system enables

With this structure in place, a single engineer can operate at a higher level of abstraction.

The human defines:

The goal.
The architectural constraints.
The acceptable validation criteria.

Mystro executes:

Repository inspection.
Code modification.
Build and validation cycles.
Structured retries.
Artifact recording.

The result is a validated diff that conforms to structural and policy constraints. The engineer does not review every generated line; they review outcomes against architectural intent. Implementation becomes automated execution within boundaries.

What comes next + scaling constraints

There is still substantial work ahead.

Context management

Currently, Mystro does not actively manage context boundaries. It still relies on the agent to determine what should change and where those changes belong based on the user prompt. As the system increasingly contributes to its own evolution, this responsibility will shift from implicit model reasoning toward explicit structural controls.

Orchestration loop

While deterministic in structure, it is not yet rigid enough to fully constrain state transitions and context propagation. As Mystro is used more extensively to modify its own codebase, the loop must evolve to enforce stronger invariants around execution order, rollback guarantees, and environmental isolation.

Practice selection

At present, routing has been simplified due to the limited number of defined practices. As the practice surface expands, more deliberate attribution and selection mechanisms will be required to prevent unnecessary context expansion and cross-practice interference.

Prompt optimization

Prompt optimization remains the largest scaling constraint.

As the system scales, the primary risk is policy drift. Practices are incrementally mutated to improve local evaluation scores, but local optimization can weaken global coherence. Over time, this may converge on a narrow success metric that erodes architectural integrity, where the model perfectly satisfies internal policy yet degrades broader engineering quality.

Mitigation requires regression-style evaluation across historical runs. Policy changes must be tested against prior goals to prevent convergence on a local maximum. Sustaining this level of validation implies dedicated evaluation passes and meaningful compute budgets to preserve long-term policy stability.

Thanks for reading

In enterprise environments, the loop enforces validation, capability boundaries, and policy compliance. The question is no longer whether a model can generate code. The question is whether we can control how it generates code inside our system.

That distinction determines whether an agent remains a tool or becomes infrastructure.