
Bendover: Encoding Enterprise Judgment into Agentic Coding

In Progress
Agentic Prompt Optimization Judgment

Bendover is an AI-native coding system designed to operate inside an enterprise production codebase. It executes coding tasks under a controlled orchestration loop, exposes a constrained capability surface aligned with repository standards, and enforces an explicit, evolving coding policy.

The primary goal of the system is to use prompt optimization to build a more reliable coding agent. Proving that this optimization loop works under real-world conditions required building the scaffolding to run it and exercising that scaffolding against a real repository. Bendover is therefore a bootstrapped framework: to date, over 95% of its codebase has been written agentically, and Bendover has recently begun contributing to its own development. Bootstrapping was a structural necessity, because the system had to reach the threshold where it can run on itself, be evaluated mechanically, and treat prompt optimization as a deterministic engineering problem.

Why general-purpose coding agents struggle in enterprise environments

In greenfield or small repositories, agents are often allowed to control their own loop. They decide which files to inspect, which commands to run, when validation occurs, and how to recover from failure. The orchestration is prompt-driven and optimized for task completion.

Enterprise environments behave differently. Mature repositories already have deterministic systems embedded in them: builds, linters, test suites, security checks, and architectural boundaries. These systems run independently of any agent, and changes must also conform to the repository's established patterns and conventions. These are enforced conditions, not suggestions embedded in prompts.

When general-purpose agents are applied directly to these environments, three failure modes typically surface:

  1. The agent controls the full loop instead of operating inside one.
  2. The agent uses tools that do not reflect enterprise structure.
  3. The agent writes code according to its training data (priors), rather than enterprise policy.

Bendover addresses these issues by restructuring how the model participates in the system.

1. The agent is a component inside the loop

In many agent frameworks, the model proposes a plan, executes it, validates the result, and retries as needed. The loop is implicit inside the agent’s reasoning.

In an enterprise context, that structure introduces ambiguity. When the model owns the loop:

  • Validation depends on prompt interpretation.
  • Enforcement can be bypassed if instructions fall out of the context window.
  • Repository state can drift from what the agent believes it to be.

In an enterprise setting, this should be inverted. The loop already exists, and the model must be inserted into that loop as a component.

Bendover enforces this separation through a deterministic orchestrator. During execution, the agent is restricted to single-action turns: on each turn, the loop requires it to output a body-only script. C# (CSX) was chosen for this action surface because its strong typing provides rigorous compile-time validation boundaries and reliable Abstract Syntax Tree (AST) parsing before execution. The current architecture enforces the single-action constraint to establish strict observability baselines; as the system matures, it will evolve to support multi-step scripting against the toolkit.

Before execution, the orchestrator validates the script with the AST parser and rejects it unless it contains exactly one action, no markdown fences, and no unapproved directives.
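A minimal sketch of that validation gate follows. The real orchestrator uses an AST parser; to keep the example self-contained, this sketch swaps in a deliberately naive text scanner, and `ValidateActionScript` is a hypothetical name. It assumes that one action means one `;`-terminated statement at the top level, and it treats any line starting with `#` (such as `#r` or `#load`) as an unapproved directive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the orchestrator's AST check.
// This is a text scanner, NOT a C# parser: one action is assumed to be exactly one
// ';'-terminated statement at brace depth 0, and ';' inside "..." literals is ignored.
List<string> ValidateActionScript(string script)
{
    var errors = new List<string>();

    // Markdown fences mean the model wrapped its answer in chat formatting.
    var fence = new string('`', 3);
    if (script.Contains(fence))
        errors.Add("markdown fence is not allowed");

    // CSX directives such as #r or #load could pull in unapproved dependencies.
    foreach (var rawLine in script.Split('\n'))
    {
        var line = rawLine.Trim();
        if (line.StartsWith("#"))
            errors.Add($"directive is not allowed: {line}");
    }

    // Count statement terminators outside string literals and braced blocks.
    int statements = 0, depth = 0;
    bool inString = false;
    for (int i = 0; i < script.Length; i++)
    {
        char c = script[i];
        if (inString)
        {
            if (c == '\\') i++;                  // skip the escaped character
            else if (c == '"') inString = false;
            continue;
        }
        switch (c)
        {
            case '"': inString = true; break;
            case '{': depth++; break;
            case '}': depth--; break;
            case ';' when depth == 0: statements++; break;
        }
    }

    if (statements != 1)
        errors.Add($"expected exactly one action, found {statements}");

    return errors;
}
```

A Roslyn-based version would parse the script with `CSharpSyntaxTree.ParseText` and `SourceCodeKind.Script` instead, which also handles statement forms this scanner miscounts, such as a braced `if` block.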

The orchestrator then executes the action, captures the telemetry, and feeds the result back to the agent. The agent does not decide whether validation runs; the loop requires it. If an action fails, the orchestrator applies back pressure: it rolls back the state change, catches the error, and generates a precise failure digest containing the exit code and the tail of the output. The agent must then halt, absorb this deterministic feedback, and course-correct before proposing its next action.

The model proposes. The loop enforces.
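The control flow described above can be sketched as follows, under stated assumptions: the agent is modeled as a plain `propose` callback, and `RunOrchestratorLoop`, `BuildFailureDigest`, the five-line output tail, and the turn limit are all hypothetical. The point is structural: the agent only returns a script, while the loop decides when validation and execution happen, when to roll back, and what feedback the agent sees next.

```csharp
using System;
using System.Linq;

// Hypothetical failure digest: the exit code plus the last few lines of output,
// so a long build log cannot flood the agent's context window.
string BuildFailureDigest(int exitCode, string output, int tailLines = 5)
{
    var lines = output.Replace("\r\n", "\n").TrimEnd('\n').Split('\n');
    var tail = string.Join("\n", lines.Skip(Math.Max(0, lines.Length - tailLines)));
    return $"exit_code: {exitCode}\noutput_tail:\n{tail}";
}

// Hypothetical orchestrator loop. The agent is only the `propose` callback; the loop
// owns validation, execution, rollback, and the feedback passed to the next turn.
(bool Succeeded, int Turns, string? LastFeedback) RunOrchestratorLoop(
    Func<string?, string> propose,                        // agent: feedback -> one-action script
    Func<string, string?> validate,                       // error message, or null if valid
    Func<string, (int ExitCode, string Output)> execute,  // runs the script in the sandbox
    Action rollback,                                      // restores state after a failed action
    int maxTurns = 5)
{
    string? feedback = null;
    for (int turn = 1; turn <= maxTurns; turn++)
    {
        var script = propose(feedback);

        // Invalid scripts never reach execution.
        var validationError = validate(script);
        if (validationError != null)
        {
            feedback = $"rejected before execution: {validationError}";
            continue;
        }

        var (exitCode, output) = execute(script);
        if (exitCode == 0)
            return (true, turn, feedback);

        // Back pressure: undo the failed action, then hand the agent a digest.
        rollback();
        feedback = BuildFailureDigest(exitCode, output);
    }
    return (false, maxTurns, feedback);
}
```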

2. Tools must reflect enterprise structure

General-purpose agents typically operate with primitive tools: shell access, file reads, pattern searches, and generic execution commands. These tools are flexible but loosely structured.

Enterprise repositories are not loosely structured. They encode architectural patterns, directory conventions, ignore rules, and invariants that accumulate over time. If the tool surface does not reflect these constraints, the agent must rediscover them repeatedly through trial and error. Unrestricted shell access also introduces severe nondeterminism and context bloat.

Bendover denies arbitrary shell access. All actions execute inside an isolated Docker container, and the agent works through composite SDK operations rather than raw primitives. For example, file enumeration and search constraints are encoded explicitly in the SDK:

/// <summary>
/// Locates files by path substring or wildcard pattern.
/// </summary>
/// <tool_category>discovery</tool_category>
/// <use_instead_of>find</use_instead_of>
/// <result_visibility>Automatically emits full structured locate results.</result_visibility>
/// <param_rule name="pattern">Substring by default; supports '*' and '?' wildcard patterns.</param_rule>
public LocateFileResult LocateFile(string pattern, LocateFileOptions? options = null)
{
    return ExecuteAction(
        action: () => LocateFileInternal(pattern, options),
        returnSelector: result => result);
}
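The `param_rule` above describes the pattern semantics only in prose. One plausible reading, sketched here with a hypothetical `MatchesLocatePattern` helper, is a substring match by default and a whole-path wildcard match when `*` or `?` appears. Matching against the workspace-relative path and ignoring case are assumptions, not documented behavior.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical matcher for the LocateFile `pattern` rule. Substring match by default;
// if '*' or '?' is present, the whole relative path must match the wildcard pattern.
// Case-insensitive matching is an assumption made for this sketch.
bool MatchesLocatePattern(string relativePath, string pattern)
{
    if (pattern.IndexOfAny(new[] { '*', '?' }) < 0)
        return relativePath.Contains(pattern, StringComparison.OrdinalIgnoreCase);

    // Escape regex metacharacters, then turn the escaped wildcards back into regex:
    // '*' -> any run of characters, '?' -> exactly one character.
    var regex = "^" + Regex.Escape(pattern).Replace(@"\*", ".*").Replace(@"\?", ".") + "$";
    return Regex.IsMatch(relativePath, regex, RegexOptions.IgnoreCase);
}
```

Escaping first and then re-expanding only the two wildcards keeps characters like `.` in file names literal, so `*.cs` does not accidentally match `script_body.csx`.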

Rather than hand-writing tool specs, Bendover derives tool metadata from the SDK itself and injects it into the prompt, making the executable contract the source of truth.
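As an illustration of deriving tool metadata from the SDK, the sketch below parses a method's `///` doc comment (including custom tags like `tool_category` and `param_rule`) with `System.Xml.Linq`. `ExtractToolMetadata` is hypothetical, and this write-up does not say how Bendover actually obtains the comments, whether from source, generated XML documentation files, or something else.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical: turn the /// doc comment above an SDK method into prompt-ready metadata,
// so the tool description comes from the same source as the executable method.
(string Summary, string Category, string UseInsteadOf, Dictionary<string, string> ParamRules)
    ExtractToolMetadata(string docComment)
{
    // Strip the "///" prefixes and wrap the fragments in a single root element.
    var xml = string.Join("\n", docComment
        .Split('\n')
        .Select(line => line.Trim())
        .Where(line => line.StartsWith("///"))
        .Select(line => line.Substring(3)));
    var root = XElement.Parse("<doc>" + xml + "</doc>");

    // Text content of a top-level tag, or "" if the tag is absent.
    string Text(string name) => root.Element(name)?.Value.Trim() ?? "";

    // One entry per <param_rule name="...">, keyed by parameter name.
    var paramRules = root.Elements("param_rule")
        .ToDictionary(e => (string?)e.Attribute("name") ?? "", e => e.Value.Trim());

    return (Text("summary"), Text("tool_category"), Text("use_instead_of"), paramRules);
}
```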

The entire SDK tool surface is designed for strict observability. Every method automatically routes through an ExecuteAction wrapper that emits deterministic JSON telemetry. This abstraction controls execution, standardizes return payloads, protects the context window, and aligns the agent’s capabilities with repository standards.
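The wrapper pattern can be sketched as a generic function that runs the action, emits one JSON event per call, and then applies the return selector. The `action` and `returnSelector` roles mirror the `LocateFile` call above; the `actionName` and `emit` parameters, the event fields, and the choice to omit timestamps so events stay deterministic are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical ExecuteAction-style wrapper (the real signature is not shown here).
// Every SDK call goes through it, so each call emits exactly one JSON telemetry event,
// on success or failure. Timestamps and durations are left out to keep events deterministic.
TReturn ExecuteAction<TResult, TReturn>(
    string actionName,
    Func<TResult> action,
    Func<TResult, TReturn> returnSelector,
    Action<string> emit)
{
    TResult result;
    try
    {
        result = action();
    }
    catch (Exception ex)
    {
        emit(JsonSerializer.Serialize(new { action = actionName, status = "error", error = ex.Message }));
        throw; // the orchestrator, not the SDK, decides how to recover
    }

    emit(JsonSerializer.Serialize(new { action = actionName, status = "ok", result }));
    return returnSelector(result);
}
```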

3. Training data is not enterprise policy

Large models are trained on public repositories, which encode conventions that vary widely. Enterprises converge on local patterns shaped by team composition, risk tolerance, and historical decisions. If the agent writes code based solely on training priors, it will drift toward internet conventions.

Bendover addresses this through explicit, evolving coding policies called Practices. Practices are modular constraints describing how the enterprise expects code to be written (e.g., coding_style.md, external_io_abstractions.md). A routing agent selectively injects relevant practices into the execution loop.
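Routing is performed by an agent, so it is not a pure function. As a rough, deterministic stand-in, the sketch below filters practices by role and by areas of concern that a router has already judged relevant to the task. The tuple mirrors the `Name`, `TargetRole`, and `AreaOfConcern` fields that Bendover practices declare; `RoutePractices` and the idea that routing reduces to this filter are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the routing agent. The relevant areas of concern are
// passed in explicitly here; in Bendover, deciding relevance is the router's job.
List<string> RoutePractices(
    IEnumerable<(string Name, string TargetRole, string AreaOfConcern)> practices,
    string role,
    ISet<string> relevantAreas)
{
    return practices
        .Where(p => string.Equals(p.TargetRole, role, StringComparison.OrdinalIgnoreCase))
        .Where(p => relevantAreas.Contains(p.AreaOfConcern))
        .Select(p => p.Name)
        .ToList();
}
```

Injecting only the selected practices keeps the prompt small and makes it traceable which policies were in force for a given run.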

Consider a task requiring the agent to ignore generated script files and temporary directories. Relying on its training data, the agent generates functional but stylistically incorrect code using chained equality checks:

--- a/src/Bendover.SDK/BendoverSDK.cs
+++ b/src/Bendover.SDK/BendoverSDK.cs
@@ -451,7 +451,8 @@ public sealed class BendoverSDK
              foreach (var directory in directories)
              {
                  var name = Path.GetFileName(directory);
-                if (string.Equals(name, ".git", StringComparison.OrdinalIgnoreCase))
+                if (string.Equals(name, ".git", StringComparison.OrdinalIgnoreCase)
+                    || string.Equals(name, "tmp", StringComparison.OrdinalIgnoreCase))
                  {
                      continue;
                  }
@@ -471,7 +472,15 @@ public sealed class BendoverSDK
 
              foreach (var file in files)
              {
-                yield return ToWorkspaceRelativePath(file);
+                var relativePath = ToWorkspaceRelativePath(file);
+                if (string.Equals(relativePath, "script_body.csx", StringComparison.OrdinalIgnoreCase)
+                    || string.Equals(relativePath, "script_result.json", StringComparison.OrdinalIgnoreCase)
+                    || relativePath.StartsWith("tmp/", StringComparison.OrdinalIgnoreCase))
+                {
+                    continue;
+                }
+
+                yield return relativePath;
              }
          }
      }

An experienced engineer recognizes this as an anti-pattern for set membership. Chained || operators increase cyclomatic complexity, scale poorly as requirements change, and conflate data with logic. The enterprise expectation is to model fixed membership lists as static readonly sets, preferring HashSet<T> for average-case O(1) lookups and cleaner maintenance.

To enforce this, an explicit baseline policy is established:

---
Name: coding_style
TargetRole: Engineer
AreaOfConcern: Code Style
---

Never use chained `||` equality comparisons to represent membership in a fixed group. Model fixed membership lists as `static readonly HashSet<T>` and use `Contains(...)`. Preserve required case-insensitive behavior.
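Practices like this one pair a small front-matter header with a free-text rule. A minimal parser for that layout might look like the following; `ParsePractice` is hypothetical, and it assumes simple `Key: Value` lines between `---` delimiters rather than full YAML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical practice-file parser: "---" front matter of Key: Value lines,
// followed by the free-text rule that gets injected into the prompt.
(Dictionary<string, string> Meta, string Body) ParsePractice(string text)
{
    var lines = text.Replace("\r\n", "\n").Split('\n');
    if (lines[0].Trim() != "---")
        throw new FormatException("practice must start with a '---' front-matter block");

    int end = Array.FindIndex(lines, 1, l => l.Trim() == "---");
    if (end < 0)
        throw new FormatException("unterminated front matter");

    var meta = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    foreach (var line in lines.Skip(1).Take(end - 1))
    {
        var idx = line.IndexOf(':');
        if (idx > 0)
            meta[line.Substring(0, idx).Trim()] = line.Substring(idx + 1).Trim();
    }

    var body = string.Join("\n", lines.Skip(end + 1)).Trim();
    return (meta, body);
}
```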

When Bendover evaluates the run against this policy, the evaluation engine detects the anti-pattern in the diff. It produces a structured contract attributing the failure directly to the coding_style practice:

{
  "pass": false,
  "score": 0.7,
  "practice_attribution": {
    "offending_practices": [
      "coding_style"
    ],
    "notes_by_practice": {
      "coding_style": [
        "Avoid chained '||' equality checks for membership. Use HashSet<T> with Contains(...)."
      ]
    }
  }
}

Because the failure is explicitly attributed, a prompt optimization pipeline reads this feedback and mutates the coding_style.md practice.
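Reading the contract is mechanical. The sketch below uses `System.Text.Json` to turn an evaluator result into a map from practice name to notes, which is what an optimizer needs to decide which practice file to rewrite. The JSON field names come from the contract above; `MutationTargets` and returning an empty map for passing runs are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical: convert an evaluator contract into optimizer work items,
// keyed by the practice to mutate, with the evaluator's notes as guidance.
Dictionary<string, List<string>> MutationTargets(string evaluationJson)
{
    using var doc = JsonDocument.Parse(evaluationJson);
    var root = doc.RootElement;
    var targets = new Dictionary<string, List<string>>();

    // A passing run gives the optimizer nothing to fix.
    if (root.GetProperty("pass").GetBoolean())
        return targets;

    var attribution = root.GetProperty("practice_attribution");
    var notesByPractice = attribution.GetProperty("notes_by_practice");
    foreach (var practice in attribution.GetProperty("offending_practices").EnumerateArray())
    {
        var name = practice.GetString() ?? "";
        targets[name] = notesByPractice.TryGetProperty(name, out var notes)
            ? notes.EnumerateArray().Select(n => n.GetString() ?? "").ToList()
            : new List<string>();
    }
    return targets;
}
```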


[Figure: Deterministic evaluator scoring and passing the updated practice bundle]

The pipeline refines the policy instruction until it is strict enough to override the model's training priors. Upon replay with the optimized practice, the agent produces the correct enterprise pattern:

--- a/src/Bendover.SDK/BendoverSDK.cs
+++ b/src/Bendover.SDK/BendoverSDK.cs
@@ -7,6 +7,18 @@ namespace Bendover.SDK;
 
 public sealed class BendoverSDK
 {
+    private static readonly HashSet<string> IgnoredDirectoryNames = new(StringComparer.OrdinalIgnoreCase)
+    {
+        ".git",
+        "tmp"
+    };
+
+    private static readonly HashSet<string> IgnoredFileNames = new(StringComparer.OrdinalIgnoreCase)
+    {
+        "script_body.csx",
+        "script_result.json"
+    };
+
     private readonly ISdkActionEventSink _eventSink;
 
@@ -451,7 +463,7 @@ public sealed class BendoverSDK
              foreach (var directory in directories)
              {
                  var name = Path.GetFileName(directory);
-                if (string.Equals(name, ".git", StringComparison.OrdinalIgnoreCase))
+                if (IgnoredDirectoryNames.Contains(name))
                  {
                      continue;
                  }
@@ -471,6 +483,12 @@ public sealed class BendoverSDK
 
              foreach (var file in files)
              {
+                var fileName = Path.GetFileName(file);
+                if (IgnoredFileNames.Contains(fileName))
+                {
+                    continue;
+                }
+
                 yield return ToWorkspaceRelativePath(file);
              }
          }

The coding-style violation is eliminated. The agent's behavior shifts because its operational environment changed, not because the model was retrained.
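The refine-and-replay cycle described above can be sketched as a small search loop. Here `replayAndScore` stands in for replaying the task and running the evaluator, `mutate` stands in for the optimizer's rewrite, and a candidate is kept only when its score strictly improves. `OptimizePractice`, the score threshold, and the iteration cap are all hypothetical.

```csharp
using System;

// Hypothetical refine-and-replay loop for a single practice.
(string Practice, double Score, int Iterations) OptimizePractice(
    string practice,
    Func<string, double> replayAndScore,  // replay the task with a practice, return the evaluator score
    Func<string, string> mutate,          // propose a revised practice from the current best
    double passThreshold = 1.0,
    int maxIterations = 5)
{
    var bestPractice = practice;
    var bestScore = replayAndScore(bestPractice);
    var iterations = 0;

    while (bestScore < passThreshold && iterations < maxIterations)
    {
        iterations++;
        var candidate = mutate(bestPractice);
        var score = replayAndScore(candidate);

        // Keep a candidate only if it strictly improves the evaluator score.
        if (score > bestScore)
        {
            bestPractice = candidate;
            bestScore = score;
        }
    }
    return (bestPractice, bestScore, iterations);
}
```

Because the model weights never change, the only state this loop updates is the practice text itself.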

What the system enables

With this structure in place, a single engineer can operate at a higher level of abstraction.

The human defines:

  • The goal.
  • The architectural constraints.
  • The acceptable validation criteria.

Bendover executes:

  • Repository inspection.
  • Code modification.
  • Build and validation cycles.
  • Structured retries.
  • Artifact recording.

The result is a validated diff that conforms to structural and policy constraints. The engineer does not review every generated line; they review outcomes against architectural intent. Implementation becomes automated execution within boundaries.

Scaling constraints

Bendover does not decide architectural direction; it enforces a chosen direction. Moving from a modular monolith to microservices, for example, is a business decision shaped by cost, risk, and team capability, so the system can enforce a chosen architecture but should not select one. Similarly, prioritization, tradeoffs, and long-term bets remain human responsibilities.

As this system scales, the primary risk is policy drift. When practices are incrementally mutated to improve local scores, they may weaken global coherence. Local optimization can converge on a narrow success metric that erodes architectural integrity over time, until the model follows internal policy perfectly but loses general engineering quality.

Mitigating this requires regression-style evaluation across historical runs: policy changes must be tested against prior goals to prevent convergence on a local maximum. Running continuous regression tests across an extensive harness requires dedicated evaluation passes and a significant compute budget for policy stability.
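A minimal form of that regression check is a gate that replays historical goals with the candidate practice bundle and rejects the change if any goal scores worse than its baseline. `RegressionGate`, the per-goal score map, and the tolerance parameter are illustrative assumptions, not Bendover's actual harness.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical regression gate: a mutated practice bundle is accepted only if no
// historical goal scores worse than its recorded baseline (minus a tolerance).
(bool Accepted, List<string> Regressions) RegressionGate(
    IReadOnlyDictionary<string, double> baselineScores,  // goal id -> score with the current bundle
    Func<string, double> scoreWithCandidate,             // replays one goal with the candidate bundle
    double tolerance = 0.0)
{
    var regressions = baselineScores
        .Where(goal => scoreWithCandidate(goal.Key) < goal.Value - tolerance)
        .Select(goal => goal.Key)
        .OrderBy(id => id, StringComparer.Ordinal)
        .ToList();
    return (regressions.Count == 0, regressions);
}
```

Every candidate triggers one replay per historical goal, which is where the compute cost of policy stability comes from.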

Thanks for reading

In enterprise environments, the loop enforces validation, capability boundaries, and policy compliance. The question is no longer whether a model can generate code. The question is whether we can control how it generates code inside our system.

That distinction determines whether an agent remains a tool or becomes infrastructure.