Introduction

Once you reach the implementation phase, you can simply ask your agent to implement the task and end it there. The agent will produce the code based on the specifications. If that’s all you do, the validation work is still yours: running unit tests, testing the functionality locally, and everything else a good engineer does to confirm the code is ready for production.

Stopping there means you are not using agentic workflows to their full potential. The real benefit of agentic engineering is that all of these steps can become part of the agent’s workflow. We do this during the specification phase, by giving the agent clear instructions on the validation steps it must complete before marking a task as done. The validation workflow ensures that generated code meets the quality standards required of any code going into production.

There is one big catch. To make this work, the agent needs the same level of access that you have when you validate code. Anything you do as part of the validation process, such as running tests, manual checks or UI checks, the agent should be able to do too. This means we need to make a real effort to ensure our project can be run locally, and that the agent has the tools to run all of the validations locally.

With this feedback loop, the agent has everything it needs to gather its own feedback and implement a feature end to end, from code generation to pull request. The loop lets the agent confirm that the code runs and achieves its goal, and that the result meets the quality standards for any code going into production.


SECTIONS

  • What is a feedback loop? - feedback loop cycle (illustrated), difference between “agent wrote code” and “agent wrote, ran, and checked code works and fixed it based on feedback”
  • Starting the implementation phase - how to prompt, prompt example
  • Setting up a local environment - Why does it need it; What does it mean to run locally (python examples), common blockers (CI/CD tests, lack of permissions)
  • Creating your validation loop - Example based - tests, custom scripts, when you need to go manual, choosing the right validations (optimising the solution)
  • Giving agent access to CLIs and MCPs - (refer to full chapter for WHAT is CLI & MCP) how they expand your agent’s capabilities (example of Databricks) - Refer to MCP and CLIs chapter
  • Scoping permissions safely - agents run with your credentials, access to dev environments only, do not connect to prod (dev & QA only), Least permissions approach (what if agent goes rogue)
  • Sandbox environments - Why you need it (safety + autopilot), what does it look like, how to set one up
  • Manual edits vs Fixing forward
  • Ownership and the pull request

What is a feedback loop

When you, as an engineer, implement code, you go through a series of steps to confirm that it works and does what it is supposed to do. You usually start by running the code to check that it runs at all. You may also run the unit tests to check that the new code doesn’t break other parts of the system, and write new tests for the code you added. If you are building an application with a user interface, you may run it locally and interact with it to confirm that the interface looks correct and everything behaves as expected.

All of these are different types of feedback that we use to check whether our code was implemented correctly. Every time we gather feedback, we compare it with our expectations. If it falls short, we use the feedback to fix the code so that it works as intended. We go around this feedback loop as many times as needed until the code meets the standard required to call our work “done” and ready for production.

When it comes to agents, the idea is the same. In our project, we set the different checks that the agent needs to go through before it can mark any task as complete. Every time the agent writes a piece of code, it goes through all the checks. If any of them fails, the agent looks at the feedback and fixes the code. The agent repeats this process until all checks pass.
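The loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular agent works internally: `feedback_loop`, `run_check` and the `fix` callback are hypothetical names, and `fix` stands in for the agent editing the code in response to the failures it receives.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str  # full stdout + stderr, so no context is lost


def run_check(name: str, command: list[str]) -> CheckResult:
    """Run one validation command and capture everything it prints."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)


def feedback_loop(checks: dict[str, list[str]], fix, max_rounds: int = 5) -> bool:
    """Run every check; on failure, hand the feedback to `fix` and try again."""
    for _ in range(max_rounds):
        results = [run_check(name, cmd) for name, cmd in checks.items()]
        failures = [r for r in results if not r.passed]
        if not failures:
            return True  # all checks pass: the task can be marked complete
        fix(failures)  # hypothetical stand-in for the agent changing the code
    return False  # still failing after max_rounds: time to involve a human


# Example checks (assumed commands; use whatever your project actually runs).
CHECKS = {
    "unit tests": ["pytest", "tests/"],
    "lint": ["make", "lint"],
}
```

The `max_rounds` cap is a design choice added for this sketch: without it, an agent that cannot satisfy a check would loop forever instead of handing the problem back to you.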

Let’s compare two approaches to implementing a feature with agentic engineering. In the first diagram, the agent implements the code but there is no validation. Once the code is implemented, it is up to us to run all the checks to ensure everything works. If any check fails, it is up to us to re-prompt the agent with the error message so that it can go back and fix the code.

flowchart LR
    A[Spec + Tasks] --> B[Agent Implements]
    B --> C[Human Tests Manually]
    C --> D{Issues Found?}
    D -- Yes --> E[Copy/Paste Error to Agent]
    E --> B
    D -- No --> F[Done]

    style C fill:#f9a8a8,stroke:#333,stroke-width:2px
    style E fill:#f9a8a8,stroke:#333,stroke-width:2px

With this workflow, the engineer becomes the feedback loop. We have to keep checking whether the agent has finished so that we can run the necessary checks ourselves. As you can imagine, this is very time consuming, and not much fun. I’m sure you have better things to do than copy error messages back and forth.

Another big issue with this approach is that the agent does not get the full context when you provide feedback. When you copy and paste an error message or the logs from a failed check, you may omit crucial information about what failed.

On the other hand, if the agent has the tools to run its own checks and gather its own feedback, it can validate the generated solution whenever it needs to. This takes the human out of the validation loop and automates implementation and validation end to end, with the checks setting the standard that every solution must meet. Because the agent runs the validations itself, it sees the full context of every failure and can use it to fix the code.

flowchart LR
    A[Spec + Tasks] --> B[Agent Implements]
    B --> C[Agent Runs Validation]
    C --> D{Issues Found?}
    D -- Yes --> E[Agent Fixes Code]
    E --> B
    D -- No --> F[Done - Ready for PR]

The agent implements, validates its own work, and iterates until all checks pass. No copy/paste. No human in the loop. The agent has full context on every failure and can fix issues autonomously.

The implementation phase

The implementation phase begins when we’ve collected all of the specifications in a document and we are ready to start implementing. At this point all of the requirements should be gathered and the specifications should be clear and detailed.

If you follow a spec driven development approach, all you need to do is ask the agent to implement the tasks. If you are not using a spec driven approach then I’d recommend you break down your specifications/plan into concrete tasks the agent can follow so that there is very little wiggle room for the agent to improvise.

The feedback loop should be part of your plan, added to the success criteria for every new piece of functionality. When creating an implementation plan, your agent should be aware of all the commands and scripts that can be used for validation, such as how to run the unit tests or run your project locally. If you have any scripts you use for testing, your agent should know about those too. When you break down your plan into tasks, add the validation criteria to every task that implements a feature.

This is an example of what a tasks.md file containing the concrete tasks might look like:

## Tasks: Implement POST /orders endpoint

### Task 1: Create the endpoint skeleton
- Create POST /orders endpoint in routes/orders.py
- Define request schema: order_id, customer_id, items (list), total_amount
- Define response schema: order_id, status, created_at
- Return 201 on success, 400 on invalid request
- [Testing] Run `pytest tests/test_orders_routes.py::test_create_order_schema` - must pass
- [Testing] Run `make lint` - must pass

### Task 2: Implement database persistence
- Create orders table if not exists (id, customer_id, status, total_amount, created_at)
- Insert order record on POST request
- Return persisted order data in response
- [Testing] Run `pytest tests/test_orders_repository.py` - must pass
- [Testing] Run `make db-migrate` - must complete without errors

### Task 3: Add input validation
- Validate customer_id exists (call GET /customers/{id})
- Validate items list is not empty
- Validate total_amount matches sum of items
- Return 422 on validation failure with error details
- [Testing] Run `pytest tests/test_orders_validation.py` - must pass
- [Testing] Run `python scripts/test_validation_scenarios.py` - all 12 scenarios must pass

### Task 4: Add error handling and logging
- Wrap database operations in try/except
- Log all order creation attempts with correlation ID
- Return 500 with error ID on internal failures
- [Testing] Run `pytest tests/test_orders_errors.py` - must pass
- [Testing] Check logs in `logs/orders.log` - correlation ID present on all entries

### Task 5: End-to-end validation
- Call POST /orders with valid payload via curl
- Verify response contains all required fields
- Verify record exists in database
- Verify log entry was created
- [Testing] Run `python scripts/e2e_test_orders.py` - all checks must pass
- [Testing] Run `make test-e2e` - must pass

In the example above, each task has its own validations and tests. We don’t wait until the end to validate our code.

flowchart LR
    P1[Phase 1] --> V1[Validate]
    V1 --> P2[Phase 2] --> V2[Validate]
    V2 --> P3[Phase 3] --> V3[Validate]
    V3 --> Done[Done]

    style V1 fill:#90EE90,stroke:#333
    style V2 fill:#90EE90,stroke:#333
    style V3 fill:#90EE90,stroke:#333

Each phase implements → validates → passes before the next phase begins. Failed validation means the agent fixes the issue based on the feedback before moving on to the next task. With this approach we ensure a valid implementation for each task, so that we don’t build downstream features using the wrong foundations.
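To make the per-task gate concrete, here is a minimal sketch that pulls the validation commands out of a tasks file shaped like the example earlier in this section. It assumes the exact conventions used there (`### Task N:` headings and `- [Testing] Run` lines with the command in backticks); `validation_commands` is a hypothetical helper, not part of any tool.

```python
import re

# Patterns match the conventions of the sample tasks.md (an assumption).
TASK_HEADING = re.compile(r"^### (Task \d+: .+)$")
TESTING_LINE = re.compile(r"^- \[Testing\] Run `([^`]+)`")


def validation_commands(tasks_md: str) -> dict[str, list[str]]:
    """Map each task heading to the commands on its '[Testing] Run' lines."""
    commands: dict[str, list[str]] = {}
    current = None
    for raw_line in tasks_md.splitlines():
        line = raw_line.strip()
        heading = TASK_HEADING.match(line)
        if heading:
            current = heading.group(1)
            commands[current] = []
            continue
        testing = TESTING_LINE.match(line)
        if testing and current is not None:
            commands[current].append(testing.group(1))
    return commands
```

Testing lines that are not a runnable command, such as "Check logs in `logs/orders.log`", are skipped by this sketch. Checks like that need either a small script of their own or a manual step, which is a useful thing to notice when you write the task list.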

If our implementation plan is large (we recommend building smaller features that are easy to understand and review), it may be beneficial to implement the tasks in phases. When implementing incrementally, we can confirm that each phase has been implemented correctly, and we can make changes early if we detect that something is being implemented differently than we expected. This makes it easier to build on top of correct features, rather than building everything and finding out at the end that it was built on the wrong foundation.

While this phased approach goes against the philosophy of fully automated end-to-end engineering without a human in the loop, it lets us confirm that the correct implementation is being followed at every step of the process.

The implementation prompt can be as simple as:

Implement all of the tasks in the task document

You can change it depending on whether you are implementing all tasks in one go, or one at a time.

Setting up a local development environment

Any time you are building software you need to be able to validate it before it goes into production. At the very least, for any software written, you need to check that it actually runs, otherwise the program won’t work and deploying it will be pointless.

There are many ways you can validate your projects. Depending on the application and the team, you may have things running locally. This is the fastest way to get feedback, because nothing has to be deployed before you can run it.

Other times, teams run validations in their CI/CD pipelines, or in an external dev or QA environment. This is also valid, usually because cloud deployments can be easier to test in a dev environment deployed in the cloud than on your local system.

The biggest issue with testing in a higher environment like dev or QA is that we normally have to wait for our code to be deployed. This can take anywhere from a few minutes to hours, depending on the application, which makes collecting feedback very slow.

Hierarchy of testing

  • Locally - basic checks to ensure code runs. Basic unit testing and local servers
  • Deployed dev environment - useful when our application has other dependencies that are also deployed in the cloud
  • QA or preprod environment - similar to the dev environment, except that it is much closer to the prod environment. In dev we usually have much broader permissions to try things out.
  • Prod environment - user facing environment. Only code thoroughly tested in the lower environments actually makes it to prod.

The most reliable and efficient way to build code is to run as much testing as possible in our local environment. If we can make our local environment as capable as the deployed dev environment, so that the application can run and connect to its dependencies (databases, API gateways and so on), we will be well placed to build an agentic engineering setup with a strong feedback loop.

Each project is different and will have different checks you want to run. But at a high level, these are the things you want to be able to run in your local environment.

  • Unit tests: all the tests should be runnable with a single command. If we have tests that depend on external systems, we should have all the necessary mocks.
  • End-to-end tests: scripts that test the functionality end to end. For example, if we have an API that sits behind an API gateway and connects to a database, we should be able to run an end-to-end test through the API gateway and the API service, all the way to the database and back. No mocks.
  • Manual checks: we should be able to run the application locally to carry out any manual checks. If our application needs debugging, we should be able to easily run it locally and debug it.
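As a rough illustration of the point about mocks, here is a minimal sketch of a unit test that replaces an external service with a mock so it can run anywhere with no network access. `customer_exists` and the injected `http_get` callable are hypothetical names invented for this example; the `/customers/{id}` path is borrowed from the task list earlier in the chapter.

```python
from unittest.mock import Mock


def customer_exists(customer_id: str, http_get) -> bool:
    """Return True when the customers service answers 200 for this id.

    `http_get` is injected so that tests can swap the real HTTP call
    for a mock (a hypothetical design chosen for this sketch).
    """
    response = http_get(f"/customers/{customer_id}")
    return response.status_code == 200


def test_customer_exists_when_service_returns_200():
    fake_get = Mock(return_value=Mock(status_code=200))
    assert customer_exists("c-1", fake_get) is True
    fake_get.assert_called_once_with("/customers/c-1")


def test_customer_missing_when_service_returns_404():
    fake_get = Mock(return_value=Mock(status_code=404))
    assert customer_exists("c-404", fake_get) is False
```

Because nothing here touches a real service, the agent can run these tests as often as it likes, with a single `pytest` command, and get feedback in seconds.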

Anything that we can’t run locally, and that we have to deploy to higher environments to test, makes it harder for the agent to gather the feedback it needs to implement features correctly. The less feedback it can get, the less reliable the final implementation will be, and the more manual checks you, the developer, will have to run.

Creating a validation loop for your project

If we want the agent to develop features end to end with minimal developer interaction, we need to set up a feedback loop. A feedback loop gives the agent information about its implementation that tells it whether it has built the correct solution. Based on that feedback, it can go back and improve the solution further.

Setting up the right feedback loop for our project can be challenging. A good exercise for working out what feedback is needed is to think of everything you check before you open a PR. These can be things like code readability, unit tests, end-to-end tests, manual tests, UI rendering and so on. The steps you go through will depend on the projects you work on. A software engineer will have a different process from a data engineer who builds pipelines.

Once you have identified everything that gives you feedback when you implement a new feature, write it down. Next, for each item, find out whether the agent already has access to it. For example, can you run the tests locally, or do you need to test in a lower environment? Does the agent have access to that environment? If it does not, you need to figure out how to give the agent access to that system so that it can run tests and gather feedback.

Ideally, the agent should be able to run everything on your list locally. However, that is not feasible for some projects or tasks. For example, data engineers may need to test pipelines in the data warehouse because the data is too large for local environments. In that case, the best option is to give the agent access to command line tools and MCP servers so that it can run its own tests in the external system.
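One lightweight way to do this exercise is to keep the list as data and let a few lines of code point out the gaps. The sketch below is purely illustrative: `FeedbackSource`, `gaps` and the three inventory entries are made up for this example and should be replaced with the checks from your own project.

```python
from dataclasses import dataclass


@dataclass
class FeedbackSource:
    name: str
    where: str           # "local", "dev", "qa", ...
    agent_can_run: bool  # can the agent trigger it and read the result?


def gaps(sources: list[FeedbackSource]) -> list[str]:
    """Names of the checks the agent cannot run on its own yet."""
    return [s.name for s in sources if not s.agent_can_run]


# Hypothetical inventory for a small web application.
inventory = [
    FeedbackSource("unit tests", "local", True),
    FeedbackSource("end-to-end tests", "dev", False),
    FeedbackSource("UI rendering", "local", False),
]

print(gaps(inventory))  # ['end-to-end tests', 'UI rendering']
```

Every name this prints is a place where a human is still acting as the feedback loop, and a candidate for a local setup, a CLI or an MCP server.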

The best feedback loop contains as many elements of feedback as you would use as a developer to confirm that your pull request is ready to be reviewed and shipped. If any of these elements is missing, the agent cannot validate or optimise towards the right solution in that area. For example, if you are building a web application and the agent can write code, run unit tests and run end-to-end tests, but cannot verify how features render in the user interface, it won’t be able to find or fix issues in the interface, and a human will have to step in.

If you want truly autonomous agents that require very little human interaction, you need to identify all of the tests and processes that must pass, along with their passing standards, and make sure the agent has access to them so that it can gather its own feedback and deliver the best final solution. Failing to provide the right feedback tools, or providing tools that give the wrong feedback, results in agents that deliver incomplete or inaccurate solutions. Those solutions require human intervention, which defeats the purpose of fully automated agents.

[We need some sort of visualisation]

Give the agent more context with MCPs and CLIs

Scoping your environments safely

Sandbox environments

Why sandbox your agent

Running agents in Docker containers

Running agents in VMs

Managed sandbox tools

Choosing the right approach for your team

Manual edits vs Fixing Forward (when your specs fail)

Sometimes, after implementation is finished, you notice that something was built incorrectly or not built at all. This is usually a symptom of one of two things: a poorly defined spec (the agent solved the wrong problem), or a weak feedback loop (the agent never got signal that something was wrong).

When this happens, there is a temptation to start firing off quick fixes. “Just change this bit”, “actually make it do this instead”. This is where spec driven development quietly becomes vibe coding. Each prompt narrows the context a little more, the model loses track of the bigger picture, and the quality of the output degrades with each exchange.

A useful rule of thumb: the fewer prompts it takes to implement something, the better the result. Quality and prompt count tend to move in opposite directions.

%%{init: { 'theme': 'base', 'themeVariables': { 'xyChart': { 'plotColorPalette': '#e11d48' } } }}%%
xychart-beta
    title "Code Quality vs Number of Prompts"
    x-axis "Number of Prompts" ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
    y-axis "Code Quality (%)" 0 --> 100
    line [95, 80, 60, 40, 25, 15, 10, 7, 5, 5]

When you spot issues after implementation, you have two options depending on the severity.

Manual edits

If the issues are small and localised, a missed edge case or a minor behaviour that’s slightly off, the pragmatic move is to review the code yourself and fix it by hand. This is faster than re-running the full workflow for something trivial, and it keeps you close to the code.

Fixing forward

If the issues are more significant, resist the urge to prompt your way out of them. Instead, treat it as a new iteration. Write a new spec that clearly describes the gap or the incorrect behaviour, go through the full spec → plan → tasks workflow, and implement the fix cleanly in one pass.

The key principle here is that spec, plan, and tasks files are immutable once you have moved past them. You do not go back and edit an earlier document mid-flight, as that cascades changes unpredictably across the whole workflow. Once you are happy with a phase and move forward, it is locked. If something is missing, you either finish what you have and create a new spec for the next iteration, or you abandon the current cycle and start over from the spec.

Fixing forward keeps the problem well-defined, gives the agent a clean starting point, and avoids the compounding drift that comes from patching a half-finished implementation.

Ownership and the pull request

With a solid spec, a good feedback loop, and a capable agent, you can go from spec to pull request without writing a single line of code yourself. The agent builds it, tests it, and validates it end to end.

But this only gets you 95% of the way there.

No matter how detailed your spec is, the agent can and will misinterpret something. It can be small things, but they still need to be caught and corrected. That last 5% is where you come in, and it is arguably the most important part of the whole process.

The agent writes the code. You own it. There is a difference.

Owning the code means you can vouch for every line that goes to production. If someone asks you why a particular approach was taken, you should be able to answer. After all, it was you who defined the specs and made the decisions. “The LLM wrote that part” is not an acceptable answer in a production system, and it is not fair to your reviewers to hand them code you have not personally understood. It is not the reviewer’s responsibility to decipher code you have not checked.

This is one of the biggest shifts in agentic development. That accountability is what separates professional engineering from vibe coding.

Do this:

  • Go through every change before raising the PR. Read the diff. Understand what was built and why.
  • Check that the implementation matches the intent of the spec, not just that the code runs or the tests pass.
  • Be able to explain any piece of the code if someone challenges it in review. If you cannot, that is a red flag.
  • Keep PRs small. The smaller the change, the easier it is to review thoroughly and the easier it is for your colleagues to give useful feedback.

Do not do this:

  • Do not raise a PR for code you have not personally read. It is not the reviewer’s job to be the gatekeeper of code.
  • Do not assume passing tests means the feature is correct. Tests validate what was written, not what was intended.
  • Do not trust the agent to catch its own security issues. Agents will suggest things that are common but not necessarily secure. Long-lived tokens, overly broad permissions, accidentally committed .env files. These get through if you are not paying attention.
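A small script can support this review without replacing it. The sketch below flags `.env`-style files in a list of changed paths, which you might obtain from something like `git diff --name-only`. `risky_paths` is a hypothetical helper, and the rule it applies (any file named `.env` or starting with `.env.`) is an assumption that you should adapt to your own project.

```python
from pathlib import PurePosixPath


def risky_paths(changed_files: list[str]) -> list[str]:
    """Return the changed paths that look like environment files."""
    flagged = []
    for path in changed_files:
        name = PurePosixPath(path).name
        # Assumed rule: ".env" itself and variants such as ".env.local".
        if name == ".env" or name.startswith(".env."):
            flagged.append(path)
    return flagged


changed = ["src/app.py", ".env", "config/.env.local", "docs/env.md"]
print(risky_paths(changed))  # ['.env', 'config/.env.local']
```

This rule also flags template files such as `.env.example`, which errs on the side of caution. A check like this only catches one narrow class of mistake, so reading the diff yourself remains the real safeguard.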

Resources