💻Video + Code Examples·8 mins

Observability & The Feedback Signal

Nilay Parikh

Make the CleanLoop loop observable. This lesson shows where run history, strategy snapshots, row-level traces, and dashboard metrics come from so you can tell whether a mutation actually taught the system anything.

Transcript

Instructor:Think of your system like a car dashboard. If you don't see speed, fuel, warnings, you're still moving, but you're moving blind. That's what mutation loops look like without observability. It doesn't improve. It just keeps changing. You need a surface that tells you which difference matters, which pattern keeps failing, and when a human should step in. This lesson builds that surface. This is the difference between engineering and guesswork.

Instructor:Think of this whole course as one system, or one big example. Each lesson adds something new. But the earlier ones are still running underneath. So if you need to brush up on those earlier concepts, just click the YouTube card in the top right corner to jump back to the playlist. In earlier lessons, we separated reader, proposal, and crucible inside the orchestrator. AutoGen, the framework we use, seems useful only if we can inspect the artifacts it produces. We are not starting over here. We are

Instructor:just adding observability to receive better feedback signals, so keep the full system in mind. Let's understand the memory of the loop using this simple architecture diagram. It combines run history, strategy state, live evidence, and operator control. Run history should always answer these three questions: Is the loop improving? Is it repeating the same mistake? Or is it blocking? If it cannot explain the current run, then it likely has no value. That is the core

Instructor:of observability. You must see what changed, which metrics moved, and what decision followed. That's the linkage, and that's a real feedback signal. Even in an autonomous agent role, you should be able to continue, reset, and intervene. Repeated failures often signal missing knowledge, not just missing effort. Observability, in my view, is not just a dashboard. It is a combination of history, agents, and control. Read the artifact like an operator. Don't ask

Instructor:just what happened once. Ask what happened across the last three runs. What patterns do you see, and what do you do next? This is the skill that this particular lesson builds. So let's go to the hands-on lab. Here we are in Visual Studio Code. If you haven't checked out our example, then please find the GitHub link in the description below. You can start with observability feedback. As in every other lesson, we have this markdown file on where to

Instructor:start. It has some catch-up on earlier parts. You can also catch up on the first three lessons if you want. But these are the four important points that I want to discuss. First, always treat observability like an external memory. It's one of the most valuable data sets that you will capture. It will help you make your autonomous decisions better and better every time, most importantly those mutations and small decisions. So you can let the mutation surface grow and your pipeline become

Instructor:broader and broader. It will then be able to handle far more edge cases than when we started, and that's the real use case of observability as external memory. Always remember, the score and the traces are two different things, and they answer two different questions. The score answers, did it improve? It's just a metric. It gives you the holistic view. But the traces will actually tell you what happened to this particular

Instructor:use case, what happened to this particular row, or what happened to this particular application case. For example, why are row-level traces important? Let's say I'm applying for a loan, and an autonomous system such as this processes that decision, with multiple agents spread across a distributed architecture. When the application is onboarded, how will each and every system know the process scope? It's a business process scope. It's

Instructor:not a system process, so a system cannot retrieve it on its own. So the very first onboarding system can generate a correlation ID that travels across all the systems. Whenever we want to retrieve that information or those observable points, we can retrieve them using the same correlation ID. The reason is that an application can be processed two or three times. We want to see each and every atomic process and how it works, and that's why this is important. Missing

Instructor:artifacts are also very important feedback that you should not overlook. I have added some code anchors, and I strongly recommend you visit them. Most of the observability in this example uses OpenTelemetry, but we are not using a back end such as Grafana or Prometheus because we don't want to put an extra burden on learners. So we are just using simple JSON files for persistence. However, in production use cases,

Instructor:someone might use OpenTelemetry collectors, or they might use a Grafana stack, a Datadog stack, or any similar proprietary or open-source stack. The reason is that it provides better control for production systems. However, for this example, JSON files are sufficient. They store the same design and the same attributes, and we'll walk through that in the dashboard itself. So it stores everything here by default,

Instructor:and we can actually understand what is happening there as well. So let's go back to our code. Yes. So you can always walk through everything that is essential for this trace recorder. You can see and try to observe how we are actually building the trace runs, run IDs, and the correlation ID. That's actually the understanding that we need to build across this whole ensemble. Correlation IDs are absolutely essential and one of the keystones

Instructor:and key skills when we build any observability platform. So let's go back to our documentation. You can run the dashboard; I'm already running it and have it open. In this dashboard I have done a very simple thing. As you can see, it shows how many runs we have processed. The first page, operator signals, gives you the overall health of this particular application. The score timeline will give you

Instructor:the historical comparisons. The run blueprint gives you the mutation surface: how many mutation surfaces we have and what changes we made to them. For example, here the deterministic mutation has been added. Then it added the mutation playbook, which is basically the response from the LLM. The deterministic path processes rows according to the rules that we have already defined. Beyond that point, if rows do not succeed under those deterministic criteria,

Instructor:then the mutation will be applied here. Why is this important? Because every time the code changes, one of the essential things that you want to check, not just for observability but also for compliance, is what your mutated code looked like. Was it dangerous? Was it stable? Was it accurate? Was it biased? All those aspects are essential to understand, and more importantly, what edge cases it left, and how

Instructor:can I improve this mutation by providing more information and more context to LLMs? So when the runtime mutation happens, we can cover more and more edge cases. These are the answers that the dashboard tabs provide. Each dashboard tab provides an individual answer. For example, here it shows what happened to this particular invoice that we are talking about. Let me go and find KNG 209, and you can see it has

Instructor:identified it as a category escalated coming from a deterministic row, and it was processed pretty much, and what its input and sidecar rows looked like. It pretty much gives us the clear understanding of what happened. Now let's see 214, which is basically a mutation play. So we can see it wasn't able to process the unmapped token amount. That was the reason why we needed to run the mutation, and it tells exactly what we did with it. And then it also tells the input

Instructor:and the sidecar rows, which is pretty awesome because it comes from two different places, blanks and failures, and then how we managed it. We can also see the trace timeline here. So that's C214 clean data hotel records and its trace IDs as a correlation ID, and what was the score delta. The same thing from an observability point of view, the broader observability, what actually happened. The overall test spans across the runs and execution logs and diagnostics. Well, it's a

Instructor:quite detailed dashboard, and I don't want to spend 30 minutes walking through it. It's very easy to run this dashboard, so I would like you to walk through it offline in as much detail as possible. Now let's go back to the hands-on exercises. These are very important exercises. They will help you improve your understanding of dashboards and this particular example. They will also help you in real-world scenarios, when you

Instructor:want to add more stores or more metrics, and need to decide how to deal with these kinds of projects and Software 3.0. So make sure you try the exercises, and please make sure that you do the exercises for each and every lesson that we complete. Now let's go back to our presentation and find out what's left there. Here we are back in the presentation. Now the loop has a memory that you can read, control, and justify. That gives you the

Instructor:real feedback signal instead of blind repetition. Next, we will increase the pressure so the loop doesn't get too comfortable. And that's all for this lesson. Please make sure you subscribe and also press the notification bell icon, so that when we release the next lesson, it comes straight to your timeline and notifications. I'll see you in the next lesson. Take care.

Generate and Inspect the Feedback Surface

Write one bounded history artifact

Run one loop iteration so CleanLoop exports the history, strategy, and trace artifacts that the rest of the lesson depends on.

bash

cd _examples/self-improving-agent/cleanloop
python util.py reset
python util.py loop --max-iterations 1

Inspect the saved artifacts directly

Open the history and trace files before you rely on the dashboard so you know exactly which evidence the UI is reading.

bash

code .output/finance_eval_history.json .output/finance_strategy.json .output/traces/row-decisions.jsonl
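
If you prefer to check the artifacts from a script instead of the editor, a minimal sketch like the one below may help. It only assumes the three file paths shown above; the schema of each file is not documented in this lesson, so the helpers `parse_jsonl` and `describe` (both hypothetical names) report the shape of the data without assuming any particular fields.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of text lines, one JSON value per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def describe(value):
    """Summarize the shape of a parsed JSON value without assuming a schema."""
    if isinstance(value, dict):
        return f"object with keys {sorted(value)}"
    if isinstance(value, list):
        return f"list of {len(value)} items"
    return type(value).__name__


if __name__ == "__main__":
    out = Path(".output")  # run from the cleanloop directory
    for name in ("finance_eval_history.json", "finance_strategy.json"):
        path = out / name
        if path.exists():
            data = json.loads(path.read_text(encoding="utf-8"))
            print(name, "->", describe(data))
        else:
            print(name, "-> missing")

    trace_path = out / "traces" / "row-decisions.jsonl"
    if trace_path.exists():
        with trace_path.open(encoding="utf-8") as handle:
            rows = parse_jsonl(handle)
        print(trace_path.name, "->", len(rows), "records")
        if rows:
            print("first record:", describe(rows[0]))
    else:
        print(trace_path.name, "-> missing")
```

Printing only the keys first is a deliberate choice: it shows which evidence the dashboard can possibly read before you start interpreting values.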

Review the dashboard like an operator

Launch the Streamlit dashboard and connect score movement, invoice-level traces, and mutation evidence into one review surface.

bash

python util.py dashboard
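
The operator questions from the video (is the loop improving, repeating the same mistake, or blocking?) can be sketched as a small check over the last three runs. This is an illustration only: the `score` and `failure_reason` fields and the `review_recent_runs` function are assumptions made for the sketch, not the actual layout of the CleanLoop history file.

```python
def review_recent_runs(runs, window=3):
    """Classify the most recent runs using hypothetical 'score' and
    'failure_reason' fields (assumed names, not the real schema)."""
    recent = runs[-window:]
    if len(recent) < 2:
        return "not enough history"

    scores = [run["score"] for run in recent]
    reasons = [run.get("failure_reason") for run in recent]

    # Improving: the newest score beats the oldest score in the window.
    if scores[-1] > scores[0]:
        return "improving"

    # Repeating: every run in the window failed for the same stated reason.
    if reasons[-1] is not None and all(r == reasons[-1] for r in reasons):
        return "repeating the same failure"

    # Otherwise the loop is changing without moving the score.
    return "stalled"


if __name__ == "__main__":
    history = [
        {"score": 0.70, "failure_reason": "unmapped amount token"},
        {"score": 0.70, "failure_reason": "unmapped amount token"},
        {"score": 0.70, "failure_reason": "unmapped amount token"},
    ]
    print(review_recent_runs(history))
```

A "repeating the same failure" result is the case where an operator would consider stepping in, since the transcript notes that repeated failures often signal missing knowledge rather than missing effort.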

Complete source code for this lesson.

github.com/nilayparikh/tuts-agentic-ai-examples/tree/main/self-improving-agent/cleanloop

Q & A

Why does this lesson separate score from trace instead of treating them as one metric surface?

Because they answer different engineering questions. The score tells you whether the run improved overall, while the trace tells you what actually happened to one row, proposal, or failure path. You need both to trust the loop.
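
A small sketch can make the split concrete. The trace records below are invented for illustration (the row IDs, field names, and the `run_score` and `explain_row` helpers are all assumptions), but they show how one aggregate number and one per-row lookup answer different questions from the same data.

```python
# Hypothetical row-level trace records; the real CleanLoop fields may differ.
TRACES = [
    {"row_id": "INV-001", "path": "deterministic", "status": "cleaned", "reason": None},
    {"row_id": "INV-002", "path": "mutation", "status": "cleaned", "reason": None},
    {"row_id": "INV-003", "path": "mutation", "status": "failed",
     "reason": "unmapped amount token"},
    {"row_id": "INV-004", "path": "deterministic", "status": "cleaned", "reason": None},
]


def run_score(traces):
    """Score: one number for the whole run (fraction of rows cleaned)."""
    if not traces:
        return 0.0
    cleaned = sum(1 for t in traces if t["status"] == "cleaned")
    return cleaned / len(traces)


def explain_row(traces, row_id):
    """Trace: every recorded decision for one specific row."""
    return [t for t in traces if t["row_id"] == row_id]


if __name__ == "__main__":
    print("score:", run_score(TRACES))
    for record in explain_row(TRACES, "INV-003"):
        print("trace:", record["path"], record["status"], record["reason"])
```

The score alone says three of four rows were cleaned; only the trace says which row failed and why.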

Why spend time on correlation IDs in a lesson about a local CleanLoop example?

Because the same reasoning scales to distributed systems. Correlation IDs let you connect one business event across multiple components, which is exactly how you keep agentic systems legible once they spread beyond one file or one process.
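
As a rough sketch of the idea, the first component mints one ID per business event and every later stage attaches it to what it records. The event fields and the `new_correlation_id`, `record_event`, and `events_for` helpers are hypothetical, and an in-memory list of JSON lines stands in for the JSON files the lesson uses for persistence; the stage names are borrowed from the reader, proposal, and crucible separation mentioned in the video.

```python
import json
import uuid
from datetime import datetime, timezone


def new_correlation_id():
    """Mint one ID at onboarding; every later stage reuses it."""
    return uuid.uuid4().hex


def record_event(sink, correlation_id, stage, detail):
    """Append one JSON line to the sink (a plain list in this sketch)."""
    event = {
        "correlation_id": correlation_id,
        "stage": stage,
        "detail": detail,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    sink.append(json.dumps(event))
    return event


def events_for(sink, correlation_id):
    """Retrieve everything recorded for one business event, in order."""
    parsed = (json.loads(line) for line in sink)
    return [event for event in parsed if event["correlation_id"] == correlation_id]


if __name__ == "__main__":
    sink = []
    first = new_correlation_id()
    second = new_correlation_id()

    # Two business events interleave across the same stages.
    record_event(sink, first, "reader", "row loaded")
    record_event(sink, second, "reader", "row loaded")
    record_event(sink, first, "proposal", "mutation suggested")
    record_event(sink, first, "crucible", "mutation accepted")

    for event in events_for(sink, first):
        print(event["stage"], "-", event["detail"])
```

Because the ID is minted once and never regenerated, an application that is processed two or three times can be told apart by giving each pass its own ID while still filtering the shared log the same way.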

Why are missing artifacts considered feedback in this lesson?

Because absence is a signal. If the history, strategy, or trace files never appear, that tells you the run never reached the stage you expected. Observability should help when the system fails early, not only when the happy path works.
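
One way to act on that signal is to check for the expected files before opening the dashboard. The sketch below uses the three artifact paths from the hands-on steps; the `missing_artifacts` helper is a hypothetical name, and the lesson does not say which pipeline stage writes which file, so the check only reports what is absent.

```python
from pathlib import Path

# Artifact paths taken from the hands-on steps, relative to the output folder.
EXPECTED_ARTIFACTS = [
    "finance_eval_history.json",
    "finance_strategy.json",
    "traces/row-decisions.jsonl",
]


def missing_artifacts(output_dir=".output", expected=EXPECTED_ARTIFACTS):
    """Return the expected artifacts that do not exist under output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Run stopped early or wrote elsewhere; missing:")
        for name in missing:
            print("  -", name)
    else:
        print("All expected artifacts are present.")
```

Running this right after `python util.py loop --max-iterations 1` gives a quick answer to whether the run reached the point of exporting its evidence.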