💻Video + Code Examples·8 mins

The Judge & Self-Challenging Loops

Nilay Parikh

Keep the judge fixed while the data gets harder. This lesson shows how CleanLoop generates adversarial finance CSVs, applies targeted pressure, and forces the loop to improve without redefining correctness.

Transcript

Instructor:Think of your system like a gym. If you only lift light weights, you don't get stronger. You just get comfortable. That's exactly what happens when you loop-train on easy cases. So, we introduce pressure. We add a fixed judge and a smart challenger. One defines the truth, the other raises the difficulty. This lesson shows how to build that tension without letting the system cheat. And hey, if you're serious about building real AI systems like this, make sure you subscribe.

Instructor:This course will get more powerful, more layered, and more interesting. And we have planned more courses like this as well. I want you to think of everything we have built so far as one example, as one real-life application. Each lesson has added a component and capability. But nothing got replaced. It's all running underneath. If you need to revisit any part, check the playlist linked in the top right corner appearing right now, or you can actually go to the GitHub example repo where

Instructor:all the lesson links have been provided. I will reference it again definitely before the lab. In the last session, we built observability that gave us run history, live artifacts, and a feedback surface. Now we are not guessing anymore. We know where the system is weak. So here is the shift. We are not improving randomly. We are applying targeted pressure based on observed weakness. We have seen on a row-by-row basis which cases failed, where mutation is absolutely needed. Same loop, no

Instructor:new force. So don't think this is a new system. This system has all components in place as they were. Observability already existed. Mutation surface already existed, and evolution was already there. We are just making the loop harder to satisfy. I like to think of this like an arena. A judge is a referee. An executor is the current champion. A challenger keeps sending stronger opponents. And the system only improves as long as the champion survives through the fight. This is critical. If

Instructor:the judge changes, you're moving the goalposts. That's the system cheating itself. Correctness must stay constant. This is where most people are likely to get it wrong. A bad challenger equals random noise, and a good challenger equals surgical pressure. It looks at easy wins, repeated patterns, and weak edge cases, and then creates a harder version of exactly the same scenarios. Let me share one of my personal experiences. When we built systems as data engineers for our own organization, we had

Instructor:50 different versions of these different feed-processing data jobs, and trust me, most of them were revised at least 25 to 30 times before they got to 99% accuracy. So it is very important to take smaller steps and let the judge and challenger keep fighting against each other and build the best-use case scenario. If your system just fails more, you don't improve anything. You just stress it, and that's not what we want. For any mutation,

Instructor:adversarial development is a feedback loop that keeps going and improving the signal with better scores every time we process that loop. That's why I always say, without plateau-breaking, there is no climb and no real learning. So lock this in: self-challenging works only if the judge stays fixed and the challenger is intelligent. We'll see hands-on how we build both synthetic and natural challengers. The executor is forced to adapt, and if you miss one of these,

Instructor:the whole loop collapses. Before we jump in, grab the GitHub repo. You will find the link below in the description. We are going to trace this live. Now, here is your four-minute lesson. Open the CleanLoop challenger path. Don't just read the code. Think like an operator. Ask yourself: Where are the harder fixtures generated? What signal is used to detect the weaknesses? How is the judge protected from being modified? That way we cover challenge generation, execution, and evaluation.

Instructor:This key insight should stick: the system increases difficulty without touching correctness, and that's the safety mechanism. Once you see that, you understand self-challenging. If this is helping you think differently, then press the notification bell so you don't miss the next lesson or any new courses we release. So let's go to the hands-on lab now. So we are in the hands-on lab, and as usual we will start with the chart here. You can see the review is the same

Instructor:old executor-challenger arena, and then it shows the flow here. Now the four important theories that I would like you to remember again: judge and challenger are not the same tool. Always remember that one is there to validate, and the other is in the arena to test the system itself. Fixed selection pressure is what makes the improvement meaningful. Never try to throw random permutations or experiments. Yes, there are good cases for random experimentation as well, toward the end, to

Instructor:identify the edges and the boundaries. But when we are progressing toward better maturity, fixed selection pressure with accurate, pinpoint improvements over the data is always going to lead to better decisions. A good challenge is always targeted and not random, which is what we already discussed with point number two. Self-challenging creates the curriculum pressure. So what does curriculum pressure mean? We identify subdomain challenges. For example, this is

Instructor:aligned data, incorrect data, and all sorts of things, and all that curriculum we define using various computation playbooks, and therefore we start with the basic curriculum, and then we progress toward the higher end of the curriculum. That will also overlap with number two and number three we discussed, the fixed selection pressure and a good challenger. Now I would leave this for you to read. It's a very detailed document to read. We will see here how this work evaluation

Instructor:endpoint, that everything is provided here. So you will be able to understand how we are actually working on those. Up there now is the binary check registry. So whenever we are building that result set, we actually build the binary tree that helps us build the complex decision matrix. Now other than that, the important thing is the difficulty ladder. Now here I have created a five-level difficulty curriculum, and you can see here what it does. Mild

Instructor:finance messiness is just moving, breaking ISO, voids, status spelling, and whatever it is, but it's a very mild messiness. Obviously this is not going to pass the deterministic path, but it's still mild. Moderate finance messiness uses mixed date formats such as DD/MM/YY, and all sorts of stuff: the currency symbol, the currency code, and everything, which is slightly more difficult than number one. Then number three is hard finance dispute invoices, free trial, complimentary, you know, all sorts

Instructor:of stuff that probably makes your system more complex to resolve, because it's a business context that needs to be understood, and very hard negative reversals. These are the hard positive and hard negative cases. Four and five are really hard positive and hard negative cases. If you are attempting number four and number five, make sure you have a powerful LLM to solve them. Most likely a lightweight LLM won't be able to solve number four and number five. Also, to prepare for number four and five,

Instructor:we need to have a stronger hint mechanism, the skill mechanism, and prompt mutation as well, which I'm not going to cover in this particular use case. So we are going to leave four and five outside the scope as of today. But if you want to expand this example into prompt mutation and other areas, you can certainly use number four and number five to check how far you can stress this example. Now let's go back to our preview again. I think all of these things

Instructor:are important. Please go and validate them. Also read in this order. It will help you understand the flow that we discussed. The run is very simple. All I'll do is I'll go into CleanLoop. Currently, I think the dashboard is running. So I'm going to stop the dashboard. Okay. And let's check the status. When we check the status, I think the status is pretty fine. Absolutely. We can see we got adjustments and finance invoices, and then challenger files. We will actually create

Instructor:the challenger files ourselves as well. So let's go and delete those challenger files from input first. So let's just delete these challenger files. Now what we want to do is we already need to probably get this from there, so let it step. I'll do that in the end, don't worry, but I'll probably use this command, which is better. So I've deleted all the challenge manifests and level five. Then I'm going to

Instructor:create the one, two, and three levels that we discussed. So let's see how it works. So you can actually choose the levels you want. I have selected three adversarial levels, one, two, and three. And by the way, they all are based on the levels that we saw, difficulty levels one, two, and three. And I said for four and five, we most likely would need some heavier LLMs, which we are not using in this case. We're just using simple Phi. These one,

Instructor:two, and three Phi models would be able to handle pretty much everything easily. So we've got all the adversarial files, and this is synthetic, by the way. So we create the synthetic data to challenge, and this is also a very powerful way of training your own loop. So now let's go back and train our loop as well. And then evaluate. So by evaluating we will know what is working and what is not working. So we can see here while evaluating, obviously, the expected and unexpected cases which are not fixed. So that's

Instructor:why it's 13 by 4. We cannot fix everything, but we can fix many of them, and we can see there are still 85 needing mutation, still unresolved after mutation zero. So now all we're going to do is run the loop. By the way, the evaluate command will only do the dry run. It won't do the full run. And now the loop is actually running and generating more data for us. It is right now generating a mutation request, and as it anticipated, mutation needed 65 and

Instructor:unresolved after mutation zero. So it was pretty much successful, and if you can see here we got very much here, and these are the mutations we recorded, and these all were generated after the deterministic pass, obviously, because the deterministic path wasn't able to complete them. So let's go and now validate our dashboard. I actually need to also run the dashboard command. Observe is the same as dashboard just in case you want to reprint what the last run is, but observe will just

Instructor:reprint exactly what the last run is. And this is now 8501. So I'm just going to refresh it and even try it two ways. This is the last run, or this last turn on top. But you can also select the current artifact for the last run, and you can see what we processed. It will show you everything that we have produced as output, and it will also show us in data quality what we actually produced. So we can see how many were processed deterministically. Adversarial

Instructor:mutation playbook, how the adversarial mutation playbooks have been processed. We can see what the anomaly reasons were, like impossible dates on these invoices, and it will give you very much detail as a report about what happened, and how, and where things got stopped, with very, very detailed observability as well. So yeah, you can actually see it on every run. So these are all different runs of invoice INV-005, and this is how we can actually compare whether the invoice

Instructor:INV-005 failed before, and whether the invoice worked this time, and that's why what we previously discussed in observability is important. So this is the overall idea which I give you for this judge self-challenging lesson. And I think I have messed up between number four and number five, which I have to fix anyway. But I will fix that in GitHub so you guys have direct access to that. But other than those commands which I misplaced in

Instructor:the exercise, it's pretty important, guys. From here what we are looking at is something more serious. Earlier we just saw that we send the data to the LLM and get the mutation surface. But now it's a point of how we can actually improve our mutation surface, how we can self-generate those hints, how we can self-generate direction per prompts, and a lot more things. There are other areas which I would say, when you're talking about the judge, challenger, and arena, there are other

Instructor:areas which are not covered in this particular course. But keep an eye on my channel. I'm going to release a course on prompt mutation, and I'm going to release a course on behavioral learning. By the loops themselves, one maintains a behavioral skill, and it mutates the skill to keep the learning in long-term memory, and prompt mutation is also long-term to the operational memory perspective, and it also generates a lot of adversarial hints about how to handle

Instructor:different adversarial use cases. Both overlap, by the way. All three overlap. All three are different mutation surfaces. They all overlap, but with all three of them we can actually have a long-term behavioral understanding, and in judgment and training we have mutation surfaces by code, and we also have the mutation surface by prompt invocation. So these three will always make things better. All of my use cases, all of my examples, or the real-life implementations are

Instructor:basically with all these three. There are very few where I have actually gone to reinforcement learning. I have a couple of agents where we use reinforcement learning, but they are very advanced use cases and generally not needed for day-to-day purposes. But having said that, this is more important to understand from a theoretical and example point of view because you're picking up those things. But as I said, I'm going to have those two courses as well. And when

Instructor:we have all three courses, it will make a lot of sense together, and you will be able to make really advanced self-improving agents. It doesn't matter whether it's a data agent or a stock agent, or a trader, or whatever you would like to make. But with these three capabilities and abilities in your hand, you will be able to practically build any business-case agent as you wish, with complex workflows, with reinforcement learning as well. So that's good, and I would

Instructor:suggest that you go ahead and implement these exercises as well. It will help you understand the concept far better. Now let's go back to our presentation. Right, so now your loop isn't just running, it's under pressure, and as I said, the adversarial pressure, we keep pushing levels 1, 2, 3, 4, 5, and the judge stays fixed, and the challenger keeps pushing, and the executor has no choice but to improve. That's how the new system evolves. Next you go even further. And by the way, before

Instructor:I end this lesson, there are two different ways an evolution system can work. One is with human observation, and one is without human observation. And both are very powerful techniques. As I said, when we have all three, then we will also have a bonus lesson on human-observed and human-independent evolution processes. Both are very interesting topics to discuss as well. But next we go even further. We introduce best-of-N and re-ranking. You have heard re-ranking in,

Instructor:for example, RAGs where you have better candidates and you choose among those candidates. This is exactly the same concept, but we will choose between candidate mutations. We will choose the better mutation for use cases, and so the system can compare multiple candidates before committing to a mutation. Make sure you are subscribed so you don't miss that, because it's really getting interesting from here, and I would really love for you

Instructor:to continue this whole course, and I'll see you in the next lesson as well. Take care then. Bye-bye.

Generate Pressure Without Moving the Goalposts

Create one fresh adversarial arena

Clear the old challenger files and generate a fresh adversarial set so you know exactly which pressure level you are testing.

powershell

cd _examples/self-improving-agent/cleanloop
Remove-Item .input\adversarial_d*.csv -ErrorAction SilentlyContinue
python util.py challenge --levels 1 2 3
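
To make the idea of level-based synthetic challenges concrete, here is a minimal sketch of how an adversarial finance CSV could be produced from clean rows. It is not the code behind `util.py challenge`. The column names, the three corruption functions, and `make_challenge` are all hypothetical and only loosely follow the mild, moderate, and hard levels described in the lesson.

```python
import csv
import io

# Hypothetical corruptions, one per difficulty level (not the repo's actual rules).
def mild(row):
    # Level 1: cosmetic noise such as padded amounts and upper-case statuses.
    row["amount"] = f"  {row['amount']} "
    row["status"] = row["status"].upper()
    return row

def moderate(row):
    # Level 2: switch the ISO date to DD/MM/YYYY and prefix a currency symbol.
    year, month, day = row["date"].split("-")
    row["date"] = f"{day}/{month}/{year}"
    row["amount"] = "$" + row["amount"]
    return row

def hard(row):
    # Level 3: a business-context case, here a complimentary zero-value invoice.
    row["amount"] = "0.00"
    row["status"] = "complimentary"
    return row

CORRUPTIONS = {1: mild, 2: moderate, 3: hard}
FIELDS = ["invoice_id", "date", "amount", "status"]

def make_challenge(clean_rows, level):
    """Return CSV text in which every row is corrupted at the requested level."""
    corrupt = CORRUPTIONS[level]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in clean_rows:
        writer.writerow(corrupt(dict(row)))  # copy so the clean row stays untouched
    return buffer.getvalue()

clean = [{"invoice_id": "INV-001", "date": "2024-03-05",
          "amount": "120.50", "status": "paid"}]
level2_csv = make_challenge(clean, 2)
print(level2_csv)
```

Because the clean rows are copied before corruption, the same baseline can be reused for every level, which keeps the comparison between levels honest.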

Evaluate the same fixed judge

Run evaluation and one loop pass so you can compare how the unchanged referee responds to the harder data surface.

bash

python util.py evaluate
python util.py loop --max-iterations 1
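
The "unchanged referee" can be pictured as a read-only registry of binary checks. The sketch below is an assumption about the shape of such a judge, not CleanLoop's implementation: the check names, the allowed statuses, and the `judge` function are hypothetical.

```python
from datetime import datetime
from types import MappingProxyType

def has_iso_date(row):
    try:
        datetime.strptime(row["date"], "%Y-%m-%d")
        return True
    except ValueError:
        return False

def has_numeric_amount(row):
    try:
        float(row["amount"])
        return True
    except ValueError:
        return False

def has_known_status(row):
    # Hypothetical set of valid statuses for this sketch.
    return row["status"] in {"paid", "open", "void"}

# Read-only view: the loop can call the checks but cannot add or swap entries.
CHECKS = MappingProxyType({
    "iso_date": has_iso_date,
    "numeric_amount": has_numeric_amount,
    "known_status": has_known_status,
})

def judge(rows):
    """Score rows against the fixed checks; a row passes only if every check passes."""
    report = []
    for row in rows:
        results = {name: check(row) for name, check in CHECKS.items()}
        report.append({
            "invoice_id": row["invoice_id"],
            "passed": all(results.values()),
            "checks": results,
        })
    return report

rows = [
    {"invoice_id": "INV-001", "date": "2024-03-05", "amount": "120.50", "status": "paid"},
    {"invoice_id": "INV-002", "date": "05/03/2024", "amount": "$99.00", "status": "paid"},
]
report = judge(rows)
for entry in report:
    print(entry["invoice_id"], entry["passed"])
```

Wrapping the registry in `MappingProxyType` means an attempt to assign a new check raises `TypeError`, which is one simple way to keep correctness out of reach of the code being evaluated.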

Inspect challenge outcomes in the dashboard

Open the dashboard after the adversarial run and inspect which rows stayed deterministic, which required mutation, and how the judge reported the outcome.

bash

python util.py dashboard
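
The dashboard lets you check whether an invoice such as INV-005 failed in an earlier run and passed in the latest one. The same comparison can be sketched in a few lines. The `compare_runs` function and the dictionaries of per-invoice outcomes are hypothetical; the real run artifacts may be structured differently.

```python
def compare_runs(previous, current):
    """Classify each invoice by how its judge outcome changed between two runs."""
    changes = {}
    for invoice_id, passed_now in current.items():
        passed_before = previous.get(invoice_id)
        if passed_before is None:
            changes[invoice_id] = "new"
        elif passed_before and not passed_now:
            changes[invoice_id] = "regressed"
        elif not passed_before and passed_now:
            changes[invoice_id] = "fixed"
        else:
            changes[invoice_id] = "unchanged"
    return changes

# Hypothetical outcomes: True means the fixed judge accepted the cleaned row.
run_1 = {"INV-004": True, "INV-005": False}
run_2 = {"INV-004": True, "INV-005": True, "INV-006": False}
changes = compare_runs(run_1, run_2)
print(changes)
```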

Complete source code for this lesson.

github.com/nilayparikh/tuts-agentic-ai-examples/tree/main/self-improving-agent/cleanloop

Q&A

Why is the fixed judge the central rule in this lesson?

Because if the judge changes with the challenger, the system can no longer tell whether it truly improved. The lesson keeps correctness fixed so harder data increases pressure without redefining success.
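
As a rough illustration, a mutation can be gated on the score it earns under one and the same judge. The functions `pass_rate` and `accept_mutation`, the toy judge, and the two cleaners below are hypothetical and exist only to show the comparison.

```python
def pass_rate(judge, rows):
    """Fraction of rows the judge accepts."""
    return sum(1 for row in rows if judge(row)) / len(rows)

def accept_mutation(judge, challenge_rows, current_cleaner, candidate_cleaner):
    """Keep the candidate only if it scores strictly higher under the same judge."""
    current = pass_rate(judge, [current_cleaner(r) for r in challenge_rows])
    candidate = pass_rate(judge, [candidate_cleaner(r) for r in challenge_rows])
    return candidate > current

# Toy judge for this sketch: the amount must be a plain decimal number.
def amount_is_plain_number(row):
    return row["amount"].replace(".", "", 1).isdigit()

def identity_cleaner(row):
    return dict(row)

def strip_currency_cleaner(row):
    return {**row, "amount": row["amount"].lstrip("$")}

challenge = [{"amount": "$10.00"}, {"amount": "20.00"}]
accepted = accept_mutation(amount_is_plain_number, challenge,
                           identity_cleaner, strip_currency_cleaner)
print(accepted)
```

Both cleaners are scored by the identical `judge` argument on the identical rows, so a higher score can only come from better cleaning, never from a softer definition of success.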

What makes a good challenger in this lesson?

A good challenger creates realistic, finance-aware anomalies that target observed weaknesses. It should increase difficulty in a way the operator can still understand and debug, not just flood the system with random noise.
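
One way to turn "target observed weaknesses" into something mechanical is to split the next challenge set across anomaly types in proportion to how often each one failed. This is a sketch under that assumption. `plan_challenge` and the anomaly labels are hypothetical, apart from impossible dates, which the lesson shows as a reported anomaly reason.

```python
from collections import Counter

def plan_challenge(failure_reasons, total_rows):
    """Split a challenge budget across anomaly types in proportion to observed failures."""
    counts = Counter(failure_reasons)
    total = sum(counts.values())
    plan = {reason: (count * total_rows) // total for reason, count in counts.items()}
    # Rows lost to integer division go to the most frequent failure reasons.
    leftover = total_rows - sum(plan.values())
    for reason, _ in counts.most_common(leftover):
        plan[reason] += 1
    return plan

# Hypothetical failure log taken from a previous run.
observed = ["impossible_date"] * 6 + ["mixed_currency"] * 3 + ["status_typo"]
plan = plan_challenge(observed, 20)
print(plan)
```

A random challenger would spread rows evenly or arbitrarily. Here the anomaly that failed most often receives the most new rows, so the pressure lands where the loop is weakest.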

Why does the lesson spend time on curriculum pressure instead of only one adversarial example?

Because pressure should scale. The difficulty ladder lets the loop face mild, moderate, and harder cases in an intentional order, which is more useful than one-off chaos when you want the system to improve over time.
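
The ladder can be sketched as a simple promotion rule: stay on a level until the pass rate clears a threshold, then move up one rung. The 0.9 threshold, the function names, and the pass rates below are assumptions for illustration. The cap at level 3 mirrors the lesson leaving levels 4 and 5 out of scope.

```python
def next_level(current_level, pass_rate, threshold=0.9, max_level=3):
    """Move up one rung only after the loop clears the current level."""
    if pass_rate >= threshold and current_level < max_level:
        return current_level + 1
    return current_level

def run_curriculum(pass_rates_by_round, start_level=1):
    """Replay a series of pass rates and record which level each round was run at."""
    level = start_level
    history = []
    for rate in pass_rates_by_round:
        history.append(level)
        level = next_level(level, rate)
    return history, level

# Hypothetical pass rates reported by the fixed judge, one per round.
history, final_level = run_curriculum([0.6, 0.95, 0.8, 0.92, 0.99])
print(history, final_level)
```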