💻 Video + Code Examples · 8 mins

Test-Time Reranking

Nilay Parikh

Add best-of-N candidate search to CleanLoop. This lesson shows why reranking improves output quality only when each candidate stays bounded, isolated, and scored by the same fixed judge.

Transcript

Instructor:Think of your pipeline like hiring a candidate. If you only interview one person, you're gambling. But if you shortlist a few strong candidates and compare them side by side and then choose, you make a far better decision. That's exactly what test-time search does. Instead of trusting one proposal, you generate a small set, compare them, and pick the best one before the final evaluation. Now here is the trade-off. More options equal better quality, but also more cost and latency.

Instructor:I remember working on a more quantitative pipeline, where I was trying to introduce a map-reduce style layer for forward curve generation. The tricky part was that different seasonal patterns fit completely different algorithms. One model performed better in stable periods, while the other one, with its hyperparameters, handled volatility better. Initially I relied on a single-shot approach, but it was not getting us where we wanted to be. So later we let both algorithms run in

Instructor:parallel, and then, using the judge and a feedback reranker, we shortlisted the one most likely to perform best and generated the full forward curve from it. If you are building a real AI system, this is where things get really interesting. So one request: if you want to keep building systems like this with me, make sure you hit subscribe and press the notification icon. This course is designed to take you from basic to production-grade thinking. I want you to

Instructor:always think of this course as one big example. We are not stacking random ideas. We are building a pipeline where every piece keeps running underneath. In the previous lesson we increased the challenge pressure but kept the judge fixed. We are not replacing anything here. We are just extending the same mutation loop by adding a bounded search layer before the evaluation. So keep the system in your head: proposal, judge, and feedback. Now we are upgrading proposal from

Instructor:one guess to best-of-few, and that's a shift. Here is the mental model I use. One failure comes in, and we fan out into multiple candidates. We compare them and pick the one survivor. Then we send only that one forward. It's like a tournament. Instead of betting everything on one answer, you sample a few. This is especially useful when outputs are unstable, prompts are ambiguous, or quality varies run to run. But yeah, you're paying for it. More tokens, more money, more latency. This is the

Instructor:underrated part. The reranker doesn't ask, "Is this good?" It asks, "Which one is better?" That's a completely different kind of intelligence. And it's usually cheaper to run all the candidates in parallel and rerank them before choosing one than to let every candidate go through the full evaluation. This is engineering reality. There's no magic: you're paying a higher compute cost for higher reliability. You need to ask, is the improvement worth it? Let me explain with my previous example.

Instructor:With the forward curve, inefficiency in the curve caused a material trading loss, and that was a good enough reason for us to go for best-of-N. Not every use case will support or afford best-of-N; as I said, it's decided case by case. If it is worth the money, then it is worth the money. Let's zoom out again: small fan-out, cheap comparison, one survivor, controlled cost, and that's how you

Instructor:keep search practical. Before we jump into the lab: you will find the full implementation in the GitHub repo linked below in the description. Make sure you pull it before we walk through it together. Now let's get started. Once you have the repo and the CleanLoop implementation ready, the goal here isn't just to run it. The goal is to think like an operator. What you're going to do is run the pipeline in one-shot mode, then enable best-of-N search and observe where the

Instructor:candidates are generated, where the reranking happens, and which one gets selected. When you run those examples, look for three things: do the candidates differ meaningfully, does the reranking pick what you expected, and does the final judge still behave the same way? This is the key. The judge does not change. Only the quality of its input improves. That's the boundary that keeps this system stable. Once you've done that, we'll come back to the presentation. So let's go hands-on.

Instructor:We are hands-on now, and the best place to start is Lesson Six here. In Lesson Six you already saw the diagram, and if you need to catch up, you can go through the previous lessons as well. Test-time search is a common pattern: spend the compute on selection, not training. Now there are multiple ways of implementing test-time search. One is a full run. That means we get the final results and then compare. The other one is a partial run. In a mathematical sense,

Instructor:many machine learning algorithms can tell us what, say, the 25th optimization round looks like, or what a sample looks like. So you can rank based on sampling, on a distribution, or on any other selection criterion that reduces a full run to a partial run. But make sure the partial run is long enough to be meaningful, so that whatever the reranker decides reflects the likely outcome. But if the partial run cannot be trusted,

Instructor:then unfortunately there is no better way: you have to make the full run and then make a choice. However, if a partial run can be trusted, that's the best case. Most mathematical solutions give early signals. That means we can identify the best candidate well before the process completes, and we only let the process move forward beyond the point where we have full confidence. Or you can choose in a waterfall manner, which means it starts with

Instructor:10, then filters out five, then filters out three, and then the remaining two run to the end and one gets picked. So there are many ways to implement test-time search. There is no one particular answer. Isolation is part of the search contract. So please make sure the candidates run in a sandbox or a temp directory, so that they do not influence each other or contaminate the result. The fixed judge makes reranking meaningful. This is the other important pillar I want to stress: that every candidate is

Instructor:scored by the same judge. Make sure the judge and the hints provided are fair and unbiased; otherwise they will generally produce a biased result. That's one of the problem areas where people struggle to figure out why the wrong result has been selected. It's not intentional; some of the hints are unintentionally biased. That means the main anchor and the judge always lean toward one kind of choice, whether

Instructor:they are right or wrong, because they don't ask the question, "Right or wrong?" They ask, "Where is the better score?" That's why we need to make sure the score is as honest as possible and that search doesn't stack the deck on token cost, which we already discussed. There is no single best answer. You have to pick the right approach based on your use case and on how much cost you can incur to get that business value. So there is sufficient

Instructor:documentation here, and I would recommend these code anchors. If you go to the best-of-N cases, you will find what these algorithms look like and how we have created them. There are enough documentation paths to visit in the examples, and in the exercises we ask you to do some more work on it. So it might be interesting for you to evaluate and write your own reranker. In real-world cases there will always be multiple rerankers. They are similar

Instructor:to RAG rerankers, but they're deterministic. Here you can have nondeterministic rerankers too. I have also seen many top-end research organizations use a similar pattern as an agent harness to do RAG retrieval. And this is a very powerful method. If you have very high-value research going on and you want to find almost pin-perfect context for LLMs, this is a very good model, which can drive mutation-based content

Instructor:finding with proper reranking. It's one of the top patterns you can go for if you want the highest accuracy, given where we are today in terms of technical advancement. This is the highest-quality setup you can have for near-perfect recall and near-perfect retrieval. So we already know how status, verify, reset, and evaluate work. I'm not going to repeat that, but if you want to rerank, it's very simple. Here just use this function

Instructor:and command, which will rerank, and it will create two candidates: conservative and value-first. So not very difficult to understand. However, let me first make an honest disclosure. This example is not very deep compared with actual reranker use cases. For that we would need a very close-to-real-world example, and to be honest, hardly five or six percent of real-world use cases can justify the reranking process. So in this one we are just demonstrating how it works from the pattern

Instructor:perspective. There is no real value coming out of the reranker here. We haven't got that complex an algorithm or that complex a use case, because with a use case that complex the example would become very difficult for most learners to handle. So here we can see what happened. Value-first was candidate two, and conservative was the first approach. You can see that they both used different tokens, different values,

Instructor:and so on. Finally the run sorted the candidates and picked one: it fixed 54 rows, with 65 still needing mutation, and then the mutation patched it, which resulted in many events. So 54 rows were fixed by the mutation preview itself, and that's what "fixed rows 54" in the output says. So that's the outcome we are talking about. Now go to the exercises, and you can see here most of them are medium or hard,

Instructor:but I would say these are some of the harder exercises to complete. The dashboard is also running here, so we can go to the dashboard and validate what the last run looked like. In the current artifacts you won't see much of a difference, but it will help you understand each run and how it processes different things. Now reranking can also be extended into more advanced patterns which we are not covering in this course,

Instructor:called hybrid fusion. Fusion is where two or three different candidates process the sections of code, and we do not evaluate at the atomic level of the whole candidate but at the subatomic level: we check which algorithm handles each individual group of records better, and then based on that we merge. So we fuse. We don't select candidate 1, 2, or 3, but maybe 30 rows from one, 60 rows from two, and 30

Instructor:rows from three, and then we merge, like a git merge, and produce a final result. It's also a very powerful pattern that can deliver some advanced results, but it's not covered here. To deliver something like that you need very strong merge logic. That's also a data-structure challenge for anyone who would like to implement it as an extension of this example and open a PR. We will be able to share it with everyone if you

Instructor:manage to get that through as well. But yeah, that sounds good, and now I'm going to go back without spending much more time there. Here you can see the latest score, 13/14. So we are good with it. Okay, we are back now. So now your loop can search, compare, and choose instead of blindly committing to one output. That's a huge step. But remember, this only works because it's bounded. And in any self-improvement agent, bounded is

Instructor:the key word, more than self-improvement, because self-improvement without bounds is most likely anarchy and randomness. But with a boundary it can serve a very meaningful purpose. So you control how many candidates, how much cost and latency, and that's the engineering. Next we are closing this course by adding the production safety rails and, gradually, the autonomy model, so your system doesn't just work, it behaves responsibly at scale. And if you made it this far, subscribe.

Instructor:Stick with me and let's finish this strong. Take care. I'll see you in another video.
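
The waterfall selection described in the transcript (start with 10 candidates, filter out five, filter out three, let the remaining two run to the end, pick one) can be sketched roughly as follows. This is an illustration only, not code from the CleanLoop repo: `waterfall_select`, `partial_score`, `keep_schedule`, and the toy `quality` table are hypothetical names, and a real partial score would come from running each candidate a bit further at every stage.

```python
from typing import Callable, List, Sequence


def waterfall_select(
    candidates: Sequence[str],
    partial_score: Callable[[str, int], float],
    keep_schedule: Sequence[int] = (5, 2, 1),
) -> str:
    """Narrow the field stage by stage and return the one survivor.

    keep_schedule says how many candidates survive each stage:
    10 -> 5 -> 2 -> 1 with the default schedule.
    """
    survivors: List[str] = list(candidates)
    for stage, keep in enumerate(keep_schedule):
        # Score only the candidates that are still alive at this stage.
        ranked = sorted(
            survivors, key=lambda c: partial_score(c, stage), reverse=True
        )
        survivors = ranked[:keep]
    return survivors[0]


# Toy data (assumption): a fixed quality value stands in for a partial run.
values = [0.30, 0.55, 0.80, 0.10, 0.65, 0.45, 0.90, 0.20, 0.70, 0.50]
quality = {f"cand-{i}": q for i, q in enumerate(values)}

calls = []  # records every scoring call so the cost is visible


def partial_score(candidate: str, stage: int) -> float:
    calls.append((candidate, stage))
    return quality[candidate]


winner = waterfall_select(list(quality), partial_score)
print(winner, len(calls))  # cand-6 17
```

The call count is the point of the pattern: 10 + 5 + 2 = 17 partial scores, and only the last two candidates are carried to the final stage, instead of taking all ten candidates through every stage.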

Compare One-Shot vs Reranked Mutation

Reset and capture the baseline

Reset the genome and run one bounded loop without reranking so you have a clean baseline before you widen search.

bash

python util.py reset
python util.py loop --max-iterations 1

Enable best-of-N search

Turn on reranking with two candidates and inspect which candidate survives the fixed judge.

bash

python util.py loop --max-iterations 1 --rerank --candidates 2
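
Conceptually, the reranked run fans out to a bounded number of candidates, scores each one with the same judge, and forwards a single survivor. The sketch below illustrates that shape in Python. It is not the CleanLoop implementation: `Candidate`, `judge`, `best_of_n`, and the toy behavior of the two proposers are assumptions made for illustration; only the labels conservative and value-first come from the lesson.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    strategy: str           # label of the proposer that produced it
    rows: Tuple[int, ...]   # the proposed cleaned rows


def judge(rows: Tuple[int, ...]) -> int:
    """Fixed, deterministic judge: counts rows that satisfy one rule (>= 0)."""
    return sum(1 for value in rows if value >= 0)


def best_of_n(
    dirty_rows: List[int],
    proposers: Dict[str, Callable[[List[int]], List[int]]],
    max_candidates: int = 2,
) -> Tuple[Candidate, int]:
    """Fan out to at most max_candidates proposals, score each with the
    same judge, and return the single survivor with its score."""
    chosen = list(proposers.items())[:max_candidates]  # bounded fan-out
    candidates = [Candidate(name, tuple(fn(dirty_rows))) for name, fn in chosen]
    scored = [(judge(c.rows), c) for c in candidates]
    # max() keeps the earliest candidate on ties, so the result is deterministic.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy proposers (hypothetical behavior, named after the lesson's two candidates).
proposers = {
    "conservative": lambda rows: [r for r in rows if r >= 0],  # drop bad rows
    "value_first": lambda rows: [abs(r) for r in rows],        # repair bad rows
}

survivor, score = best_of_n([5, -3, 12, -1], proposers, max_candidates=2)
print(survivor.strategy, score)  # value_first 4
```

Note that `judge` is defined once and never receives the strategy name, so it cannot favor one proposer over the other; only the rows it is given change.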

Review the saved evidence

Open the dashboard or the observe output so you can inspect candidate width, token cost, and the final selected attempt.

bash

python util.py observe
python util.py dashboard

Complete source code for this lesson.

github.com/nilayparikh/tuts-agentic-ai-examples/tree/main/self-improving-agent/cleanloop

Q&A

Why is reranking different from changing the judge?

Because reranking spends more inference-time budget before commit, while the deterministic judge still defines success the same way for every candidate.

When is best-of-N search worth paying for?

When a better result has clear business value and one-shot quality is too unstable. If the gain does not justify the extra latency and token cost, the simpler path usually wins.
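
One rough way to put numbers on this trade-off is a back-of-the-envelope model. The sketch below assumes that candidates succeed independently with the same probability and that the judge always recognizes a successful candidate; both are simplifications, and every name and figure here is hypothetical rather than taken from the lesson.

```python
def success_probability(p_single: float, n: int) -> float:
    """Chance that at least one of n independent candidates succeeds."""
    return 1.0 - (1.0 - p_single) ** n


def net_value(
    p_single: float, n: int, value_per_success: float, cost_per_candidate: float
) -> float:
    """Expected business value minus what the n candidates cost to run."""
    gain = success_probability(p_single, n) * value_per_success
    return gain - n * cost_per_candidate


def best_width(
    p_single: float,
    value_per_success: float,
    cost_per_candidate: float,
    max_n: int = 5,
) -> int:
    """Candidate count in 1..max_n with the highest expected net value."""
    return max(
        range(1, max_n + 1),
        key=lambda n: net_value(p_single, n, value_per_success, cost_per_candidate),
    )


# Same one-shot success rate (60%), different business value per success.
high_value = best_width(0.6, value_per_success=10.0, cost_per_candidate=1.0)
low_value = best_width(0.6, value_per_success=2.0, cost_per_candidate=1.0)
print(high_value, low_value)  # 2 1
```

With a high value per success, a second candidate pays for itself; with a low value, the one-shot path wins, which matches the answer above.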

Why does the lesson insist on isolation and fairness?

Because reranking is only trustworthy when candidates do not contaminate each other and the same fixed judge evaluates all of them with the same hints and scoring rules.
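
A minimal sketch of that isolation, assuming each candidate is a function that edits files in a working folder: every candidate gets its own throwaway copy of the workspace, and the same judge scores each copy. The names `score_in_isolation`, `judge`, `conservative`, and `value_first`, plus the `data.txt` layout, are hypothetical and do not describe how the CleanLoop repo lays out its files.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def score_in_isolation(
    workspace: Path,
    mutate: Callable[[Path], None],
    judge: Callable[[Path], int],
) -> int:
    """Copy the workspace to a temp dir, apply one candidate, score the copy."""
    with tempfile.TemporaryDirectory(prefix="candidate-") as tmp:
        sandbox = Path(tmp) / "work"
        shutil.copytree(workspace, sandbox)  # candidate never sees the original
        mutate(sandbox)
        return judge(sandbox)                # sandbox is deleted on exit


def judge(folder: Path) -> int:
    """Same rule for every candidate: count non-negative rows."""
    rows = (folder / "data.txt").read_text().split()
    return sum(1 for value in rows if int(value) >= 0)


def conservative(folder: Path) -> None:
    path = folder / "data.txt"
    kept = [v for v in path.read_text().split() if int(v) >= 0]
    path.write_text("\n".join(kept) + "\n")


def value_first(folder: Path) -> None:
    path = folder / "data.txt"
    fixed = [str(abs(int(v))) for v in path.read_text().split()]
    path.write_text("\n".join(fixed) + "\n")


with tempfile.TemporaryDirectory() as base:
    workspace = Path(base) / "workspace"
    workspace.mkdir()
    (workspace / "data.txt").write_text("5\n-3\n12\n")
    scores = {
        fn.__name__: score_in_isolation(workspace, fn, judge)
        for fn in (conservative, value_first)
    }
    untouched = (workspace / "data.txt").read_text()

print(scores)  # {'conservative': 2, 'value_first': 3}
```

Because each candidate edits only its own copy, the second candidate starts from the same input as the first, and the original workspace is unchanged after both have been scored.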