Here’s how not to waste your budget on evaluating models and systems
You can build a fortress in two ways: start stacking bricks one on top of another, or draw up a plan of the fortress you're about to build, then keep evaluating your progress against that plan.
We all know the second one is the only way we can possibly build a fortress.
Sometimes, I'm the worst at following my own advice. I'm talking about jumping straight into a notebook to build an LLM app. It's the fastest way to ruin a project.
Before we begin anything, we need a mechanism that tells us whether we're moving in the right direction: whether the last thing we tried was better than what came before (or worse).
In software engineering, it’s called test-driven development. For machine learning, it’s evaluation.
The first step in developing LLM-powered applications, and the most valuable skill, is defining how you'll evaluate your project.
Evaluating LLM applications is nothing like software testing. I don't mean to downplay the challenges of software testing, but evaluating LLMs isn't nearly as straightforward.
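To make that contrast concrete, here's a minimal sketch in Python. The `query_llm` stub and the word-overlap metric are hypothetical placeholders, not a real API: a unit test either passes or fails on an exact condition, while an LLM evaluation scores outputs against references and judges the aggregate.

```python
def test_adder():
    # Classic software test: deterministic, exact, pass/fail.
    assert 2 + 2 == 4

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "Paris is the capital of France."

def _words(text: str) -> set[str]:
    # Normalize to lowercase words without trailing punctuation.
    return {w.strip(".,!?").lower() for w in text.split()}

def similarity(answer: str, reference: str) -> float:
    # Toy metric: fraction of reference words found in the answer.
    # A real project would use exact match, embedding similarity,
    # an LLM judge, or a task-specific metric instead.
    ref, ans = _words(reference), _words(answer)
    return len(ref & ans) / len(ref)

def evaluate(dataset: list[tuple[str, str]], threshold: float = 0.7) -> float:
    # LLM evaluation: score every example, then judge the aggregate,
    # because no single output is guaranteed to match exactly.
    scores = [similarity(query_llm(q), ref) for q, ref in dataset]
    avg = sum(scores) / len(scores)
    print(f"avg score = {avg:.2f} ({'pass' if avg >= threshold else 'fail'})")
    return avg

if __name__ == "__main__":
    evaluate([
        ("What is the capital of France?", "The capital of France is Paris."),
    ])
```

The point isn't the metric itself; it's that the judgment is graded and aggregated over a dataset, rather than a single binary assertion.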