Unsupervised LLM Evaluations
Practitioner's guide to judging outputs of large language models

Daniel Kharitonov

<TLDR> Evaluating AI-generated outputs is critical for building robust applications of large language models because it allows complex AI applications to be split into simple stages with built-in error control. It is relatively straightforward to evaluate generative outputs in a supervised mode, where the “right answers” can be