LLM Evaluation

Abstract

In this presentation, we explore recent advancements in large language model (LLM) evaluation, focusing on LLM-as-a-judge methods and open evaluator models. Traditional evaluation metrics often fail to address task-specific requirements, prompting the development of LLM-based evaluators. First, we introduce **LLM-as-a-Judge** (Zheng et al., 2023), which highlights the scalability and explainability of LLMs as evaluators while addressing biases such as verbosity bias and position bias. Next, we go through **G-Eval** (Liu et al., 2023), a framework that leverages Auto CoT for evaluation and proposes a probability summation scoring method to overcome challenges such as low variance in scores. Finally, we examine **Prometheus 2** (Kim et al., 2024), an open-source evaluator model that demonstrates high alignment with human and GPT-4 evaluations through a weight-merging training strategy.
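As a rough illustration of the probability summation idea mentioned above, the sketch below computes a score as the probability-weighted sum of the candidate ratings, which yields a continuous value rather than a single integer. The function name `expected_score`, the example probabilities, and the normalization step are assumptions made for illustration; they are not taken from the G-Eval paper or its code.

```python
def expected_score(score_probs):
    """Return the probability-weighted score: sum of p(s) * s over candidate scores.

    score_probs maps each candidate rating (e.g. 1..5) to the probability the
    evaluator model assigns to it. All names here are hypothetical.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Assumption: normalize, because the probability mass over the rating
    # tokens alone may not sum to exactly 1.
    return sum(score * p for score, p in score_probs.items()) / total


# Made-up probabilities for ratings 1..5 (illustrative only).
probs = {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.40, 5: 0.20}
print(round(expected_score(probs), 2))
```

Two outputs that would both receive the same most-likely integer rating can still get different weighted scores, which is how this kind of scoring can separate candidates that otherwise tie.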

Term

Fall 2024

Date

November 22, 2024

Time

3:00 - 4:00 PM

Location

White Hall 100
