
An Investigation of Human Evaluation Methods on State-of-the-Art Chatbots

Sarah Finch, James Finch

Abstract

Despite the recent popularity of conversational AI research, evaluating chat models remains a significant challenge in the field. Likert-scale human judgements are the most prominent evaluation methodology, but the high variance and lack of standardization of these labels make it difficult to draw high-confidence comparisons between models. To address this, we investigate alternative human judgement methodologies for chatbot evaluation, including comparative judgements and a novel behavior coding scheme. We apply these evaluation methodologies to four chatbots that have each achieved a state-of-the-art result in some aspect of conversational ability. This presentation covers our current progress, including the design of our behavior coding scheme, our bot selection and replication methods, and results from our pilot studies.

Term
Spring 2022
Date
March 18, 2022
Time
4:00 - 5:00 PM
Location