A Method for Automated Assessment of Large Language Model Chatbots : Exploring LLM-as-a-Judge in Educational Question-Answering Tasks

This study introduces an automated method for evaluating large language model (LLM) based chatbots in educational settings, using LLM-as-a-Judge to assess their performance. Our results demonstrate the efficacy of this approach in evaluating the accuracy of three LLM-based chatbots (Llama 3 70B, ChatGPT 4, Gemini Advanced) across two subjects: history and biology. The correctness of the judge's own evaluations was rated on a scale from 1 to 5. For history questions, the LLM judge’s average correctness scores when evaluating each chatbot were 3.92 (Llama 3 70B), 4.20 (ChatGPT 4), and 4.51 (Gemini Advanced); for biology questions, the average scores were 4.04 (Llama 3 70B), 4.28 (ChatGPT 4), and 4.09 (Gemini Advanced). These results underscore the potential of the LLM-as-a-Judge strategy for evaluating the correctness of responses from other LLMs.

http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-131267

Automatic evaluation parameters

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-131267
Date	January 2024
Creators	Duan, Yuyao, Lundborg, Vilgot
Publisher	Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM)
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
