A Method for Automated Assessment of Large Language Model Chatbots: Exploring LLM-as-a-Judge in Educational Question-Answering Tasks. Duan, Yuyao; Lundborg, Vilgot. January 2024.
This study introduces an automated evaluation method for large language model (LLM) based chatbots in educational settings, using an LLM-as-a-Judge to assess their performance. Our results demonstrate the efficacy of this approach in evaluating the accuracy of three LLM-based chatbots (Llama 3 70B, ChatGPT 4, Gemini Advanced) across two subjects: history and biology. The analysis reveals promising performance in both. On a 1-to-5 correctness scale, the LLM judge's average scores when evaluating each chatbot on history-related questions are 3.92 (Llama 3 70B), 4.20 (ChatGPT 4), and 4.51 (Gemini Advanced); for biology-related questions, the averages are 4.04 (Llama 3 70B), 4.28 (ChatGPT 4), and 4.09 (Gemini Advanced). This underscores the potential of the LLM-as-a-Judge strategy for evaluating the correctness of responses from other LLMs.