A Method for Automated Assessment of Large Language Model Chatbots: Exploring LLM-as-a-Judge in Educational Question-Answering Tasks. Duan, Yuyao; Lundborg, Vilgot. January 2024.
This study introduces an automated evaluation method for large language model (LLM) based chatbots in educational settings, using an LLM-as-a-Judge to assess their performance. Our results demonstrate the efficacy of this approach in evaluating the accuracy of three LLM-based chatbots (Llama 3 70B, ChatGPT 4, Gemini Advanced) across two subjects: history and biology. The analysis reveals promising performance in both. On a 1-to-5 correctness scale, the LLM judge's average scores when evaluating each chatbot on history-related questions are 3.92 (Llama 3 70B), 4.20 (ChatGPT 4), and 4.51 (Gemini Advanced); for biology-related questions, the averages are 4.04 (Llama 3 70B), 4.28 (ChatGPT 4), and 4.09 (Gemini Advanced). This underscores the potential of the LLM-as-a-Judge strategy for evaluating the correctness of responses from other LLMs.