1. Comparative Analysis of ChatGPT-4 and Gemini Advanced in Erroneous Code Detection and Correction
Sun, Erik Wen Han; Grace, Yasine. January 2024
This thesis investigates the capabilities of two advanced Large Language Models (LLMs), OpenAI's ChatGPT-4 and Google's Gemini Advanced, in the domain of software engineering. While LLMs are widely utilized across various applications, including text summarization and synthesis, their potential for detecting and correcting programming errors has not been thoroughly explored. This study aims to fill this gap by conducting a comprehensive literature search and an experimental comparison of ChatGPT-4 and Gemini Advanced using the QuixBugs and LeetCode benchmark datasets, with a specific focus on the Python and Java programming languages. The research evaluates the models' abilities to detect and correct bugs using metrics such as Accuracy, Recall, Precision, and F1-score.

Experimental results show that ChatGPT-4 consistently outperforms Gemini Advanced in both the detection and correction of bugs. These findings provide valuable insights that could guide further research in the field of LLMs.
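The abstract does not reproduce the metric definitions, so the following Python sketch shows how Accuracy, Precision, Recall, and F1-score are conventionally computed for a binary bug-detection task. The function name, labels, and toy data are illustrative assumptions, not taken from the thesis:

```python
# Minimal sketch (not from the thesis): computing the four reported metrics
# for a binary bug-detection task, where 1 = "program contains a bug" and
# 0 = "program is correct". Names and data are illustrative.

def detection_metrics(y_true, y_pred):
    """Return Accuracy, Precision, Recall, and F1 for binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Ground-truth labels vs. one model's verdicts (toy data for illustration).
truth = [1, 1, 0, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]
print(detection_metrics(truth, preds))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```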
2. A Method for Automated Assessment of Large Language Model Chatbots: Exploring LLM-as-a-Judge in Educational Question-Answering Tasks
Duan, Yuyao; Lundborg, Vilgot. January 2024
This study introduces an automated evaluation method for large language model (LLM) based chatbots in educational settings, utilizing LLM-as-a-Judge to assess their performance. Our results demonstrate the efficacy of this approach in evaluating the accuracy of three LLM-based chatbots (Llama 3 70B, ChatGPT 4, Gemini Advanced) across two subjects: history and biology. The analysis reveals promising performance across both subjects. On a scale from 1 to 5 measuring the correctness of the judge itself, the LLM judge's average correctness scores when evaluating each chatbot on history-related questions are 3.92 (Llama 3 70B), 4.20 (ChatGPT 4), and 4.51 (Gemini Advanced); for biology-related questions, the average scores are 4.04 (Llama 3 70B), 4.28 (ChatGPT 4), and 4.09 (Gemini Advanced). This underscores the potential of leveraging the LLM-as-a-Judge strategy to evaluate the correctness of responses from other LLMs.
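The abstract does not detail the judging pipeline, so the sketch below illustrates the general LLM-as-a-Judge pattern in Python: a judge model scores each chatbot answer on a 1-to-5 correctness scale against a reference answer, and the scores are averaged per chatbot. All names (`JUDGE_PROMPT`, `judge_model`, `average_correctness`) and the example data are assumptions for illustration, not the authors' implementation:

```python
# Minimal sketch of the general LLM-as-a-Judge pattern, not the thesis's
# pipeline: a judge model grades each answer 1-5, and scores are averaged.
from statistics import mean

JUDGE_PROMPT = (
    "You are grading a chatbot's answer to an educational question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Chatbot answer: {answer}\n"
    "Rate the answer's correctness from 1 (wrong) to 5 (fully correct). "
    "Reply with the number only."
)

def judge_model(prompt: str) -> str:
    """Placeholder for a call to the judge LLM's API (an assumption here)."""
    return "4"  # fixed stub so the sketch runs end to end

def average_correctness(items, get_answer):
    """Average judge score for one chatbot over a list of QA items."""
    scores = []
    for item in items:
        prompt = JUDGE_PROMPT.format(
            question=item["question"],
            reference=item["reference"],
            answer=get_answer(item["question"]),
        )
        scores.append(float(judge_model(prompt).strip()))  # expects "1".."5"
    return mean(scores)

# Toy usage: one history question, one chatbot.
items = [{"question": "Who drafted the U.S. Declaration of Independence?",
          "reference": "Thomas Jefferson drafted it in 1776."}]
print(average_correctness(items, get_answer=lambda q: "Thomas Jefferson"))
```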