Evaluating the Suitability of Large Language Models (LLMs) in the Health Sector Using the BERTScore Method

Authors

  • Listia Baene, Universitas Buddhi Dharma
  • Dram Renaldi, Universitas Buddhi Dharma
  • Edy Edy, Universitas Buddhi Dharma

Keywords:

BERTScore, ChatGPT, Google Colab, LLMs, Python

Abstract

This research is motivated by the growing use of Large Language Models (LLMs) such as ChatGPT to provide health information, which calls for quantitative evaluation of the validity of the generated answers. Its objective is to develop and test a Python-based automated testing instrument, run in Google Colab, that assesses the quality of ChatGPT's health-related answers using the BERTScore semantic metric. The method takes a CSV dataset of medical questions, reference answers, and ChatGPT answers, then computes BERTScore Precision, Recall, and F1 values, labels answer quality, analyzes results by category, and visualizes them; the instrument itself was validated on its SQA, interface, and coding-structure aspects. Testing on 100 questions across 10 categories yielded high BERTScore values (around 0.84–0.89), with the share of answers rated “good” ranging from 40% to 90% per category: Human Anatomy and Physiology and Nutrition and Dietetics achieved the highest percentages, while Pharmacology, Public Health and Prevention, and Ethics and Law were in the lowest range. Validation by three experts produced an average score of 82.6%, so the instrument was judged to function well and to be sufficiently reliable for evaluating LLM answers in the health domain.
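
The scoring and labeling steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' instrument: it shows only the greedy cosine-matching core of BERTScore on toy token vectors (omitting the contextual embedding model, IDF weighting, and baseline rescaling), followed by a simple per-category tally. The function names, the 0.87 cutoff for a "good" answer, and the example numbers are all hypothetical, since the abstract does not state the labeling rule.

```python
import math
from collections import defaultdict

# Hypothetical cutoff for labeling an answer "good"; not taken from the paper.
GOOD_THRESHOLD = 0.87


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_from_embeddings(cand, ref):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    cand and ref are non-empty lists of token vectors. In practice these
    would come from a contextual model such as BERT; here they are supplied
    by the caller.
    """
    sim = [[cosine(c, r) for r in ref] for c in cand]
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(row) for row in sim) / len(cand)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand))) for j in range(len(ref))
    ) / len(ref)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def label_answer(f1, threshold=GOOD_THRESHOLD):
    """Map an F1 value to a quality label (assumed two-level scheme)."""
    return "good" if f1 >= threshold else "needs review"


def percent_good_by_category(rows, threshold=GOOD_THRESHOLD):
    """rows: iterable of (category, f1) pairs. Returns {category: percent good}."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for category, f1 in rows:
        totals[category] += 1
        if label_answer(f1, threshold) == "good":
            good[category] += 1
    return {c: 100.0 * good[c] / totals[c] for c in totals}


if __name__ == "__main__":
    # Toy vectors and scores for illustration only; not data from the study.
    p, r, f1 = bertscore_from_embeddings([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
    print(percent_good_by_category([("Pharmacology", 0.85), ("Pharmacology", 0.90)]))
```

For real text, the open-source bert_score package exposes a score(cands, refs, lang="en") function that returns precision, recall, and F1 tensors; whether the authors used that package is not stated in the abstract.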

Published

2026-01-07