ABSTRACT
Objectives: To assess the accuracy of ChatGPT-4 Omni (GPT-4o) in biomedical statistics. The recently released artificial intelligence tool ChatGPT-Omni (GPT-4o) has the potential to analyze sophisticated and extensive datasets, challenging the expertise of statisticians who use traditional statistical tools for data analysis.
Methods: This study was performed in the Department of Physiology, College of Medicine, King Saud University, Riyadh, Saudi Arabia, in May 2024. Three datasets in raw Excel file format were imported into Statistical Package for the Social Sciences (SPSS) version 29 for data analysis. Based on this analysis, a script of 9 questions was prepared as commands for GPT-4 Omni, which was then used to analyze all 3 datasets. The score and the time were recorded for each result, and each result was verified against the original analysis performed in SPSS.
Results: GPT-4 Omni scored 73 (85.88%) out of 85 points across all 3 datasets. The datasets took a total of 38.43 minutes to be fully analyzed. Individually, Omni scored 21/25 (84%) for the small dataset in 487.74 seconds, 20/25 (80%) for the medium dataset in 747.02 seconds, and 32/35 (91.43%) for the large dataset in 1071.31 seconds. GPT-4 Omni produced accurate graphs and charts.
Conclusion: ChatGPT-4 Omni scored 80% or more on all 3 statistical datasets in a short period. GPT-4 Omni also produced accurate graphs and charts as commanded; however, it required explicit commands with clear instructions to avoid errors and omissions and to achieve appropriate results in biomedical data analysis.
The 21st century has witnessed remarkable advancements in science, medicine, and technology. These innovations have aimed to improve individuals’ lifestyles and reduce their workload; however, the recent arrival of artificial intelligence (AI) has outdone the work of all its predecessors. OpenAI’s software ChatGPT has been the pioneering product in this series, with human-like cognitive reasoning and problem-solving skills. ChatGPT’s advanced natural language skills have made it a valuable tool across many domains, including education, healthcare, business, and finance.1 Beyond these capabilities, and specific to the field of medicine, AI is now also widely used for drug testing, imaging diagnosis, and precision medicine.2-4
In scientific research specifically, ChatGPT can assist academics in conducting literature reviews, summarizing complex concepts in simpler terms, developing outlines, and improving writing style.1,5 ChatGPT can also generate innovative ideas and hypotheses.1 One area in which it was lacking was accurate statistical analysis of data; however, recent updates have filled this gap. The latest AI ChatGPT tool, “Omni,” has emerged with the potential to analyze sophisticated and extensive datasets in a matter of seconds. Omni GPT6 is a custom GPT developed by Teddy Pena7 and is a powerful tool that is publicly available within OpenAI’s Explore GPTs option. It is designed to handle and analyze data in various formats such as Excel, CSV, and PDF. The process involves using Python libraries such as pandas for data manipulation and Matplotlib for creating several types of graphs, including but not limited to histograms, scatter plots, line charts, bar charts, and box plots.8 While this software performs data processing and analysis in a shorter time, its accuracy is yet to be evaluated. Therefore, this study aims to assess the accuracy of ChatGPT-4 Omni (GPT-4o) in biomedical statistics.
Methods
This study was performed in the Department of Physiology, College of Medicine, King Saud University, Riyadh, Kingdom of Saudi Arabia, in May 2024. Datasets were obtained from a public, open-access source, Kaggle,9-11 in raw Excel file format, ensuring that they could be edited and imported into Statistical Package for the Social Sciences (SPSS). We chose 3 datasets of different sizes to ensure comprehensive analysis and to evaluate the scalability of the AI-powered data analysis tool. The datasets were: small (fertility dataset, 100 rows), medium (sleep health and lifestyle dataset, 374 rows), and large (maternal health dataset, 1014 rows).
Statistical analysis
Statistical analysis was performed using the traditional statistical software SPSS version 29 (IBM Corp, Armonk, NY, USA). These calculations and analyses were then repeated on an AI-powered data analysis tool, ChatGPT (Omni). Bivariate and multivariate analyses were performed using various tests (Table 1). The same analyses were performed with both tools, and the results were later compared.
Statistical Package for Social Sciences analysis and ChatGPT response comparison
The datasets were imported into SPSS, and each statistical test was conducted step by step, in the order of the study questions, using SPSS’s built-in tools. The initial analysis was conducted in SPSS version 29 by one statistician and was then reviewed and reconfirmed by a second statistician. All results were recorded and exported.
The raw Excel file was uploaded to ChatGPT’s Omni tool interface to perform the same analysis (Figure 1). This was based on a set of 9 questions (Table 2) developed by the research team in the same order as the analysis conducted in SPSS. The questions were worded carefully and precisely to form a script of commands to be entered into the Omni chat page, and all statistical tests were executed using this same script for all 3 datasets. However, certain questions were slightly modified when entered into GPT.
To measure the precision, accuracy and performance of Omni, a score chart was formulated based on the 9 questions, the results of which were compared to the original analysis performed on SPSS. Each component within the question was assigned 1 mark. Questions containing multiple components, such as the choice of correct statistical test, p-value, F statistics, and B-coefficient, amongst others, accumulated a higher total mark accordingly. The total for all nine questions was summed up to a maximum of 35 marks, subject to the condition that all statistical tests could be performed for all datasets. The time was also recorded in seconds for each response given by Omni to assess speed performance.
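The marking scheme described above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors’ actual procedure: the function name `score_question`, the matching tolerance, and the example values are hypothetical; only the rule of 1 mark per component comes from the text.

```python
import math

def score_question(reference, response, rel_tol=1e-3):
    """Award 1 mark per component that matches the SPSS reference.

    `reference` and `response` map component names (e.g. "test", "p_value")
    to values. Numeric components match within `rel_tol` (an assumed
    tolerance); all other components must be exactly equal.
    """
    marks = 0
    for name, expected in reference.items():
        got = response.get(name)
        if isinstance(expected, (int, float)) and isinstance(got, (int, float)):
            if math.isclose(got, expected, rel_tol=rel_tol, abs_tol=1e-9):
                marks += 1
        elif got == expected:
            marks += 1
    return marks, len(reference)

# Hypothetical values for a correlation question with three components.
spss = {"test": "spearman", "coefficient": 0.412, "p_value": 0.003}
omni = {"test": "pearson", "coefficient": 0.412, "p_value": 0.003}
print(score_question(spss, omni))  # (2, 3)
```

Summing the second element of each returned pair over the applicable questions gives the maximum mark for a dataset, which is why the maximum differs between datasets when some tests cannot be performed.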
Different ChatGPT Omni chats were used for small, medium, and large datasets to ensure the results were separated and no chat memory was retained regarding the choice of tests. While new features are being rolled out 12 that may allow the sharing of AI memory between different chat sessions with specific commands and settings, for this analysis, no such settings were enabled. The memory between different chat windows was not transferable. All the results from ChatGPT Omni’s Interface were saved and exported in PDF files. We also made notes regarding our team’s observations when using AI to provide qualitative insights about the user experience.
The study did not include any humans or animals. The data was obtained from a publicly available dataset; therefore, the study was exempted from ethical approval or informed consent.
Results
Omni’s statistical performance and response time were assessed using a set of 9 questions, as shown in Table 1. The results were then compared with the data analysis performed in SPSS and marked accordingly. ChatGPT-Omni scored 73 (85.88%) out of 85 points across all 3 datasets in 38.43 minutes (mins) (Table 3).
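The overall figures can be recomputed from the per-dataset scores and response times reported in this section. The short sketch below only restates that arithmetic; the dictionary layout and variable names are illustrative choices.

```python
# Scores and response times as reported in the Results section.
results = {
    "small":  {"score": 21, "max": 25, "seconds": 487.74},
    "medium": {"score": 20, "max": 25, "seconds": 747.02},
    "large":  {"score": 32, "max": 35, "seconds": 1071.31},
}

total_score = sum(r["score"] for r in results.values())
total_max = sum(r["max"] for r in results.values())
total_minutes = sum(r["seconds"] for r in results.values()) / 60

for name, r in results.items():
    print(f"{name}: {r['score']}/{r['max']} = {100 * r['score'] / r['max']:.2f}%")
print(f"overall: {total_score}/{total_max} = {100 * total_score / total_max:.2f}%")
print(f"total time: {total_minutes:.2f} min")
```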
Small dataset results
The small dataset consisted of 100 rows and 10 variables, to which an extra variable was added in the form of “categorical age.” Of these 11 variables, 2 were continuous skewed variables, 6 were nominal and 3 were ordinal variables. This entire dataset was evaluated for 7 questions since the small dataset variables did not qualify for 2 statistical test questions. The Omni interface took 487.74 seconds (8.13 mins) to complete the entire dataset’s analysis, scoring 21 out of a total of 25 marks (Table 3).
When asked to re-code the string categorical variables, Omni did so correctly in 17.7 seconds, scoring 1/1 mark; however, the assigned numerical values did not follow the natural order of the categories (for example, 1 = never, 0 = occasional headaches, 2 = daily). Question 2 also scored the full 2 marks, as age was categorized successfully in 15.38 seconds.
Descriptive analysis for Question 3a yielded accurate results for the mode, median, and range in 44.1 seconds, scoring all 3 marks. When mean and standard deviation analysis was requested in Q3b, the analysis was performed correctly, scoring 2 marks. In Q3c, Omni scored 1.5/2 marks for correct frequency and percentage results; 0.5 was deducted because the interface analyzed only one variable owing to a technical error. In Q3d, Omni scored 0.5/1 mark for skewness, as it produced inaccurate results for half of the variables (high fever, smoking, surgery, diagnosis).
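For reference, the descriptive measures named in Question 3 can be computed with the Python standard library alone. The sketch below is illustrative: the function name `describe` and the sample values are hypothetical, and the skewness formula is the bias-adjusted sample skewness, which we assume (rather than confirm from the study) is the form reported by SPSS.

```python
import statistics

def describe(values):
    """Return the descriptive measures from Question 3 for one numeric variable."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    # Bias-adjusted sample skewness; requires n >= 3 and sd > 0.
    skew = n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)
    return {
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "mean": mean,
        "sd": sd,
        "skewness": skew,
    }

# Hypothetical right-skewed sample, not taken from the study datasets.
sample = [1, 1, 2, 2, 2, 3, 4, 9]
summary = describe(sample)
print(summary)
```

Different libraries default to different skewness estimators, so when comparing tools it is worth confirming that the same formula is in use before marking a result as inaccurate.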
Omni’s choice of statistical test for Question 4 was inaccurate, yielding a score of 0/1 as it chose Pearson’s test, failing to check the normality of both continuous variables. However, when commanded to check normality on a second attempt, the system then rightly chose Spearman’s Correlation test, but marks were not granted here. The 2 marks gained were simply for the accurate Spearman coefficient and p-value. Q4B was inapplicable to this dataset as skewed continuous variables meant a simple linear regression could not be performed.
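The test-selection step that Omni skipped on its first attempt can be sketched as follows. This is a minimal illustration assuming SciPy is available; the study does not state which normality test was used, so the Shapiro-Wilk test, the 0.05 threshold, the function name `correlate`, and the data are all assumptions.

```python
from scipy import stats

def correlate(x, y, alpha=0.05):
    """Pick Pearson or Spearman after a Shapiro-Wilk normality check on both variables."""
    normal = all(stats.shapiro(v).pvalue > alpha for v in (x, y))
    if normal:
        name, result = "pearson", stats.pearsonr(x, y)
    else:
        name, result = "spearman", stats.spearmanr(x, y)
    return name, result[0], result[1]  # test name, coefficient, p-value

# Hypothetical skewed data (cubes and fourth powers), not from the study datasets.
x = [i ** 3 for i in range(1, 41)]
y = [i ** 4 for i in range(1, 41)]
name, coefficient, p_value = correlate(x, y)
print(name, round(coefficient, 3))
```

Because both hypothetical variables are strongly right-skewed, the normality check fails and the rank-based test is selected, mirroring the choice Omni made only after being prompted.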
Question 5 scored all 3 marks for the correct choice of the Mann-Whitney test and an accurate test statistic and p-value in 62.91 seconds. Similarly, all 3 marks were awarded for Question 6, where Omni chose the correct test and produced accurate results in 103.79 seconds. Here, the system began to lag, as the response time increased markedly. In Question 7, Omni scored 1/3 marks for the correct choice of the Chi-square test in 41.5 seconds; however, it yielded an inaccurate Pearson chi-square value and p-value, so 2 marks were deducted. Multiple linear regression could not be performed for this dataset, so Q8 was omitted. Omni produced accurate charts as commanded in Question 9 and hence scored 3/3 in 24.51 seconds (Table 3).
Medium dataset results
The medium dataset consisted of 374 rows. Analyzing the whole question set (7 questions, as 2 did not apply to this dataset) took 747.02 seconds (12.45 mins), scoring 20/25. For Question 1, the raw file listed the body mass index (BMI) categories “normal” and “normal weight” separately. While a human statistician would recognize these as the same category and merge them, the AI interpreted them as distinct categories, so a score of 0/1 was given. The AI reported this accurately only after we re-uploaded a corrected raw file in which the categories were combined.
Additionally, when re-coding the string variables into numeric groupings, the AI showed a lack of common-sense reasoning, assigning data values in a non-consecutive way. For example, we would normally recode BMI as normal = 1, overweight = 2, and obese = 3, but the AI re-coded it as 1 = obese, 2 = normal, and 3 = overweight. Questions 2 and 3 were completed accurately within 23.88 and 124.67 seconds, respectively.
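Both recoding issues described for the medium dataset (duplicate labels and non-consecutive codes) can be avoided by stating the merge and the category order explicitly. The following pandas sketch is illustrative only: the column name `bmi` and the rows are hypothetical, and it is not the code Omni generated.

```python
import pandas as pd

# Hypothetical rows; the labels mimic the BMI example in the text.
df = pd.DataFrame({"bmi": ["Normal", "Obese", "Normal Weight", "Overweight", "Normal"]})

# Step 1: merge labels that denote the same category.
df["bmi"] = df["bmi"].replace({"Normal Weight": "Normal"})

# Step 2: state the natural order explicitly, then derive consecutive codes from it.
# Labels missing from `order` would receive code 0 here, which flags them for review.
order = ["Normal", "Overweight", "Obese"]
df["bmi_code"] = pd.Categorical(df["bmi"], categories=order, ordered=True).codes + 1

print(df["bmi_code"].tolist())  # [1, 3, 1, 2, 1]
```

Left to infer the order itself, a tool may fall back on alphabetical or first-seen order, which would be consistent with the obese = 1, normal = 2 coding observed in the study, although the study does not confirm the cause.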
For Question 4, the AI initially made an incorrect choice by selecting the Pearson test for correlation even though the data were skewed, so 0/1 marks were given. However, when asked about the normality of the data, Omni re-ran the analysis and, on the second attempt, correctly chose Spearman’s test. Two marks were given here for producing an accurate coefficient and p-value. Question 5 was also performed accurately within 67 seconds.
For Question 6, the AI chose the correct test and performed the analysis accurately in 120 seconds. At this point, the system was overburdened and lagging; hence, the team decided to move to a new chat. Question 7 scored full marks and was completed in 55 seconds. Question 8 was not applicable, as the dataset did not meet the test assumptions. When asked to produce charts for Question 9, Omni reported persistent internal errors and was unable to plot charts, so it scored 0/3.
Large dataset results
Our large dataset consisted of 1014 rows. Omni took 1071.31 seconds (17.86 minutes) to complete the analysis based on all 9 questions, all of which were applicable to this dataset. It scored 32 out of a total of 35.
For Questions 1 and 2, specific commands had to be given regarding the re-coded categories. The results were accurate, and both questions together took around 50 seconds. On Question 3, Omni scored the full 8/8 marks for accurate calculations and took around 1 minute and 32 seconds.
In Question 4, for the association between 2 continuous variables, Omni chose the right tests based on the normality and linearity of the variables and reported the right test statistics and p-values in 2.92 minutes, gaining the full 7/7 score.
For Question 5, Omni chose the wrong test initially. Our variables were continuous and dichotomous categorical variables for which the t-test is more appropriate, but Omni chose ANOVA and took 108 seconds to do the full analysis. We later re-entered the command, clearly specifying the given dichotomous categorical variables and asking if ANOVA was the right choice. Omni corrected itself, choosing the right test in the second attempt and giving the correct analysis in 120 seconds. Questions 6 and 7 also produced accurate results, getting full scores.
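The selection rule the team expected for Question 5 can be written down directly: a dichotomous grouping variable calls for an independent-samples t-test, while 3 or more groups call for one-way ANOVA. The sketch below assumes SciPy is available and that the normality and equal-variance assumptions hold; the function name `compare_groups` and the measurements are hypothetical.

```python
from scipy import stats

def compare_groups(groups):
    """Use an independent-samples t-test for 2 groups and one-way ANOVA for 3 or more."""
    if len(groups) == 2:
        result = stats.ttest_ind(groups[0], groups[1])  # assumes equal variances
        return "t-test", result.statistic, result.pvalue
    result = stats.f_oneway(*groups)
    return "anova", result.statistic, result.pvalue

# Hypothetical measurements for a dichotomous grouping variable.
low_risk = [118, 120, 115, 122, 119, 121]
high_risk = [130, 128, 135, 129, 133, 131]
name, statistic, p_value = compare_groups([low_risk, high_risk])
print(name, round(statistic, 2), p_value)
```

For skewed data, as in the small dataset, the same branching would apply with the rank-based counterparts of these tests.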
Multiple linear regression was fully analyzed in 39 seconds for Question 8, but the results were only partially correct, giving a score of 2/4. The calculated R² and p-value were right, but the F statistic and test coefficient were wrong. When commanded to recheck the results, Omni correctly calculated all the components of multiple linear regression. The last question concerned making appropriate graphs for different variables. Omni produced all 3 charts accurately, scoring 3/3 in 39.16 seconds.
Figure 2. AI-generated graphs of the large dataset.
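The four components marked in Question 8 (coefficients, R², F statistic, and p-value) are tied together by the ordinary least squares fit, so one of them can be used to cross-check another. The sketch below assumes NumPy and SciPy are available; the function name `ols_summary` and the simulated data are hypothetical and are not drawn from the maternal health dataset.

```python
import numpy as np
from scipy import stats

def ols_summary(X, y):
    """Fit y = b0 + X @ b by least squares; return coefficients, R^2, F and its p-value."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape                      # n observations, k predictors
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Overall F test for a model with an intercept and k predictors.
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    p_value = float(stats.f.sf(f_stat, k, n - k - 1))
    return beta, r2, f_stat, p_value

# Hypothetical data: two predictors and a noisy linear outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=50)
beta, r2, f_stat, p_value = ols_summary(X, y)
print(np.round(beta, 2), round(r2, 3), round(f_stat, 1), p_value)
```

Since F follows from R², n, and k, an output with a correct R² but an incorrect F statistic, as reported here, is internally inconsistent and can be flagged without rerunning the analysis in SPSS.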
The time recorded in Table 1 was for the full Omni response, which included the background analysis and the final output in the chat. Omni’s background data analysis, which uses Python libraries, was relatively fast. The primary time-consuming aspect of the full response was the generation of the results on the chat page, together with the comprehensive interpretation provided to help the user understand them. Compared with SPSS, the computational time of the background analysis in Omni was still slightly longer. However, it should be noted that with SPSS, the statistician performing the analysis has already reviewed the data, manipulated the dataset, and transformed variables before any analysis is done. Omni, on the other hand, performs all these steps itself from scratch, which accounts for the increased computational time. Moreover, presenting multiple questions within a single chat led to response delays and lagging. Also, if the chat was left unused for a long time and analysis was then resumed in the same chat, the system lagged and produced errors. In both cases, a new chat had to be initiated. Such errors are not seen in SPSS.
Discussion
Artificial intelligence is beginning to revolutionize medicine, medical education, and the health sciences by assisting doctors, students, researchers, and faculty members in medical diagnosis, improving the workflow of health systems, and helping patients process their data.14-16 Our study aimed to assess the accuracy of ChatGPT-4o in biomedical statistics. For this study, we chose the Omni-GPT tool on OpenAI’s platform, which is available in the paid version of ChatGPT. Our team chose datasets from a public-domain source and commanded the AI to perform several statistical tests. The results were then verified by cross-checking against the output produced by SPSS.
Our results showed that Omni-GPT achieved a score of 85.9% and took 38.43 minutes to analyze all datasets. It not only provided full transparency by showing the background analysis based on Python libraries but also gave comprehensive explanations of each step of the analysis and interpretations that even beginner-level users could understand. Such rapid analysis of raw Excel data cannot be performed in SPSS unless the data are manually entered into the system, which is tedious and time-consuming. This speed was one of the key features of AI highlighted in this study and in other literature, where AI has been noted to effectively aid and speed up the process of scientific writing.17
The literature highlights the role of ChatGPT as a biostatistical tool. More recently, Ignjatović and Stevanović (2024)18 assessed the performance of ChatGPT (GPT-3.5 and GPT-4) in solving biostatistical problems. In the first 3 attempts, GPT-3.5 showed an average level of performance, while GPT-4 exhibited good performance.
Despite this reliable performance, several issues were noted. There was significant lag when multiple questions were asked within the same chat window. There were also errors and delays if the chat was left unused for a while and questions were then resumed in it. In addition, the same command could omit essential parts of the results on one occasion while producing them on another. While the results of the tests were mostly precise, the choice of test was sometimes incorrect because Omni failed to check whether the assumptions of those tests were met. Artificial intelligence is also highly dependent on the quality and cleanliness of the data it receives; otherwise, it is prone to giving inaccurate results. Therefore, it is essential to clean and modify the data before inputting it. In addition, the AI lacked the basic reasoning needed to assign numerical values to categories in a consecutive order.
In the current model, despite the capabilities of AI, human oversight is still needed to validate results and interpret findings. This suggests that AI cannot be solely relied upon, especially by users who lack understanding and knowledge of data analysis and statistics. However, AI shows great potential, and future updates and newer models can be expected to outperform not only the current data analysis tool but perhaps statisticians as well, in terms of speed, interpretation for unskilled users, feasibility, and the absence of any need to manually enter, modify, and clean raw data.
Risks and benefits exist for all innovations; Figure 2 shows the pros and cons of both human-operated SPSS and AI-led Omni. However, given the great feasibility of AI, mankind must make use of the latest technology in the best way possible to benefit the scientific community, while keeping in mind current and potential future risks and ensuring that nothing diminishes or challenges human cognitive capacity. Such a societal crisis can be avoided by ensuring the safe use of technology while reducing reliance on it where possible.
Strengths and limitations
The strength of our study lies in the fact that it is a novel study of Omni’s performance and accuracy in comparison with the human-operated traditional software SPSS. We reported both quantitative results (in terms of score and time) and qualitative measures (in terms of observations made by our team), providing deeper insights into AI capacity for data analysis while cross-checking the accuracy of the AI-produced results against reliable SPSS results produced by our skilled team. Our limitations include time delays in the ChatGPT analysis due to limited Wi-Fi strength, which potentially affected the reported response times for some variables. Response times could also differ on other laptop or computer models or operating systems. Our study analyzed Omni-GPT, which is a customized GPT available on OpenAI’s ChatGPT platform. Other AI models may perform differently. We also do not know the full extent of the capabilities or limitations of AI in statistics.
In conclusion, ChatGPT-Omni performed well, scoring 80% or more on all 3 statistical datasets in a short period. ChatGPT-Omni also produced accurate graphs and charts as commanded; however, it required explicit commands with clear instructions to avoid errors and omissions and to achieve appropriate results in biomedical data analysis. It shows significant potential but still needs considerable improvement in its coding, especially to address the technical errors encountered with multiple commands and the accuracy of results. Further research is also required to fully assess the accuracy of GPT-4o in biomedical statistics.
Acknowledgment
We thank the Researchers Supporting Project (number RSP 2024 R47), King Saud University, Riyadh, Saudi Arabia. We are thankful to Mrs. Hanan Ghaleb Mansour AlHajji, Secretary, Department of Physiology, College of Medicine, KSU, for her help. We acknowledge Sofia Fields Author Services (https://www.sofiafields.com/) for the language editing.
Footnotes
Disclosure. This study was funded by King Saud University, Riyadh, Kingdom of Saudi Arabia (Project No. RSP 2024 R47).
- Received May 25, 2024.
- Accepted November 6, 2024.
- Copyright: © Saudi Medical Journal
This is an Open Access journal and articles published are distributed under the terms of the Creative Commons Attribution-NonCommercial License (CC BY-NC). Readers may copy, distribute, and display the work for non-commercial purposes with the proper citation of the original work.