Abstract
Objective
Artificial intelligence (AI)-based large language models (LLMs), such as ChatGPT-4.0, are increasingly being considered for clinical decision-making support; however, their reliability in providing clinical recommendations for varicocele-related infertility has not been thoroughly evaluated. This study aimed to assess the reliability of ChatGPT-4.0 in providing clinical recommendations for patients with varicocele-related infertility.
Materials and Methods
A standardized clinical scenario was created involving a 32-year-old male with varicocele and oligoasthenoteratozoospermia, including details from physical examination, hormonal profile, and semen analysis based on the World Health Organization 6th edition criteria. Sixteen diagnostic and therapeutic questions were developed and submitted to ChatGPT-4.0. The AI-generated responses were reviewed by 24 experienced urologists specializing in varicocele management, who rated the recommendations using a 5-point Likert scale.
Results
Across the 16 questions, urologists agreed with 80.2% of ChatGPT-4.0's recommendations, disagreed with 10.7%, and were neutral toward 9.1%. For 14 of the 16 questions, the majority of urologists either agreed or strongly agreed with ChatGPT-4.0. Recommendations regarding varicocelectomy indication, antioxidant use, female partner age greater than 35 years, follow-up after varicocelectomy, testosterone deficiency, and normospermic varicocele showed the highest consensus. However, lower agreement rates were noted for microsurgical varicocelectomy (54.1%) and preoperative sperm cryopreservation (16.7%).
Conclusion
ChatGPT-4.0 demonstrates reliability in providing clinical recommendations in most scenarios related to varicocele treatment, showing strong agreement with expert clinicians. However, specific "gray zone" scenarios requiring individualized decision-making highlight its limitations, emphasizing the importance of experienced clinical judgment. ChatGPT-4.0 can serve as a reliable informational tool regarding varicocele treatment but should be used with caution in complex clinical decisions requiring personalized evaluation.
What’s known on the subject? and What does the study add?
Large language models like ChatGPT are increasingly used by both clinicians and patients to obtain medical information. However, their accuracy and alignment with expert recommendations in specific urological conditions such as varicocele remain unclear. Previous studies evaluating ChatGPT have focused mainly on general medical knowledge or patient education. This study is the first to systematically assess the clinical reliability of ChatGPT-4.0 in varicocele-related infertility scenarios using structured expert evaluation. It demonstrates that ChatGPT-4.0 provides recommendations largely consistent with expert opinion, especially in guideline-based standard cases, while highlighting its limitations in gray-zone decisions that require individualized clinical judgment.
Introduction
Male infertility is a significant reproductive health issue affecting approximately 15% of couples, with male factors being the primary cause in 40-50% of these cases (1). One of the most common causes of male infertility is varicocele, which is observed in 35-40% of subfertile men and up to 15% of the general population (2, 3). The effects of clinical varicocele on infertility have been studied for many years, and surgical treatment—particularly microscopic subinguinal varicocelectomy—has been reported to improve sperm parameters and increase spontaneous pregnancy rates (4, 5).
The management of varicocele cases remains controversial in “gray zone” situations such as mild sperm abnormalities, grade I varicocele, or azoospermia (6, 7). Treatment decisions are typically based on a holistic evaluation of multiple factors, including the patient’s clinical findings, semen analysis results, female partner’s age, and the couple’s reproductive expectations (8, 9).
In recent years, the integration of artificial intelligence (AI) systems into clinical decision-making processes has accelerated. Artificial intelligence–based large language models (LLMs), especially advanced versions such as ChatGPT-4.0, have begun to be used experimentally in medical education and clinical decision support (10). Nowadays, patients frequently consult internet sources and LLMs for health-related issues. However, the extent to which LLMs are beneficial to patients and the accuracy of their recommendations in specific clinical scenarios remain unclear (11). The consistency of LLMs with expert opinions and the reliability of the information they provide have been assessed in only a limited number of studies (12).
In this study, diagnostic and therapeutic suggestions were obtained from ChatGPT-4.0 for an infertile male patient diagnosed with varicocele, as well as for several derived clinical scenarios of varicocele. The responses generated by ChatGPT-4.0 for each scenario were evaluated by urology specialists. The aim was to assess, through clinician evaluation, the reliability of AI-based language model recommendations for varicocele.
Materials and Methods
Study Design
This observational study was conducted in April and May 2025 using the AI-based language model ChatGPT-4.0. No patient identifiers or real patient data were used; LLMs are based on publicly available information, and because no human subjects were involved, ethical approval was not required (13, 14). The Institutional Ethics Committee of University of Health Sciences Türkiye, Gülhane Training and Research Hospital reviewed the study design and concluded that ethical approval was not necessary, based on the absence of patient data, the exclusive collection of anonymous professional opinions, and the non-interventional nature of the survey. No identifiable personal or health information was obtained, and all responses remained anonymous. The study was conducted in accordance with the principles outlined in the Declaration of Helsinki (2004 revision).
Creating Case Scenarios
The survey questions were created by the authors (F.Y.I., E.B., Y.K.T., and S.B.). A case scenario was designed based on a 32-year-old infertile male patient diagnosed with varicocele. The case included physical examination findings, hormone profile, scrotal Doppler ultrasonography results, and semen analysis (according to the 6th edition of the World Health Organization classification) (15). Subsequently, multiple clinical scenarios were generated by modifying various parameters such as sperm count, female partner's age, and hormone levels. A total of 16 questions were developed.
The initial scenario used in this study was based on a representative index case constructed to reflect a typical clinical presentation of varicocele-related infertility. To evaluate a broader range of decision-making contexts, this base scenario was systematically modified to include varying clinical parameters, thereby creating more complex clinical situations. Although no real patient data were used, the final set of scenarios was designed to closely resemble real-life cases.
To assess the clinical realism of the case scenarios, participants were asked to rate the similarity of the scenarios to actual clinical practice using a 5-point Likert scale. Remarkably, 100% of the respondents selected “strongly agree”, indicating that the scenarios were perceived as highly reflective of real-world cases.
AI-based Clinical Decision Generation and Expert Opinion Survey
Diagnostic and therapeutic recommendations for the infertile male case scenarios with varicocele were generated by ChatGPT-4.0. These responses were evaluated by 24 urology specialists through an online survey conducted and recorded via Google Forms. Urologists with experience in the diagnosis, follow-up, and treatment of varicocele were included in the study. Participants assessed the ChatGPT-4.0 responses anonymously; participation was voluntary, no incentives were provided, and no random sampling was performed. The survey used a 5-point Likert scale (1 = strongly disagree, 5 = strongly agree) on which the specialists rated their level of agreement with each recommendation.
Statistical Analysis
The collected data were analyzed statistically using SPSS version 26.0 (IBM, Armonk, NY, USA). The data distribution was tested using the Kolmogorov-Smirnov test. For the comparison based on the experience level of the urologists, an independent-samples t-test was used to compare mean Likert scale scores. A p-value of <0.05 was considered statistically significant. Inter-rater agreement among the 24 participating urologists was assessed using Fleiss' kappa for multiple raters. The original 5-point Likert scale responses were dichotomized into "Agree" (scores 4-5) and "Not agree" (scores 1-3) to facilitate interpretation. Kappa values were interpreted according to the Landis and Koch criteria, where <0.20 indicates slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and >0.80 almost perfect agreement.
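For illustration, the dichotomization and agreement analysis described above can be sketched as follows. This is a minimal Python example (NumPy and SciPy assumed) using randomly generated ratings in place of the actual survey data; only the matrix dimensions (16 questions, 24 raters) and the two experience groups of 12 raters are taken from the study, and it is not the authors' SPSS procedure.

```python
# Minimal sketch of the agreement analysis, using hypothetical data
# in place of the actual survey responses.
import numpy as np
from scipy import stats

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n_items, n_categories) matrix of rater counts."""
    n = counts.sum(axis=1)[0]                      # raters per item (constant here)
    p_j = counts.sum(axis=0) / counts.sum()        # marginal category proportions
    p_i = (np.sum(counts**2, axis=1) - n) / (n * (n - 1))  # per-item agreement
    p_bar, p_e = p_i.mean(), np.sum(p_j**2)
    return (p_bar - p_e) / (1 - p_e)

rng = np.random.default_rng(0)
likert = rng.integers(1, 6, size=(16, 24))         # 16 questions x 24 raters, scores 1-5

# Dichotomize: "Agree" (scores 4-5) vs "Not agree" (scores 1-3),
# then count raters per category for each question.
agree = likert >= 4
counts = np.stack([(~agree).sum(axis=1), agree.sum(axis=1)], axis=1)
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")

# Experience-group comparison: mean Likert score per rater, two groups of 12,
# compared with an independent-samples t-test as in the study design.
group_a, group_b = likert[:, :12].mean(axis=0), likert[:, 12:].mean(axis=0)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```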
Results
Clinical Features
A 32-year-old male patient presented with a two-year history of infertility. His 29-year-old female partner had no abnormalities on gynecological evaluation. The patient's testicular volumes were 18 mL on the right and 16 mL on the left. Physical examination revealed a grade II varicocele on the left side. Scrotal Doppler ultrasonography showed a left pampiniform plexus vein diameter of 3.4 mm during the Valsalva maneuver, with a reflux duration of 2.1 seconds. Hormonal analysis revealed a follicle-stimulating hormone level of 4.1 IU/L, a luteinizing hormone level of 3.5 IU/L, and a total testosterone level of 4.8 ng/mL. The semen analysis results of the representative case, which were consistent with varicocele-associated oligoasthenoteratozoospermia, are presented in Table 1.
ChatGPT-4.0 Responses and Survey Evaluation
Among the 24 participating urology specialists, 12 had <5 years (group A) and 12 had >5 years (group B) of clinical experience. Additionally, twelve were employed at tertiary care centers, eight at state hospitals, and four at private institutions. The clinical questions posed to ChatGPT-4.0 and its corresponding responses are presented in Table 2. There were no statistically significant differences between groups A and B in terms of the mean scores assigned to any of the questions (p>0.05).
The distribution of responses from the 24 urologists to the 16 clinical questions is illustrated in Figure 1. Each stacked bar depicts the proportion of participants selecting each category on the 5-point Likert scale, ranging from "Strongly disagree" to "Strongly agree." Scenarios with guideline-based management recommendations show clear consensus, whereas "gray zone" scenarios reveal more heterogeneous response distributions; this graphical representation complements the tabulated data. Overall, urologists agreed with ChatGPT-4.0's responses in 80.2% of ratings, disagreed in 10.7%, and were neutral in 9.1%. For 14 of the 16 questions, the majority of urologists selected either "Agree" or "Strongly agree."
The ChatGPT-4.0 responses regarding varicocelectomy indication (Question 1, 100%), antioxidant recommendation (Question 3, 91.7%), female partner age >35 years (Question 6, 91.7%), follow-up after varicocelectomy (Question 7, 91.7%), testosterone deficiency (Question 10, 91.7%), and normospermic varicocele (Question 12, 95.8%) showed the highest consensus among clinicians.
For the question on microsurgical varicocelectomy (Question 4), 54.1% (13/24) of urologists agreed with the ChatGPT-4.0 response, 20.9% (5/24) were neutral, and 25% (6/24) disagreed. For preoperative sperm cryopreservation (Question 5), 16.7% (4/24) agreed, 33.3% (8/24) were neutral, and 50% (12/24) disagreed.
Inter-rater agreement was re-evaluated by considering only the "Agree" (scores 4-5) and "Disagree" (scores 1-2) categories, excluding "Neutral" (score 3) responses. Using this approach, the overall Fleiss kappa value was 0.267, indicating fair agreement according to the Landis and Koch classification. This method eliminates the influence of indecisive responses and allows a clearer assessment of the consistency between positive and negative judgments.
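This neutral-exclusion re-analysis can be sketched in the same style as the earlier example. Because dropping score-3 responses can leave a different number of decisive raters per question, the sketch below uses a Fleiss-type statistic generalized to unequal rater counts; this is an illustrative assumption under hypothetical data, not necessarily the exact procedure applied in SPSS.

```python
# Sketch of the neutral-exclusion variant described above (hypothetical data).
import numpy as np

def fleiss_kappa_unequal(counts):
    """Fleiss-type kappa from an (n_items, n_categories) count matrix,
    allowing a different number of raters per item."""
    n_i = counts.sum(axis=1)                        # decisive raters per item
    p_j = counts.sum(axis=0) / counts.sum()         # marginal category proportions
    p_i = (np.sum(counts**2, axis=1) - n_i) / (n_i * (n_i - 1))
    p_e = np.sum(p_j**2)
    return (p_i.mean() - p_e) / (1 - p_e)

rng = np.random.default_rng(0)
likert = rng.integers(1, 6, size=(16, 24))          # 16 questions x 24 raters

# Keep only decisive responses: "Disagree" (1-2) and "Agree" (4-5); drop score 3.
disagree = (likert <= 2).sum(axis=1)
agree = (likert >= 4).sum(axis=1)
counts = np.stack([disagree, agree], axis=1)
print(f"Kappa excluding neutrals = {fleiss_kappa_unequal(counts):.3f}")
```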
Discussion
This study assessed the concordance between the clinical recommendations of ChatGPT-4.0 for varicocele-related infertility and the evaluations of practicing urologists. The high overall agreement suggests that ChatGPT-4.0 is capable of generating recommendations aligned with expert perspectives, particularly when guideline-based indications are clear.
Disagreement was most evident in scenarios involving microsurgical varicocelectomy and preoperative sperm cryopreservation. These differences likely reflect variations in institutional resources, surgical expertise, and the absence of definitive guidance in current clinical recommendations. For example, while microsurgical varicocelectomy is widely regarded as the standard of care, practical limitations may influence decision-making in certain settings. Similarly, the lack of strong evidence regarding preoperative sperm cryopreservation leads to heterogeneity in clinical practice.
The model’s suggestion to incorporate antioxidant therapy and to adjust treatment strategies when the female partner is older than 35 years was broadly endorsed by experts, reflecting alignment with both current literature (4, 5) and international guidelines (8, 9). The consistency of agreement across urologists with different levels of experience further suggests that LLMs may offer uniform, guideline-consistent information to patients, regardless of the clinician’s background.
Despite these strengths, limitations remain in “gray zone” scenarios, where nuanced clinical judgment is essential and standardized recommendations are lacking. This reinforces the view expressed in prior literature (16, 17) that AI tools should serve as complementary aids rather than replacements for clinician expertise.
Study Limitations
This study has several limitations. First, the performance of ChatGPT-4.0 was evaluated based on fixed clinical scenarios, which may not fully reflect the dynamic nature of real-life patient interactions. Second, the number of participating urologists was limited, which may restrict the generalizability of expert opinions. Previous studies have shown that LLMs may occasionally produce inaccurate responses that appear correct and may demonstrate inconsistencies when dealing with complex clinical scenarios (18-20). Therefore, although ChatGPT-4.0 showed high agreement with expert opinions in many scenarios, it should be used cautiously in complex cases.
Conclusion
While AI-based language models such as ChatGPT-4.0 demonstrate a high level of consistency with clinical guidelines and expert consensus in standard index cases, their utility remains limited in complex or low-evidence scenarios where nuanced clinical judgment is essential. In such contexts, expert opinion remains irreplaceable, and current AI systems cannot substitute for the depth and flexibility of experienced clinical reasoning.


