As artificial intelligence (AI) systems increasingly permeate our health care industry, it is imperative that physicians take a proactive role in evaluating these novel technologies. AI-driven tools are reshaping diagnostics, treatment planning, and risk assessment, but with this transformation comes the responsibility to ensure that these systems are valid, reliable, and ethically deployed. A clear understanding of key concepts like validity, reliability, and the limitations of AI performance metrics is essential for making informed decisions about AI adoption in clinical settings.
Validity is the quality of being correct or true—in other words, whether and how accurately an artificial intelligence system measures (i.e., classifies or predicts) what it is intended to measure. Reliability refers to the consistency of the output of an artificial intelligence system, that is, whether the same (or a highly correlated) result is obtained under the same set of circumstances. Both need to be measured, and both need to exist for an artificial intelligence system to be trustworthy.
A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition, such as a disease, when it is not present, while a false negative is the opposite error, in which the test result incorrectly fails to indicate a condition that is present. These are the two kinds of errors possible in a binary test, in contrast to the two kinds of correct results (a true positive and a true negative). Such errors are common in health care AI predictions, particularly in binary classifications where only two outcomes (e.g., disease or no disease) are possible. False positives and false negatives correspond to Type I and Type II errors in statistical hypothesis testing, and the balance between them plays a critical role in determining the AI’s overall performance.
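To make the four outcomes concrete, the short Python sketch below tallies them from a handful of hypothetical labels and predictions; the numbers are purely illustrative and are not drawn from any real model.

```python
# Tally the four possible outcomes of a binary classifier.
# The labels and predictions below are purely illustrative.

y_true = [1, 0, 0, 1, 0, 1, 0, 0]  # 1 = disease present, 0 = disease absent
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]  # hypothetical model output

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I errors
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II errors

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=2, TN=4, FP=1, FN=1 for this toy data
```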
Physicians need to be aware of these risks and critically assess whether an AI tool is optimized to balance false positives and false negatives appropriately. An AI system that minimizes one type of error may inadvertently increase the other, which can have serious consequences depending on the clinical context.
One of the most common performance metrics used to evaluate AI systems is accuracy, the percentage of predictions the model gets right. However, physicians should be cautious about placing too much emphasis on this measure and should be aware of the accuracy paradox, which highlights the danger of relying on accuracy alone, especially in health care, where disease prevalence can vary significantly across populations. For example, a health care AI model designed to detect a rare condition may achieve high accuracy simply by predicting that most patients do not have the condition, yet be of little clinical use. Instead, physicians should look at additional performance metrics like precision and recall. Precision measures the proportion of positive predictions that are actually correct, while recall measures the proportion of true positive cases the AI system actually identifies. These metrics provide a more nuanced picture of how a health care AI tool performs, particularly when certain outcomes, like identifying a rare but deadly condition, matter more than others.
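The accuracy paradox is easy to demonstrate with a toy calculation. The sketch below assumes a hypothetical population of 1,000 patients, a 1 percent prevalence, and a degenerate "model" that always predicts "no disease"; every number here is an assumption chosen for illustration.

```python
# Illustrative only: a rare condition affecting 1% of 1,000 hypothetical patients.
# A "model" that always predicts "no disease" looks impressive on accuracy alone.

n_patients = 1000
n_diseased = 10                      # 1% prevalence, assumed for illustration

tp, fp = 0, 0                        # the always-negative model never predicts disease
fn = n_diseased                      # so every diseased patient is missed
tn = n_patients - n_diseased

accuracy = (tp + tn) / n_patients                   # 0.99, which looks excellent
precision = tp / (tp + fp) if (tp + fp) else 0.0    # reported as 0: no positive calls at all
recall = tp / (tp + fn)                             # 0.0: every true case is missed

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

Precision and recall expose the failure that a 99 percent accuracy figure conceals.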
An important consideration for physicians in evaluating health care artificial intelligence is the phenomenon known as Goodhart’s Law, which states that “when a measure becomes a target, it ceases to be a good measure.” This is particularly relevant in health care AI, where developers may optimize algorithms to perform well on specific benchmarks, sometimes at the expense of the AI system’s broader clinical usefulness. For instance, a health care AI model optimized to achieve high accuracy on a public dataset might perform poorly in real-world clinical settings.
A famous Goodhart’s Law example is the cobra effect, where well-intentioned government policies inadvertently worsened the problem they were designed to solve. The British colonial government in India, concerned about the increasing number of venomous cobras in Delhi, began offering a bounty for each dead cobra that was delivered. Initially, this strategy was successful as locals brought in large numbers of slaughtered snakes. Over time, however, enterprising individuals started breeding cobras to kill them for supplemental income. When the government abandoned the bounty, the cobra breeders released their cobras into the wild, leading to a surge in Delhi’s snake population.
The cobra effect, where efforts to control a problem lead to unintended and often worse outcomes, serves as a cautionary tale for health care AI. If developers or health care institutions focus too narrowly on specific AI performance metrics, they risk undermining the system’s overall effectiveness, leading to suboptimal patient outcomes. Physicians must be vigilant in ensuring that health care AI systems are not only optimized for performance metrics but are also truly beneficial in practical, clinical applications.
Health care AI evaluation must go beyond simple benchmarks to prevent systems from becoming “too good” at hitting narrowly defined targets, and instead ensure they remain robust in addressing the broader challenges they were designed to tackle. Goodhart’s Law warns us that relying solely on one AI performance metric can result in inefficiencies or even dangerous outcomes in health care settings. Therefore, physicians must understand that while AI can be a powerful health care tool, its performance must be carefully evaluated using hard empirical evidence to avoid undermining its intended purpose.
Physicians must also be aware of the ethical implications of AI in health care, where one key challenge is systematic bias within AI models, which can disproportionately affect certain patient populations. Efforts to equalize error rates across different demographic groups may compromise the calibration of a health care AI system, leading to imbalances in how accurately it predicts outcomes for different populations.
In artificial intelligence, calibration refers to how accurately a model’s predictions reflect real-world outcomes. A well-calibrated AI system ensures that predicted probabilities match the actual likelihood of an event. Equalization, on the other hand, involves ensuring that different groups (e.g., racial or gender groups) experience similar rates of certain types of errors, like false positives or false negatives. Balancing these two can be challenging because improving calibration might lead to unequal error rates across groups, while equalizing errors may reduce overall accuracy, leading to the ethical dilemma of prioritizing fairness versus precision.
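This tension can be seen in even a small, entirely hypothetical example. The sketch below compares a crude calibration check (mean predicted risk versus observed event rate) with the false positive and false negative rates for two made-up patient groups that have different base rates of a condition; the data and the 0.5 threshold are assumptions chosen for illustration only.

```python
# Hypothetical data only: two groups with different base rates of a condition.
# Compares a crude calibration check with per-group error rates at one threshold.

def group_summary(y_true, y_prob, threshold=0.5):
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    mean_prob = sum(y_prob) / len(y_prob)      # average predicted risk
    event_rate = sum(y_true) / len(y_true)     # observed rate of the condition
    fp = sum(1 for t, yp in zip(y_true, y_pred) if t == 0 and yp == 1)
    fn = sum(1 for t, yp in zip(y_true, y_pred) if t == 1 and yp == 0)
    return mean_prob, event_rate, fp / y_true.count(0), fn / y_true.count(1)

groups = {
    "A": ([1, 1, 0, 0, 0, 0, 0, 0], [0.9, 0.6, 0.2, 0.1, 0.1, 0.05, 0.05, 0.1]),
    "B": ([1, 1, 1, 1, 0, 0, 0, 0], [0.8, 0.7, 0.6, 0.4, 0.6, 0.3, 0.2, 0.1]),
}

for name, (y, p) in groups.items():
    mean_prob, event_rate, fpr, fnr = group_summary(y, p)
    # Both toy groups are roughly calibrated on average, yet their error rates differ
    # (group A: FPR=FNR=0.00; group B: FPR=FNR=0.25). Forcing the error rates to match
    # would typically require group-specific thresholds, which in turn disturbs calibration.
    print(f"group {name}: mean risk={mean_prob:.2f}, observed rate={event_rate:.2f}, "
          f"FPR={fpr:.2f}, FNR={fnr:.2f}")
```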
For example, if an AI tool used in risk assessment performs differently for different racial or ethnic groups, it could result in unequal medical treatment. This is especially concerning in health care, where biases in AI models could exacerbate existing health disparities. Physicians should advocate for transparency in how health care AI systems are trained and calibrated and demand that these tools undergo continuous evaluation to ensure they serve all patient populations fairly.
In a health care AI context, over-optimization for a single metric can have unintended consequences: because both error types often hinge on the same decision threshold, improving one area, such as lowering false positives, can produce a spike in false negatives, potentially harming patients.
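One common mechanism behind this trade-off is the decision threshold applied to a model’s risk scores. The sketch below, using made-up scores and outcomes, shows how raising the threshold on the very same predictions lowers false positives while false negatives climb.

```python
# Hypothetical risk scores and outcomes; illustrates the threshold trade-off only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.8, 0.1, 0.55, 0.45]

for threshold in (0.3, 0.5, 0.7):
    preds = [1 if s >= threshold else 0 for s in scores]
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    # For this toy data: threshold 0.3 gives FP=4, FN=0; 0.5 gives FP=2, FN=1;
    # 0.7 gives FP=1, FN=2. Fewer false alarms, more missed cases.
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```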
Ultimately, physicians must play a critical role in the evaluation and deployment of AI tools in health care. By understanding concepts like validity, reliability, precision, recall, Goodhart’s Law, and the accuracy paradox, they can better assess whether a given AI system is fit for clinical use. Furthermore, by advocating for transparency and fairness in how these systems are designed and applied, physicians can help ensure that AI is used ethically and effectively to improve patient care. As AI continues to evolve and integrate into health care, it is essential that physicians remain at the forefront of these changes, guiding the responsible and thoughtful use of this transformative technology.
Neil Anand is an anesthesiologist.