Article Type : Review Article
Title :   Best Practices for Compiling Test Datasets to Evaluate Artificial Intelligence Models in Pathology
Authors :   Shubham Belekar
Abstract :   Artificial intelligence (AI) has revolutionized pathology by enhancing diagnostic accuracy, efficiency, and workflow automation. However, the reliability and generalizability of AI models in pathology depend critically on the quality of the test datasets used for evaluation. Properly curated test datasets ensure robust validation, minimize bias, and improve clinical applicability. This review discusses key recommendations for compiling test datasets for AI evaluation in pathology, emphasizing dataset diversity, representativeness, annotation standards, ethical considerations, and regulatory compliance. We also highlight challenges such as dataset shifts, class imbalances, and privacy concerns while providing solutions for mitigating these issues. Standardized dataset compilation will enable the development of trustworthy AI systems and facilitate regulatory approval.
Introduction :   Artificial intelligence (AI) is transforming the field of pathology by enabling automated image analysis, anomaly detection, and diagnostic support [1]. Machine learning (ML) and deep learning (DL) models have shown remarkable potential in diagnosing diseases such as cancer, infectious diseases, and neurodegenerative disorders [2-6]. However, evaluating these AI solutions requires high-quality test datasets that reflect real-world clinical variability [7,8]. Poorly compiled test datasets can lead to biased models, poor generalization, and unreliable AI performance [9]. This review outlines best practices and recommendations for compiling test datasets for evaluating AI solutions in pathology. We discuss dataset composition, data diversity, annotation quality, ethical considerations, and regulatory requirements to ensure robust model evaluation.
Review of Literature :
2.1. Data Diversity and Representativeness [6-9]
A well-compiled test dataset should be representative of real-world pathology cases, encompassing:
- Demographic diversity: inclusion of samples from different age groups, ethnicities, and genders to avoid bias.
- Geographical diversity: data from multiple institutions worldwide to capture regional variation in pathology.
- Tissue and disease variability: a mix of normal, pre-malignant, and malignant cases across different organs and conditions.
- Imaging techniques: variation in staining techniques (H&E, IHC, special stains), scanner types, and resolutions.
2.2. Annotation Quality and Ground Truth Standards [10-12]
Accurate annotation is critical for reliable AI evaluation. Recommendations include:
- Expert consensus: pathologists with different levels of expertise should annotate independently and resolve disagreements by consensus.
- Multi-modal validation: use of clinical reports, molecular data, and follow-up records to confirm the ground truth.
- Annotation granularity: labeling at multiple levels (whole slide, region, cellular, subcellular) depending on the AI task.
- Inter- and intra-observer variability assessment: quantifying consistency among annotators to measure label uncertainty (a brief computational sketch follows this section).
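As an illustration of the inter-observer variability assessment described above, the following minimal Python sketch computes Cohen's kappa between two annotators' slide-level labels using scikit-learn. The label lists and the two-annotator, binary-label setup are hypothetical placeholders, not data from any cited study; in practice the same statistic (or Fleiss' kappa for more than two raters) would be computed on the actual annotation tables.

```python
# Minimal sketch: quantifying inter-observer agreement with Cohen's kappa.
# The labels below are illustrative placeholders (0 = benign, 1 = malignant).
from sklearn.metrics import cohen_kappa_score

annotator_a = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]  # hypothetical slide-level labels, pathologist A
annotator_b = [0, 1, 0, 0, 1, 0, 1, 1, 1, 1]  # hypothetical slide-level labels, pathologist B

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa between annotators: {kappa:.2f}")
```

Reporting such agreement statistics alongside the test set documents how much label uncertainty an AI model's measured performance inherits from the ground truth itself.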
Discussion :
2.3. Dataset Size and Class Balancing [13-15]
- Sufficient dataset size: the test dataset should contain enough samples to capture disease heterogeneity.
- Balanced classes: underrepresented classes (e.g., rare cancers) should be addressed through curated sampling or synthetic augmentation techniques.
- Real-world prevalence representation: disease prevalence in the dataset should mirror clinical practice to avoid misleading performance metrics (a sampling sketch appears at the end of this section).
2.4. Data Augmentation and Preprocessing Considerations [15]
- Standardized preprocessing pipelines: uniform steps such as intensity normalization, stain normalization, and resizing to avoid introducing bias.
- Augmentation transparency: documenting any augmentation techniques used, to prevent overestimating performance.
3. Challenges in Test Dataset Compilation
3.1. Dataset Shift and Domain Adaptation Issues [13]
AI models may fail when exposed to new data distributions because of dataset shift, including:
- Acquisition shift: differences in scanner hardware and staining protocols across institutions (a simple screening check is sketched at the end of this section).
- Population shift: variation in disease presentation across demographic groups.
- Temporal shift: changes in medical practice and diagnostic criteria over time.
Mitigation strategies include multi-center data collection, domain adaptation techniques such as adversarial training, and continuous model validation on new data.
3.2. Ethical and Privacy Concerns [12]
- De-identification: protecting patient privacy through anonymization techniques.
- Informed consent: obtaining patient consent when required, especially for publicly shared datasets.
- Bias mitigation: identifying and correcting biases in dataset representation.
3.3. Regulatory and Standardization Challenges
Regulatory bodies such as the FDA and European CE-marking authorities require AI models in pathology to be tested on standardized datasets. Recommendations include adherence to the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) and the use of benchmark datasets such as TCGA, CAMELYON, and MIDOG to facilitate regulatory approval.
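To make the prevalence-representation recommendation in Section 2.3 concrete, the sketch below draws a test set whose positive-class fraction approximates a chosen clinical prevalence. The case identifiers, labels, helper-function name, and the 5% prevalence figure are illustrative assumptions introduced here, not values taken from this review.

```python
# Minimal sketch: sampling a test set that matches a target clinical prevalence.
import random

def sample_prevalence_matched(cases, labels, target_prevalence, n_test, seed=42):
    """Sample n_test case IDs so the positive fraction approximates target_prevalence."""
    rng = random.Random(seed)
    positives = [c for c, y in zip(cases, labels) if y == 1]
    negatives = [c for c, y in zip(cases, labels) if y == 0]
    n_pos = min(round(n_test * target_prevalence), len(positives))
    n_neg = min(n_test - n_pos, len(negatives))
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)

# Hypothetical usage: the archive is enriched for disease, but the target
# population prevalence is assumed to be 5%.
case_ids = [f"case_{i}" for i in range(1000)]
case_labels = [1 if i < 300 else 0 for i in range(1000)]
test_ids = sample_prevalence_matched(case_ids, case_labels, target_prevalence=0.05, n_test=200)
positives_in_test = sum(1 for c in test_ids if int(c.split("_")[1]) < 300)
print(f"test size={len(test_ids)}, positive fraction={positives_in_test / len(test_ids):.2f}")
```

Sampling positives and negatives separately keeps the target prevalence stable even when the source archive is heavily enriched for disease, as pathology archives typically are.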
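Similarly, a simple screening check for the acquisition shift discussed in Section 3.1 is to compare per-slide summary statistics between institutions. The sketch below applies a two-sample Kolmogorov-Smirnov test to simulated mean stain intensities; all numbers are synthetic placeholders, and in practice the inputs would be features extracted from the whole-slide images of each contributing site.

```python
# Minimal sketch: flagging a possible acquisition shift between two sites by
# comparing per-slide mean stain intensities with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
site_a_intensity = rng.normal(loc=0.62, scale=0.05, size=200)  # simulated per-slide means, site A
site_b_intensity = rng.normal(loc=0.55, scale=0.05, size=180)  # simulated per-slide means, site B

result = ks_2samp(site_a_intensity, site_b_intensity)
if result.pvalue < 0.01:
    print(f"Possible acquisition shift (KS statistic={result.statistic:.2f}, p={result.pvalue:.1e})")
else:
    print("No strong evidence of a distribution shift between sites")
```

Such checks do not replace multi-center validation, but they help document how different the contributing sites are before performance metrics are pooled.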
Conclusion :  Compiling high-quality test datasets is crucial for evaluating AI solutions in pathology. A well-curated dataset should be diverse, accurately annotated, ethically compliant, and standardized for regulatory approval. Addressing challenges such as dataset shifts, class imbalances, and privacy concerns will lead to more robust AI models that generalize effectively in clinical practice. Future efforts should focus on developing large-scale, publicly available benchmark datasets to facilitate AI advancements in pathology.
References :  
1. Litjens, G., et al. (2017). "A survey on deep learning in medical image analysis." Medical Image Analysis, 42, 60-88.
2. Esteva, A., et al. (2019). "A guide to deep learning in healthcare." Nature Medicine, 25(1), 24-29.
3. Campanella, G., et al. (2019). "Clinical-grade computational pathology using weakly supervised deep learning on whole slide images." Nature Medicine, 25(8), 1301-1309.
4. Bulten, W., et al. (2020). "Artificial intelligence for diagnosis and Gleason grading of prostate cancer: The PANDA challenge." Nature Medicine, 26(9), 1410-1416.
5. Tolkach, Y., et al. (2020). "Deep learning-based classification of tissue sections in prostate cancer." Clinical Cancer Research, 26(17), 4447-4456.
6. Yu, K. H., et al. (2018). "Predicting clinical outcomes from histopathology images using multi-scale deep learning." Nature Medicine, 24(10), 1622-1629.
7. Van der Laak, J., et al. (2021). "Deep learning in histopathology: The path to the clinic." Nature Medicine, 27(5), 775-784.
8. Echle, A., et al. (2021). "Deep learning in cancer pathology: A new generation of clinical biomarkers." Nature Reviews Clinical Oncology, 18(6), 435-450.
9. Korbar, B., et al. (2017). "Deep learning for classification of colorectal polyps on whole-slide images." Journal of Pathology Informatics, 8, 30.
10. Bankhead, P., et al. (2017). "QuPath: Open-source software for digital pathology image analysis." Scientific Reports, 7, 16878.
11. Diao, J. A., et al. (2021). "Bias mitigation in machine learning for clinical decision support." The Lancet Digital Health, 3(10), e635-e644.
12. CAMELYON Consortium. (2018). "CAMELYON16 dataset for lymph node metastasis detection." IEEE Transactions on Medical Imaging, 37(5), 1154-1166.
13. The Cancer Genome Atlas (TCGA). (2020). "A comprehensive dataset for cancer research." Nature Genetics, 52(10), 943-950.
14. MIDOG Challenge. (2022). "Machine learning in digital pathology." Medical Image Analysis, 76, 102202.
15. FDA. (2021). "Regulatory framework for AI in medical imaging." Federal Register, 86(5), 320-330.