One of my biggest blog posts of 2022 asked whether HER2-low slide readings would be a breakthrough moment for digital pathology (here).
A major collaboration sponsored by Friends of Cancer Research (FoCR) was presented at the San Antonio Breast Cancer Symposium this week. The upshot: the inter-rater agreement of whole slide imaging / AI (WSI-AI) models was about as good as the inter-rater agreement among human experts.
- Find the press release here.
- Find the poster, McKelvey et al., here.
- Hear more results at the FoCR in-person and virtual conference on February 4, 2025. Here.
- Washington, DC, December 12, 2024 – New findings by Friends of Cancer Research (Friends) and collaborators were presented yesterday at the San Antonio Breast Cancer Symposium (SABCS) in a poster titled “Agreement Across 10 Artificial Intelligence Models in Assessing HER2 in Breast Cancer Whole Slide Images: Findings from the Friends of Cancer Research Digital PATH Project.”
- The poster presented the agreement of HER2 biomarker assessment across independently developed computational pathology models. These preliminary findings suggest that the level of agreement of predicted HER2 scores across models is similar to published agreement measures across pathologists.
- “It is crucial to have accurate and consistent identification of patients who may benefit from targeted therapies such as antibody-drug conjugates (ADCs),” said Dr. Brittany Avin McKelvey, Director of Regulatory Affairs at Friends. “The Digital PATH Project’s unique collaborative approach enables us to explore the potential of AI models to deliver more quantitative and reproducible biomarker assessments and helps address the current lack of large-scale comparative evaluations of performance.”
- The Digital PATH Project launched in February 2024.
Summary: Agreement Across AI Models in HER2 Assessment for Breast Cancer
This poster from the Friends of Cancer Research Digital PATH Project evaluates the agreement across 10 independently developed artificial intelligence (AI) models in assessing HER2 status in breast cancer using whole slide images (WSIs).
Introduction
HER2-targeted therapies, including antibody-drug conjugates, have expanded the patient population benefiting from such treatments. Accurate HER2 scoring is critical, yet variability exists between pathologists. AI models offer a quantitative alternative, but their performance variability remains understudied.
Methods
- Samples: WSIs (H&E and HER2 IHC) from 1,124 breast cancer patients (2021 cohort, ZAS Hospital, Belgium).
- HER2 scores were evaluated by three pathologists.
- Models: 10 AI models from 9 developers were analyzed. Models varied in inputs and outputs (e.g., predicted ASCO/CAP scores, H-scores).
- Analysis: Agreement was assessed without a defined reference standard, using metrics such as Overall Percent Agreement (OPA) and Cohen’s kappa (a minimal calculation sketch follows below).
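For readers curious what "agreement without a reference standard" looks like in practice, here is a minimal sketch of a pairwise OPA and Cohen's kappa calculation. The model names and score arrays are invented for illustration; they are not the Digital PATH Project's actual data or pipeline.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical predicted ASCO/CAP categories (0, 1+, 2+, 3+ coded as 0-3),
# one array per model over the same samples; values are invented.
scores = {
    "model_A": np.array([0, 1, 2, 3, 1, 0, 2, 3]),
    "model_B": np.array([0, 1, 1, 3, 2, 0, 2, 3]),
    "model_C": np.array([0, 2, 2, 3, 1, 0, 1, 3]),
}

opa, kappa = [], []
for m1, m2 in combinations(scores, 2):       # every pair of models
    y1, y2 = scores[m1], scores[m2]
    opa.append(np.mean(y1 == y2))            # Overall Percent Agreement
    kappa.append(cohen_kappa_score(y1, y2))  # chance-corrected agreement

print(f"median pairwise OPA   = {np.median(opa):.1%}")
print(f"median pairwise kappa = {np.median(kappa):.2f}")
```

Because there is no gold-standard label, every model is simply compared against every other model, and the distribution of pairwise statistics is summarized (here by the median).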
Results
Patient Cohort:
- Median age: 65.
- 94.3% with de novo diagnoses; 98.6% female.
- Histology: 78.2% ductal, 15.3% lobular.
- HER2 status: 51.6% HER2-negative, 48.4% HER2-low or positive.
Agreement Across Models:
- Higher agreement for HER2 3+ (strongly positive) and 0 (negative) categories.
- Greater variability for intermediate categories (1+ and 2+).
- Agreement measures:
- Categorical (0, 1+, 2+, 3+): Median OPA = 65.1%; Cohen’s kappa = 0.51.
- Binary (e.g., 3+ vs. others): OPA = 97.3%; Cohen’s kappa = 0.86.
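As a rough illustration of why the binary comparison scores so much higher, the same kind of calculation can be rerun after collapsing categories to "3+ vs. all others." The toy arrays below are assumptions for demonstration only; merging 0, 1+, and 2+ into one class removes the intermediate-category disagreements that drag down the categorical numbers.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Invented categorical scores (0, 1+, 2+, 3+ coded as 0-3) from two models
model_a = np.array([0, 1, 2, 3, 1, 2, 3, 0])
model_b = np.array([0, 2, 1, 3, 2, 2, 3, 0])

# Collapse to a binary call: HER2 3+ vs. everything else
bin_a = (model_a == 3).astype(int)
bin_b = (model_b == 3).astype(int)

print(f"categorical OPA={np.mean(model_a == model_b):.1%}, "
      f"kappa={cohen_kappa_score(model_a, model_b):.2f}")
print(f"binary      OPA={np.mean(bin_a == bin_b):.1%}, "
      f"kappa={cohen_kappa_score(bin_a, bin_b):.2f}")
```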
Figures:
- Figure 1: Heatmap of HER2 scores by 7 models, showing clustering of samples with high or low HER2 scores.
- Figure 2: Confusion matrix highlighting most disagreements between 1+ and 2+ scores.
- Figure 3: Violin plots displaying pairwise agreement measures across models (OPA and kappa).
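For anyone who wants to build a Figure 2-style view from their own model outputs, a simple cross-tabulation of two models' calls shows where the 1+/2+ disagreements pile up. The data frame below is invented for illustration, not taken from the poster.

```python
import pandas as pd

# Invented scores from two hypothetical models (ASCO/CAP categories as strings)
df = pd.DataFrame({
    "model_A": ["0", "1+", "1+", "2+", "2+", "3+", "1+", "2+"],
    "model_B": ["0", "2+", "1+", "1+", "2+", "3+", "2+", "2+"],
})

# Rows: model_A calls; columns: model_B calls; off-diagonal cells are disagreements
print(pd.crosstab(df["model_A"], df["model_B"]))
```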
Conclusions
- HER2 3+ cases showed the least variability and highest agreement.
- Intermediate scores (1+, 2+) had significant inter-model variability.
- Trends in AI model agreement mirror those seen among pathologists.
This study provides insight into variability across AI models and supports developing best practices for AI-driven biomarker assessments.
Next Steps
- Deeper analyses on patient, specimen, and model characteristics.
- Comparison of AI model outputs with pathologist readings.
- Public meeting (February 4) to discuss findings, policy implications, and recommendations for reference sets.
This study reinforces the potential of AI models in clinical biomarker assessment while highlighting areas requiring standardization to ensure reproducibility and reliability. For details on data visualization, review Figures 1–3, which depict scoring trends and inter-model agreement.