Population-wide evaluation of artificial intelligence and radiologist assessment of screening mammograms

Johanne Kühl, Mohammad Talal Elhakim*, Sarah Wordenskjold Stougaard, Benjamin Schnack Brandt Rasmussen, Mads Nielsen, Oke Gerke, Lisbet Brønsro Larsen, Ole Graumann

*Corresponding author for this work

doi:10.1007/s00330-023-10423-7

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Objectives: To validate an AI system for standalone breast cancer detection on an entire screening population in comparison to first-reading breast radiologists.

Materials and methods: All mammography screenings performed between August 4, 2014, and August 15, 2018, in the Region of Southern Denmark with follow-up within 24 months were eligible. Screenings were assessed as normal or abnormal by breast radiologists through double reading with arbitration. For an AI decision of normal or abnormal, two AI-score cut-off points were applied by matching at mean sensitivity (AI_sens) and specificity (AI_spec) of first readers. Accuracy measures were sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and recall rate (RR).

Results: The sample included 249,402 screenings (149,495 women) and 2033 breast cancers (72.6% screen-detected cancers, 27.4% interval cancers). AI_sens had lower specificity (97.5% vs 97.7%; p < 0.0001) and PPV (17.5% vs 18.7%; p = 0.01) and a higher RR (3.0% vs 2.8%; p < 0.0001) than first readers. AI_spec was comparable to first readers in terms of all accuracy measures. Both AI_sens and AI_spec detected significantly fewer screen-detected cancers (1166 (AI_sens), 1156 (AI_spec) vs 1252; p < 0.0001) but found more interval cancers compared to first readers (126 (AI_sens), 117 (AI_spec) vs 39; p < 0.0001) with varying types of cancers detected across multiple subgroups.

Conclusion: Standalone AI can detect breast cancer at an accuracy level equivalent to the standard of first readers when the AI threshold point was matched at first reader specificity. However, AI and first readers detected a different composition of cancers.

Clinical relevance statement: Replacing first readers with AI with an appropriate cut-off score could be feasible. AI-detected cancers not detected by radiologists suggest a potential increase in the number of cancers detected if AI is implemented to support double reading within screening, although the clinicopathological characteristics of detected cancers would not change significantly.

Key Points:
• Standalone AI cancer detection was compared to first readers in a double-read mammography screening population.
• Standalone AI matched at first reader specificity showed no statistically significant difference in overall accuracy but detected different cancers.
• With an appropriate threshold, AI-integrated screening can increase the number of detected cancers with similar clinicopathological characteristics.
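
The accuracy measures and the idea of matching an AI cut-off to a reader's specificity can be illustrated with a minimal Python sketch. It is not the study's code: the abstract does not describe the exact matching procedure, so the rule used here (lowest cut-off whose specificity reaches the target, with a score at or above the cut-off counted as abnormal) is an assumption. The function names, the `Accuracy` container, and the eight toy screenings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Accuracy:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    recall_rate: float


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a measure is undefined (empty denominator).
    return numerator / denominator if denominator else float("nan")


def accuracy_measures(recalled, cancer):
    """Compute the five measures from paired booleans, one pair per screening."""
    pairs = list(zip(recalled, cancer))
    tp = sum(r and c for r, c in pairs)
    fp = sum(r and not c for r, c in pairs)
    fn = sum(not r and c for r, c in pairs)
    tn = sum(not r and not c for r, c in pairs)
    return Accuracy(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        recall_rate=_ratio(tp + fp, len(pairs)),
    )


def cutoff_matching_specificity(scores, cancer, target_specificity):
    """Lowest cut-off whose specificity is at least the target (assumed rule)."""
    for cutoff in sorted(set(scores)):
        recalled = [s >= cutoff for s in scores]
        if accuracy_measures(recalled, cancer).specificity >= target_specificity:
            return cutoff
    return None


# Toy data (hypothetical, not from the study): AI scores and cancer status.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cancer = [False, False, False, True, False, False, True, True]

cutoff = cutoff_matching_specificity(scores, cancer, target_specificity=0.75)
measures = accuracy_measures([s >= cutoff for s in scores], cancer)
print(cutoff, measures)
```

Matching at a reader's mean sensitivity would follow the same pattern with the comparison made on `sensitivity` instead; in both cases the remaining measures are then free to differ from the reader's, which is the kind of comparison the Results report.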

Original language	English
Journal	European Radiology
Volume	34
Issue number	6
Pages (from-to)	3935-3946
ISSN	0938-7994
DOIs	https://doi.org/10.1007/s00330-023-10423-7
Publication status	Published - Jun 2024

Keywords

Artificial intelligence
Breast cancer
Mammography
Screening

Documents & Links

10.1007/s00330-023-10423-7 (Licence: CC BY)

Open Access Version: Final published version, 1.02 MB (Licence: CC BY)

1 Ph.D. thesis

Large-scale validation of artificial intelligence for breast cancer detection in Danish mammography screening
Elhakim, M. T., 21 Feb 2024, Syddansk Universitet. Det Sundhedsvidenskabelige Fakultet. 162 p.
Research output: Thesis › Ph.D. thesis

1 Finished

MAGIC - Artificial intelligence for earlier and better detection of breast cancer in Danish mammography screening
Elhakim, M. T. (Project participant)
01/10/2020 → 31/12/2023
Project: PhD Project

Cite this

@article{362654a0153c437e860c34df06b19890,
title = "Population-wide evaluation of artificial intelligence and radiologist assessment of screening mammograms",
keywords = "Artificial intelligence, Breast cancer, Mammography, Screening",
author = "Johanne K{\"u}hl and Elhakim, {Mohammad Talal} and Stougaard, {Sarah Wordenskjold} and Rasmussen, {Benjamin Schnack Brandt} and Mads Nielsen and Oke Gerke and Larsen, {Lisbet Br{\o}nsro} and Ole Graumann",
year = "2024",
month = jun,
doi = "10.1007/s00330-023-10423-7",
language = "English",
volume = "34",
pages = "3935--3946",
journal = "European Radiology",
issn = "0938-7994",
publisher = "Springer",
number = "6",
}

TY - JOUR
T1 - Population-wide evaluation of artificial intelligence and radiologist assessment of screening mammograms
AU - Kühl, Johanne
AU - Elhakim, Mohammad Talal
AU - Stougaard, Sarah Wordenskjold
AU - Rasmussen, Benjamin Schnack Brandt
AU - Nielsen, Mads
AU - Gerke, Oke
AU - Larsen, Lisbet Brønsro
AU - Graumann, Ole
PY - 2024/6
Y1 - 2024/6
KW - Artificial intelligence
KW - Breast cancer
KW - Mammography
KW - Screening
U2 - 10.1007/s00330-023-10423-7
DO - 10.1007/s00330-023-10423-7
M3 - Journal article
C2 - 37938386
SN - 0938-7994
VL - 34
SP - 3935
EP - 3946
JO - European Radiology
JF - European Radiology
IS - 6
ER -
