Previously we reported on a major study evaluating AI as a tool to identify breast cancer on mammography.
The results were impressive. In the UK, AI performed almost as well as the screening programme: it would achieve similar sensitivity (67.9% compared with 67.4%) with only a small loss of specificity (93.0% compared with 96.2%). In the US, AI outperformed a human reader: sensitivity 57.5% (compared with 48.1%), specificity 86.5% (compared with 80.8%). An obvious question is: why are the results so much better in the UK?
Even restricting focus to cancers diagnosed within 12 months of mammography, the performance of both humans and AI in the UK was better than in the US. The human reader in the US had a sensitivity of 84% with a specificity of 81%, whereas in the UK the first reader achieved 88% sensitivity and 93% specificity. The AI system, too, performed better in the UK: there it essentially reproduced the performance of the first human reader, whereas in the US it was slightly worse than the human reader (and therefore substantially worse than in the UK).
Why do both AI and human readers perform less well in the US than in the UK?
If anything, one might expect the results to have been worse in the UK, because the UK arm of the study included breast cancers diagnosed up to 39 months after the mammograms were taken, compared with just 27 months in the US. The main difference between the US and the UK data is that nearly half (46.5%) of the US women were aged under 50, compared with just 6.6% of the UK women. Why should that make a difference? Well, before the menopause women are much more likely to have “dense” breasts than they are after the menopause, and dense breasts make it much harder to spot a cancer on mammography. For instance, in this paper the sensitivity – both human and computer – in women with extremely dense breasts was very poor (only 3 out of 12 cancers were detected by the human reader; 2 out of 12 by AI). But this cannot be the whole story, because even in women with non-dense breasts the sensitivity in the US was below 50% for the human reader (and no more than 60% for the AI system).
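For readers less familiar with the terminology, sensitivity is simply the proportion of true cancers that a reader flags. A minimal sketch in Python, using the dense-breast counts reported above (the function name is our own, for illustration only):

```python
def sensitivity(true_positives: int, total_cancers: int) -> float:
    """Fraction of actual cancers correctly detected by a reader."""
    return true_positives / total_cancers

# Dense-breast counts from the post: 12 cancers in total,
# of which the human reader found 3 and the AI system found 2.
human_dense = sensitivity(3, 12)
ai_dense = sensitivity(2, 12)

print(f"Human reader sensitivity (dense breasts): {human_dense:.0%}")  # 25%
print(f"AI sensitivity (dense breasts): {ai_dense:.0%}")               # 17%
```

Both figures are far below the roughly 48–68% overall sensitivities quoted earlier, which is what makes breast density such a plausible part of the explanation.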
Another possible explanation is that 65% of the UK cancers were detected at their initial (i.e. study baseline) screen, whereas only 44% of US cancers were. Cancers identified at the initial screen were, by definition, screen-detected, and screen-detected cancers tend to be easier for both humans and AI to spot.
The precise reason for the poorer performance of the AI system in the US than in the UK is unclear and needs to be investigated. US breast cancer screening typically uses only a single reader, and the AI system attained both better sensitivity (57% vs 48%) and better specificity (86% vs 81%) than the human reader. In the UK breast cancer screening programme, all mammograms are viewed by two readers and disagreements are adjudicated by a third reader. The AI system had similar sensitivity to the UK programme (68% vs 67%) with only a moderate loss of specificity (93% vs 96%).
It is to be hoped that the collation of hundreds of thousands of images from the US and the UK together with clinical follow-up will allow better understanding of the reasons for the differences in performance of breast screening in the two countries. Are mammograms better in the UK? Is there more overdiagnosis in the US? These are important questions for future studies to explore.
In our next post, I’ll explore what the new findings could mean in practice for screening in the UK.
Other links to this story
More on breast cancer screening in the UK
The views expressed are those of the author. Posting of the blog does not signify that the Cancer Prevention Group endorse those views or opinions.