Despite many technological advances in mammographic imaging over the last century, breast cancer screening is largely the same as it was when first studied 60 years ago. All that is perhaps about to change. Whereas many of the advances in recent years have been linked to producing better images, the sea change we are about to witness will be related to how “we” look at those images. I’ve put “we” in inverted commas because the revolution involves computers interpreting mammograms alongside people. And new research, making headlines this week, shows that they can do so as well as radiologists – in 2020, it seems as though computers are acquiring 20/20 vision! And if artificial intelligence (AI) is as good as humans today, you can be sure that it will be outperforming the best radiologists within a few years.
As the new paper, published in Nature, concludes: “This robust assessment of the AI system paves the way for prospective clinical trials to improve the accuracy and efficiency of breast cancer screening.” In this post, I (Peter Sasieni) will discuss the top-line results; in future posts, I’ll explore the questions the study raises.
Building the AI System
The study was a collaboration between Google Health, DeepMind, Cancer Research UK-funded researchers at Imperial College and elsewhere in England, and researchers in the USA. It is a little difficult to follow exactly how the artificial intelligence system was developed, but it appears that anonymised screening mammography images from just over 14,000 UK women were used to train the system, and nearly 77,000 more to fine-tune it. Additionally, mammograms from some 12,000 US women were used to train the system further. But whereas all screening mammograms were included from the UK, only a 5% sample of the women without breast cancer was included in the US training set, so approximately 22% of the 12,000 US women included had breast cancer.
That sounds like a lot of images, but the authors wanted more! They used software to create random perturbations of the actual images using elastic deformation (imagine the image is on a trampoline and you pull the surface of the trampoline at a point), shearing (imagine stretching the image between two hinged arms), and rescaling (turn a large breast into a small breast or vice versa).
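To make the augmentation idea concrete, here is a minimal sketch of two of those perturbations – shearing and rescaling – using simple NumPy index arithmetic with nearest-neighbour sampling. This is purely illustrative and is not the paper’s actual augmentation pipeline (the function names and parameter ranges are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shear(image, max_shear=0.2):
    """Shift each row sideways in proportion to its depth (a horizontal shear)."""
    h, w = image.shape
    s = rng.uniform(-max_shear, max_shear)
    rows, cols = np.indices((h, w))
    # Map each output pixel back to a (clipped) source column.
    src_cols = np.clip((cols - s * rows).round().astype(int), 0, w - 1)
    return image[rows, src_cols]

def random_rescale(image, scale_range=(0.8, 1.2)):
    """Shrink or enlarge the image by a random factor (nearest-neighbour)."""
    h, w = image.shape
    f = rng.uniform(*scale_range)
    new_h, new_w = max(1, int(h * f)), max(1, int(w * f))
    rows = (np.arange(new_h) / f).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / f).astype(int).clip(0, w - 1)
    return image[np.ix_(rows, cols)]

img = rng.random((64, 64))   # stand-in for a mammogram
aug = random_rescale(random_shear(img))
```

Each call produces a slightly different image, so the same mammogram can be shown to the network many times without it simply memorising the pixels.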
In order to train the system, the authors wanted to define the “truth” (breast cancer or no breast cancer) based on a diagnosis of breast cancer resulting from the screening mammogram, in the interval between screens, or as a result of the next screen. In practice, they used breast cancer diagnosis within 27 months of the screening mammography in the US (where breast cancer screening is carried out with an interval of between one and two years between screens) and 39 months in the UK (where breast screening is offered once every 36 months). The computer did not have access to past images (from previous screening rounds) but was told the age of the woman.
The AI system used three deep learning models, each trying to identify breast cancer at a different level: one classifies individual lesions; one classifies each breast separately; and one classifies the woman as a whole. Each model, it would seem, assigns a probability of breast cancer, and the final system uses the average (mean) of these scores.
Having developed a system that they were happy with, the researchers tested it on anonymised images from 25,856 women (414 of whom had breast cancer) from the UK and 3,097 women (686 of whom had breast cancer) from the US. The system was used retrospectively to see how it might have done if used on the original images. The AI system was not used to alter the management of the women in the study.
On the UK test cases, the AI system was able to detect 68% of breast cancers (sensitivity 68%) whilst incorrectly flagging only 7% of women who did not have breast cancer (specificity 93%). For comparison, the paper reports that a single reader in the UK has a sensitivity of 63–69% with a specificity of 92–94%, and the consensus result has a sensitivity of 67% with a specificity of 96%.
Both the AI system and the human readers performed less well on the US test set, but the AI system performed better than the human reader: sensitivity 53% (compared with 48% for the human reader), specificity 86.5% (compared with 81%).
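For readers unfamiliar with these two metrics, they come straight from the counts of correct and incorrect calls. The sketch below uses counts back-calculated from the UK test set (25,856 women, 414 with cancer) and the reported 68%/93% figures – they are illustrative, not the paper’s actual confusion matrix:

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Sensitivity: share of cancers flagged. Specificity: share of non-cancers cleared."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# ~282 of 414 cancers detected; ~1,781 of 25,442 cancer-free women flagged.
sens, spec = sensitivity_specificity(282, 132, 23661, 1781)
```

Note that because far more women are cancer-free than have cancer, even a 7% false-positive rate means the flagged group is dominated by women without cancer.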
The authors discuss two specific applications of the AI system. The first is to use it in place of a second reader (second readers are routinely used in the UK). If the AI system agrees with the first reader, there would be no need for a second human reader. The authors estimate that, used in this way, the AI system could reduce the number of women whose mammograms need to be read by a second reader by 88%. The overall performance of the screening programme would be very similar to that achieved currently.
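The triage logic in that first application is simple enough to write down. A sketch under my own naming (in UK practice, disagreements typically go to arbitration or consensus rather than a lone second opinion, so this is a simplification):

```python
def second_reader_needed(first_reader_recalls: bool, ai_recalls: bool) -> bool:
    """A second human reader is consulted only when the AI disagrees with the first reader."""
    return first_reader_recalls != ai_recalls

def final_decision(first_reader_recalls, ai_recalls, second_reader_recalls=None):
    """Return the recall decision, deferring to the second reader on disagreement."""
    if not second_reader_needed(first_reader_recalls, ai_recalls):
        return first_reader_recalls
    return second_reader_recalls
```

If the AI agrees with the first reader 88% of the time, only the remaining 12% of mammograms would reach a second human – which is where the claimed workload saving comes from.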
The second application discussed by the authors is to identify a group of women with extremely low risk of breast cancer. In the UK test set, they estimate that the AI system could identify 40% of women at very low risk and that their risk (of breast cancer diagnosed now or in the next 12 months) would be 1 in 10,000.
In 2020, AI systems will not be taking over from humans, but the evidence from this study is that they have a clear role in supplementing human readers. But it also raises some interesting questions, which I’ll be exploring in a series of posts over the coming days.
Other links to this story
More on breast cancer screening in the UK
The views expressed are those of the author. Posting of the blog does not signify that the Cancer Prevention Group endorses those views or opinions.