This blog post was written by Amy Tickle, a third-year PhD student supervised by Professor Peter Sasieni. Amy’s PhD is looking at monitoring progress in early diagnosis of cancer. Amy completed her master’s degree in Epidemiology at Imperial College London in 2019. She recently joined the Equality, Diversity, and Inclusivity Group within the Cancer Prevention Group (CPG).
For researchers to make valid recommendations for cancer prevention, high quality data (on cancer diagnoses and outcomes) are needed. Whilst extensive research can be done with the data currently available, issues with access and availability often arise.
Research can be conducted using either individual-level data, with values for every characteristic of each patient, or aggregate-level data, where individual-level data are combined. Whilst individual-level data are most favourable for epidemiological research, access can be restrictive and time-consuming. However, using any data, regardless of the level, comes with its challenges.
In a previous blog post, I discussed the rationale and key aims of my PhD. To recap, my research is looking at monitoring progress in early cancer diagnosis, with a focus on investigating the best metrics to monitor specific early diagnosis initiatives before mortality data are available. Identifying the best early indicators for a reduction in mortality will allow for the timely evaluation of initiatives that could otherwise take many years to conduct.
To carry out my research, I will need access to a large amount of data. Whilst I have experienced numerous delays with accessing and analysing data myself, further limiting factors with data access have been highlighted to me during conversations with experts in the field. This post will discuss the various limitations I have encountered, starting with the limitations of using individual-level data.
Limitations of using individual-level data
Although individual-level data are the most favourable for research, their use is somewhat restricted. My PhD consists of three projects. For the final project, I will be using individual-level data to investigate the best breast screening measures for predicting breast cancer mortality. This project is still being planned, but already there have been several issues with data access.
The measures I will use in this project were selected using the findings from a previous project in my PhD. At the beginning of my PhD, I set up interviews and focus groups with experts to find out which measures they would recommend for predicting a reduction in breast cancer deaths. Now, I plan to validate these measures using data from a real-life screening programme.
Whilst the aim of the discussions was to identify the best measures to look at using real-life data, the interviews also gave me a valuable insight into the issues researchers face when working with pre-existing data from registries.
My own experience
Before I discuss the issues indicated by experts, I would like to highlight my own experiences with data access. Originally, I planned to validate the measures suggested in the interviews and focus groups using English data. This would require data from the NHS Breast Screening Programme, which are only available in aggregate format, and data on breast cancer outcomes, which are currently only available in individual-level format. Accessing the latter would require me to submit a data application to the National Cancer Registration and Analysis Service (NCRAS). However, as the screening data are only available at an aggregate level, looking at both datasets together would mean aggregating the individual-level breast cancer outcome data to the breast screening unit (BSU) level. This is where I encountered the first of my problems.
Issue 1: Lack of geographical data
Whilst planning to apply for NCRAS data, I soon discovered that each woman’s breast screening unit is not available. The closest geographical variable that could be used to estimate BSU is postcode; however, because full postcodes are protected under the common law duty of confidentiality, they are not available to researchers. As a result, I have found the cancer registration data available in England to be unsuitable for aggregating to a given geographical level unless the relevant geographical variable is already in the dataset. This has led me to look at alternative options for accessing data for this project, including accessing data from outside the UK.
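To illustrate why the missing variable matters, here is a minimal Python sketch of the aggregation step. It assumes a hypothetical lookup from a coarser geography (a postcode district) to a BSU. The records, field names and lookup are all invented for illustration and do not reflect the actual structure of NCRAS or NHS Breast Screening Programme data.

```python
from collections import defaultdict

# Hypothetical lookup from postcode district to breast screening unit (BSU).
# This lookup is invented for illustration; it is not a real data product.
DISTRICT_TO_BSU = {"AB1": "BSU_A", "AB2": "BSU_A", "CD1": "BSU_B"}

# Hypothetical individual-level outcome records (one dict per woman).
records = [
    {"district": "AB1", "died_of_breast_cancer": True},
    {"district": "AB2", "died_of_breast_cancer": False},
    {"district": "CD1", "died_of_breast_cancer": True},
    {"district": "EF9", "died_of_breast_cancer": False},  # no BSU match
]


def aggregate_to_bsu(records, lookup):
    """Count cases and deaths per BSU, plus the records that cannot be assigned."""
    counts = defaultdict(lambda: {"cases": 0, "deaths": 0})
    unmatched = 0
    for rec in records:
        bsu = lookup.get(rec["district"])
        if bsu is None:
            # Without a usable geographical variable the record is lost.
            unmatched += 1
            continue
        counts[bsu]["cases"] += 1
        counts[bsu]["deaths"] += int(rec["died_of_breast_cancer"])
    return dict(counts), unmatched


bsu_counts, n_unmatched = aggregate_to_bsu(records, DISTRICT_TO_BSU)
print(bsu_counts, n_unmatched)
```

Any record whose geography cannot be mapped drops out of the aggregated totals, and a coarse geography that straddles two units would be assigned wholly to one of them. Both are reasons why an estimated BSU is a weaker substitute for the actual variable.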
Common issues encountered by experienced researchers
As previously mentioned in this post, I have also been made aware of several issues faced by experts in the field.
Issue 2: Missing stage data
During the interview and focus group discussions in my first project, breast screening experts highlighted missing ‘stage at diagnosis’ data as one of the main limiting factors in their research.
The National Disease Registration Service records the level of missing stage data as a proportion of all cancers diagnosed per year (CancerData). Although this proportion has decreased over recent years, problems persist. The key issue here is that we cannot assume stage at diagnosis when it is missing. There could be multiple reasons why a cancer is registered without a stage; it could be that the cancer simply wasn’t staged, or that it was diagnosed too late to perform the necessary diagnostic investigations. The latter possibility could lead some researchers to assume that a cancer with no recorded stage is late stage; however, assuming a stage when one has not been recorded could have a detrimental effect on an investigation’s conclusions.
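A small worked example shows how much the assumption can matter. The Python sketch below uses invented counts and compares the proportion of late-stage cancers under three assumptions about the unstaged cases. It is purely illustrative and is not the method used in my own analyses.

```python
def late_stage_proportions(early, late, missing):
    """Return the late-stage proportion under three assumptions about missing stage."""
    staged = early + late
    total = staged + missing
    if staged == 0:
        raise ValueError("at least one staged cancer is required")
    return {
        # Ignore unstaged cancers entirely (complete-case analysis).
        "complete_case": late / staged,
        # Treat every unstaged cancer as early stage (lower bound).
        "all_missing_early": late / total,
        # Treat every unstaged cancer as late stage (upper bound).
        "all_missing_late": (late + missing) / total,
    }


# Hypothetical counts: 600 early stage, 250 late stage, 150 with no stage recorded.
proportions = late_stage_proportions(early=600, late=250, missing=150)
for assumption, value in proportions.items():
    print(f"{assumption}: {value:.1%}")
```

With these invented counts, the late-stage proportion ranges from 25% to 40% depending only on what is assumed about the 150 unstaged cancers, which is why a blanket assumption in either direction is risky.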
Whilst the issue of missing stage data was first highlighted to me by experts, I have now experienced the difficulties of missing stage in my own research. The second project within my PhD involved investigating associations between measures of diagnostic activity and late-stage cancers and survival, following the national bowel Be Clear on Cancer campaign in England. The aim of the project was to identify which early measures should be monitored following symptom awareness campaigns. To overcome missing stage data in this project, I used a form of Poisson regression adapted to deal with missing data, using the command intcount in Stata/MP 16.0.
Issue 3: Interval cancers
Further issues highlighted by experts involved the rate of interval cancers for breast screening evaluation. Interval cancers are cancers diagnosed between a regular screening appointment that appears normal and the next screening appointment. This measure is often prioritised during programme evaluation, since a high rate of interval cancers would indicate that cancers are being missed during screening. In data registries, interval cancers are recorded under symptomatic cancers (cancers diagnosed based on symptoms), which have less complete data than screen-detected cancers. Whilst data on screen-detected cancers are around 95% complete, approximately 27% of stage data are missing for symptomatic cancers. This not only makes the two difficult to compare, but also leads to issues when calculating rates of interval cancers.
Issue 4: Lack of variables available in screening data
The final key limitation of currently available screening data I will mention is a lack of variables for protected characteristics. A main goal for the Operational Leads of the breast screening programme is to increase uptake in groups who are less likely to attend screening. In the interviews, frustration was evident over missing variables in the NHSBSP data that could highlight groups of non-attenders, particularly ethnicity. This matters because previous studies have shown that ethnic minority groups are less likely to accept their screening invitation than White British women, by up to 17%.
Limitations of using aggregate data
Thankfully, I didn’t encounter as many issues when using aggregate data, although as previously mentioned, the use of aggregate data for research is less desirable than using individual-level data. However, it is worth mentioning that aggregating individual-level data to a desired geographical level without the correct variables is currently quite difficult. The accuracy of these aggregated datasets could be improved if researchers could request that data providers aggregate certain variables before providing the data, essentially offering a bespoke package once the requirement is adequately justified by the researcher. This could replace researchers aggregating the data themselves using the closest geographical variable available, rather than the actual, desired variable.
I myself have used aggregate data in my second PhD project, looking at how best to monitor symptom awareness campaigns. This project relied on the use of aggregate data available online at the level of Clinical Commissioning groups (CCGs).
Issue 5: Changing CCG boundaries
One key issue that I found during these analyses was the availability of data over time. Numerous variables are easily accessed directly online via Fingertips or the Office for National Statistics; however, the geographical breakdown of variables varies between years. CCG sizes and boundaries have changed numerous times over the last decade, meaning that when I needed to access additional data for my analyses, they were not compatible with my pre-existing data due to different CCG boundaries. Additionally, when aggregate data are available, they are not always well-defined, making it difficult to meaningfully interpret the results.
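For the simplest kind of boundary change, where several older CCGs merge into one, counts can in principle be re-aggregated to the newer boundaries. The Python sketch below shows the idea using an invented mapping and invented counts. The CCG codes are hypothetical, and the approach assumes mergers only.

```python
from collections import defaultdict

# Hypothetical mapping from older CCG codes to the current CCG they merged into.
OLD_TO_NEW_CCG = {"CCG_01": "CCG_X", "CCG_02": "CCG_X", "CCG_03": "CCG_Y"}

# Hypothetical counts (for example, urgent referrals) published on the older boundaries.
old_counts = {"CCG_01": 120, "CCG_02": 80, "CCG_03": 50}


def harmonise_counts(counts, mapping):
    """Sum counts from older CCGs into the current CCGs they now belong to."""
    harmonised = defaultdict(int)
    for old_code, value in counts.items():
        if old_code not in mapping:
            # Fail loudly rather than silently dropping an area.
            raise KeyError(f"no current CCG known for {old_code}")
        harmonised[mapping[old_code]] += value
    return dict(harmonised)


new_counts = harmonise_counts(old_counts, OLD_TO_NEW_CCG)
print(new_counts)
```

This only works for counts. Rates and percentages cannot simply be summed, so they would need to be rebuilt from their numerators and denominators. It also does not handle a CCG that was split between several successors, which would require further information to apportion the data.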
Limitations from an equality perspective
Based on my experiences when using data, I have also begun to question the impact of these limitations on our population. Recently, I joined the Equality, Diversity, and Inclusion group within the CPG, where I have been part of some very interesting discussions of how data limitations affect certain minority groups that already experience a myriad of inequalities. Disparities in cancer survival rates already exist for those with lower socioeconomic status, those from minority groups, those from more deprived areas and those with disabilities. It is essential to be able to identify the extent of these disparities, and how to address them. Without appropriate data on these characteristics, this is difficult, if not almost impossible.
For this reason, as well as to enable researchers to make valid recommendations to prevent cancer, it is essential that we can access high quality, detailed data in registries and within screening programmes. To achieve this, we need to advocate for improvements to be made.
The views expressed are those of the author. Posting of the blog does not signify that the Cancer Prevention Group endorse those views or opinions.