How can we use ‘Big Data’ to improve early cancer detection?
I’m a cancer data scientist in the BRC’s Prediction and Early Detection Theme. My research focuses on people with Type 2 diabetes. It aims to identify those who are most likely to have a cancer diagnosis in the next 10 years and to get them into suitable cancer screening programmes.
So how do we go about identifying those people to catch cancer early?
We could conduct a clinical trial, identify people with Type 2 diabetes and invite them in for screening and an interview every year. However, this would be expensive and take a very long time. In practical terms, it is more helpful to know who in the current population has the greatest likelihood of developing cancer in the next 5-10 years.
Is there already data being collected that could be used?
Yes, the interesting part of my project is that I have access to data that is routinely collected by medical professionals during patient consultations. This is anonymous data provided securely, and all projects go through ethical review. This means that not only do I have information on people’s current health, but I can also look back at the historic data and try to identify other key factors.
Data from GP records provides the ideal set of information. It means that a research project does not need to collect any new data, which would take up patients’ time and require expensive new data warehouses to store it.
What do we mean by ‘Big Data’?
This is a general term used in the media to describe large collections of information, both structured and unstructured. This can range from the information supermarkets collect on shoppers to the information in NHS patient records.
In my research I have anonymised information on 330,000 people with Type 2 diabetes. Each person has multiple rows of data showing when they visited the GP, what drugs were prescribed and what conditions were diagnosed. This information extends back to when GP systems were first computerised in the 1990s. It is then linked with Cancer Intelligence Service Data. This means each person can have thousands and thousands of rows of data, so it’s easy to see how the databases become ‘Big Data’.
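To make the "many rows per person" idea concrete, here is a minimal sketch of how such records might be grouped and linked. The field names, patient identifiers and example rows are all invented for illustration; they are not the real schema of the GP or cancer datasets described above.

```python
from collections import defaultdict

# Hypothetical GP events in "long" format: one row per recorded event.
# All identifiers, dates and values are made up for illustration.
gp_events = [
    {"patient_id": "P1", "date": "1998-03-02", "kind": "diagnosis", "value": "type 2 diabetes"},
    {"patient_id": "P1", "date": "2004-11-15", "kind": "weight_kg", "value": "92.0"},
    {"patient_id": "P1", "date": "1998-03-02", "kind": "prescription", "value": "metformin"},
    {"patient_id": "P2", "date": "2001-06-20", "kind": "diagnosis", "value": "type 2 diabetes"},
]

# Hypothetical cancer data extract keyed by the same anonymised identifier.
cancer_registry = {"P1": {"diagnosis_date": "2010-01-12", "site": "colorectal"}}


def link_records(events, registry):
    """Group event rows by patient and attach any matching cancer record."""
    grouped = defaultdict(list)
    for row in events:
        grouped[row["patient_id"]].append(row)

    linked = {}
    for patient_id, rows in grouped.items():
        linked[patient_id] = {
            # ISO-format date strings sort chronologically as plain text.
            "events": sorted(rows, key=lambda r: r["date"]),
            # None when the patient has no entry in the cancer data.
            "cancer": registry.get(patient_id),
        }
    return linked


linked = link_records(gp_events, cancer_registry)
for patient_id, record in linked.items():
    has_cancer_record = record["cancer"] is not None
    print(patient_id, len(record["events"]), "rows, cancer record:", has_cancer_record)
```

With decades of consultations per person, the `events` list for a single patient can easily run to thousands of rows, which is where the scale comes from.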
How can we use ‘Big Data’ in cancer research?
To identify who is at high risk we need to know about a person’s medical history and characteristics, for example whether you smoke or how old you were when you developed diabetes. These go on to form what are known as risk factors.
A key area of my research is looking into people’s weight change over time and whether this can increase (or decrease) cancer risk. For example, people often lose weight when they become unwell, but by looking back in a medical record to before someone is diagnosed with a condition we can get a more representative measure. These key pieces of information are put into a risk (mathematical) model and used to calculate your risk of cancer, or to identify groups of people who have a higher than average risk.
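As a rough illustration of these two steps, the sketch below first measures weight change using only readings taken well before a diagnosis, then feeds risk factors into a simple logistic model. This is an assumption-laden example, not the model used in this research: the one-year exclusion window, the choice of a logistic model, the feature names and every coefficient are placeholders invented for illustration.

```python
import math
from datetime import date


def weight_change_before(measurements, diagnosis_date, exclude_days=365):
    """Weight change in kg between the earliest and latest readings taken
    at least `exclude_days` before diagnosis (window length is an assumption).
    Returns None if fewer than two readings qualify."""
    usable = sorted(
        (day, kg) for day, kg in measurements
        if (diagnosis_date - day).days >= exclude_days
    )
    if len(usable) < 2:
        return None
    return usable[-1][1] - usable[0][1]


def logistic_risk(features, coefficients, intercept):
    """Turn risk factors into a probability between 0 and 1."""
    score = intercept + sum(coefficients[name] * value
                            for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Invented readings: the last one falls inside the excluded year, when
# illness-related weight loss could distort the picture.
readings = [
    (date(2005, 1, 10), 95.0),
    (date(2008, 6, 1), 90.0),
    (date(2010, 3, 1), 82.0),
]
change = weight_change_before(readings, diagnosis_date=date(2010, 6, 1))

# Placeholder coefficients, NOT estimates from any real study.
coefficients = {"age_at_diabetes": 0.02, "smoker": 0.5, "weight_change_kg": -0.03}
features = {"age_at_diabetes": 52, "smoker": 1, "weight_change_kg": change}
risk = logistic_risk(features, coefficients, intercept=-4.0)
print(f"weight change: {change} kg, illustrative risk: {risk:.3f}")
```

In a real analysis the coefficients would be estimated from the data rather than chosen by hand, and the exclusion window would be justified clinically.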
The more data we have, the more sophisticated our risk models can get. An example of this is the work conducted on breast cancer risk prediction models, where Manchester is leading the way. The best models identify those people who are definitely at greater risk, but crucially exclude those who are not. This means we’re able to avoid putting people through additional screening or tests which can be stressful and scary…
There are concerns about using routinely collected data, and about having access to lots of information on individuals. However, some people would argue it would be wrong not to make the best use of routinely collected data if it could improve health. In fact, I’ve been told by one patient: “As a cancer patient I want, NAY demand that my data is used, safely and securely, for research and patient benefit.”
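One common way to put numbers on "identify those at greater risk, but exclude those who are not" is to measure sensitivity and specificity. The sketch below is a generic illustration with invented outcomes; it is an assumption that these particular measures are relevant here, since the text above does not say how the models are evaluated.

```python
def sensitivity_specificity(predicted_high_risk, developed_cancer):
    """Compare model flags against what actually happened.

    Sensitivity: share of people who developed cancer that were flagged.
    Specificity: share of people who did not that were correctly left out.
    Returns None for a measure whose denominator is zero."""
    tp = fp = tn = fn = 0
    for flagged, actual in zip(predicted_high_risk, developed_cancer):
        if flagged and actual:
            tp += 1
        elif flagged and not actual:
            fp += 1
        elif not flagged and not actual:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Invented outcomes for six people, for illustration only.
flagged = [True, True, False, False, True, False]
actual = [True, False, False, False, True, True]
sens, spec = sensitivity_specificity(flagged, actual)
print(f"sensitivity: {sens:.2f}, specificity: {spec:.2f}")
```

A model with high specificity sends fewer people who would never develop cancer for extra tests, which is the benefit described above.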
What’s next?
Supported by BRC funding, I’ll be continuing to analyse GP record data to explore the theory that those with Type 2 diabetes may have an increased cancer risk. This is just one example of how routinely collected data can be used in cancer research. The BRC PED team will be looking for opportunities to interrogate routinely collected data in other scenarios over the course of the project.