AI is becoming increasingly sophisticated and offers promising opportunities to reinvent the healthcare industry. The potential uses of AI are vast, and the global pandemic has cemented the tremendous role that data science and analytics play in providing crucial, timely insights.
Yet the development and use of AI algorithms does not come without challenges that data scientists must address in their day-to-day work.
The biggest challenge in health data science is gaining access to high-quality data. Access is blocked by privacy regulations, technological restrictions, and risk aversion to data sharing on the part of data custodians or ethics committees. Health systems generate a wealth of sensitive data every day but are unable to realise its full value. Data is locked in complex, siloed environments, sometimes governed by legacy systems that neither interface easily with the latest technologies nor can be migrated as fast as one would hope.
Without timely access to sensitive data, AI applications and ML model development are delayed and limited. Yet the insights this data provides to practitioners are the key to improving all aspects of healthcare and better serving patients.
Healthcare data is extensive, yet rarely as clean or organized as a data scientist would like. Before building the actual models, data scientists spend up to 80% of their time and effort getting the data right: understanding the underlying relationships and ensuring that the dataset is representative of the use case and ready to be used. Missing data is common in medical datasets, and if values are missing for some systematic reason, this must be accounted for in the modelling process.
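As a small illustration of that first data-quality pass, the sketch below (using a hypothetical, made-up patient table, not any real dataset) measures how much of each column is missing before any modelling begins:

```python
import pandas as pd
import numpy as np

# Hypothetical patient records with gaps typical of real clinical data.
records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age": [54, 61, np.nan, 47, 70],
    "systolic_bp": [130, np.nan, 122, np.nan, 145],
    "smoker": ["no", "yes", None, "no", None],
})

# Fraction of missing values per column -- a first check before modelling.
missing_rate = records.isna().mean()

# Flag columns whose missingness exceeds a chosen threshold (here 30%).
needs_review = missing_rate[missing_rate > 0.30].index.tolist()
print(missing_rate.round(2).to_dict())
print(needs_review)
```

A check like this only surfaces how much is missing; whether the gaps are random or systematic (and how to handle them) still requires clinical context.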
ML models are only as good as the data with which they are developed, and in high-stakes applications such as healthcare, data quality and coverage are of the utmost importance.
Lack of access to large, diverse medical datasets can lead to biased algorithms. It all comes down to the training dataset that is used. If there are gaps in the training dataset, those gaps will affect the usability and utility of the AI application, ultimately affecting real people.
Recent analysis in the US showed that many digital-image AI algorithms are trained with data from academic institutions in just three states, potentially resulting in bias and incomplete coverage. Using more complete and representative data reduces bias in the output.
The way the data is collected can have an equally tremendous effect. For example, data collected in an inpatient setting may reflect more severe patients who are already diagnosed with a specific disease or are at higher risk of developing it, yet not truly reflect incidence in the general population, as less-severe patients may be managed in primary care.
There is growing concern that algorithms may reproduce racial and gender disparities. For example, algorithms trained with gender-imbalanced data perform worse at reading chest x-rays for an underrepresented gender, and researchers are already concerned that skin-cancer detection algorithms, many of which are trained primarily on light-skinned individuals, are worse at detecting skin cancer affecting darker skin.
Given the consequences at scale of an incorrect decision, medical AI algorithms need to be trained with datasets drawn from diverse population groups across race, gender and geography. But there is more to it than providing diverse, representative training data, as that alone does not guarantee the elimination of bias. A conscious, proactive approach to ensuring fairness in the data, and to identifying and mitigating bias in the dataset, is crucial in healthcare.
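One basic step in that proactive approach is disaggregated evaluation: reporting a model's performance separately for each population group rather than as a single overall score. The sketch below uses invented labels, predictions and group attributes purely for illustration:

```python
from collections import defaultdict

# Hypothetical labels and predictions for a screening model, with a
# group attribute for each patient (illustrative values only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Accuracy broken down by group -- a basic disaggregated evaluation.
hits, totals = defaultdict(int), defaultdict(int)
for t, p, g in zip(y_true, y_pred, group):
    hits[g] += int(t == p)
    totals[g] += 1
per_group_accuracy = {g: hits[g] / totals[g] for g in totals}
print(per_group_accuracy)
```

A large gap between groups in a report like this does not by itself pinpoint the cause, but it flags where the training data or model warrants closer scrutiny.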
Technology advancements in cloud computing, clinician mobility, and remote patient data access bring unquestionable benefits towards improving healthcare. Yet they also trigger increased concerns over the security and privacy of such sensitive information. The extensive financial costs of any error in data sharing, or of non-compliance, provide strong incentives to protect data at all costs, impeding innovation in the process.
Data privacy risks drive a very conservative approach to sharing data or linking it across different sites or trusts. Fear of negative news headlines and the extensive financial and reputational costs of data breaches may be an extra factor in that discussion.
Data guardians are understandably reluctant to share patient data with researchers, even though such sharing carries positive implications for humanity, and medical progress could be accelerated if data were more accessible.
So how do we put patients first both in terms of their right to privacy and their health outcomes? How can we treat more patients, more efficiently, while respecting the sanctity of their data? Where is the ethical balance between avoiding privacy risks and ensuring wider population benefits?
In subsequent blogs we'll explore these questions and possible solutions in depth.