Mining patient medical data

The goal of this project is to mine large collections of patient medical histories to discover unknown environmental and genetic causes of diseases.

Background

Humans are very complex organisms. Many diseases (e.g., diabetes, multiple sclerosis, Alzheimer's) that affect a large fraction of the population, especially at older ages, most likely have interwoven genetic and environmental (e.g., food, chemicals, pathogens) causes. There have been extensive efforts to identify the genetic causes of diseases (association studies, candidate-gene approaches). Many genes have been identified and collected in databases such as OMIM.

Large datasets are extremely powerful, and medical data is no exception. In the past few years, Medicare patients' medical data has been mined for disease correlations by the Barabasi Lab (Lee et al., Park et al., Hidalgo et al.). When a patient goes to the hospital, every diagnosed condition and disease is recorded as an ICD-9 code. This data contains clues about which pairs of diseases (a) co-occur, (b) avoid each other, or (c) show no correlation at all. These disease relationships are determined by comparing the observed co-occurrence rate to the random expectation. We will skip the mathematical details of how to calculate the relative risks and their confidence intervals here.
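To make the idea of comparing observed occurrence to expectation concrete, here is a minimal sketch of a relative risk with a 95% confidence interval. It uses the textbook cohort definition (risk among patients with disease A divided by risk among patients without it) and the standard log-scale (Katz) interval. This is an assumption for illustration and may differ from the exact formulas used in the papers cited above. The function names and the counts in the example are hypothetical.

```python
import math

def relative_risk(a, b, c, d, z=1.96):
    """Relative risk of disease B given disease A, with a log-scale CI.

    a: patients with A and B       b: patients with A, without B
    c: patients without A, with B  d: patients with neither
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    risk_with_a = a / (a + b)
    risk_without_a = c / (c + d)
    rr = risk_with_a / risk_without_a
    # Standard error of ln(RR), Katz log method.
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper

def classify(lower, upper):
    """Label a disease pair from the confidence interval of its RR."""
    if lower > 1:
        return "comorbid"
    if upper < 1:
        return "antimorbid"
    return "no significant correlation"

# Hypothetical counts, not taken from the Medicare data.
rr, lower, upper = relative_risk(40, 960, 100, 900)
print(rr, lower, upper, classify(lower, upper))
```

A pair is called comorbid or antimorbid only when the whole interval lies above or below 1; otherwise the data do not distinguish it from the random expectation.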

This project

I hope to bridge the gap between molecular-level research on disease pathways and population-level data analysis. To do this, I am currently focusing on three diseases: (1) spinocerebellar ataxia, (2) osteoporosis, and (3) Parkinson's disease.

The initial step in the project is to identify the diseases that are comorbid (co-occur) or antimorbid (avoid each other) with each of these diseases. I will use osteoporosis as the example here. Parkinson's data is available here.
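As a rough sketch of this first step, the counts needed to test one disease pair can be tallied directly from per-patient sets of ICD-9 codes. The record layout (a dictionary from patient ID to a set of codes), the function name, and the toy records are all hypothetical; the code "733.0" is assumed here to stand for osteoporosis.

```python
def contingency_counts(records, index_code, other_code):
    """Tally the 2x2 table for one pair of ICD-9 codes.

    records: dict mapping a patient ID to the set of ICD-9 codes
             recorded for that patient (hypothetical layout).
    Returns (a, b, c, d):
      a: has both codes          b: has index_code only
      c: has other_code only     d: has neither
    """
    a = b = c = d = 0
    for codes in records.values():
        has_index = index_code in codes
        has_other = other_code in codes
        if has_index and has_other:
            a += 1
        elif has_index:
            b += 1
        elif has_other:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Toy records for illustration only.
toy_records = {
    "p1": {"733.0", "250.4"},
    "p2": {"733.0"},
    "p3": {"250.4"},
    "p4": set(),
}
print(contingency_counts(toy_records, "733.0", "250.4"))
```

Repeating this tally for every code against the disease of interest gives the inputs for the relative risk of each pair.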

The osteoporosis analysis was done only on female patients because of the disease's high prevalence among women. The analysis shows that the diseases and conditions listed in Table 1 are antimorbid with osteoporosis, which means that women who have osteoporosis are less likely to have these diseases than women who do not have osteoporosis.

ICD-9 code    Name of disease                                                          RR (95% conf)
218.2         Leiomyoma of uterus                                                      0.43
250.4         Diabetes with renal manifestations                                       0.47
097.1         Latent syphilis                                                          0.48
611.3         Fat necrosis of breast                                                   0.33
362.02        Proliferative diabetic retinopathy                                       0.43
E94.47        Uric acid metabolism drugs causing adverse effects in therapeutic use    0.33

Table 1: A few of the diseases/conditions that are antimorbid with osteoporosis.



The second step in the analysis is to determine whether genetic factors are known for any of the diseases that co-occur with or avoid osteoporosis. For example, researchers found that human erythropoietin, a glycoprotein hormone encoded by the gene EPO, protects against diabetic neuropathy. Since patients with osteoporosis are less likely to have diabetic retinopathy, this association makes EPO a good candidate for a functional link with genes associated with osteoporosis.

Besides genetic factors, environmental factors (food, drugs, pathogens) are equally important in determining which complex diseases we are likely to get. Research on pathogen-disease relationships is attracting a lot of attention, e.g., the role of viruses in causing cancer. The analysis described here can also give information about environmental factors. We see that osteoporosis patients are less likely to have latent syphilis, a bacterial disease, and are also less likely to experience adverse effects from uric acid metabolism drugs.

The third step involves integrating this information via a network approach to construct a disease pathway. In this step, we make use of available molecular biological data, such as transcriptional links and protein-protein interaction links, to merge the genes/proteins into a disease module. I will post data here soon.
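One simple way to picture this merging step is to keep only the interaction links between candidate genes and group the genes into connected components, treating the largest component as a tentative disease module. The sketch below does this with a plain breadth-first search. It is only an illustration under that assumption, not the method used in this project; the function name and the gene labels (GENE_A, GENE_B, ...) are placeholders.

```python
from collections import defaultdict, deque

def disease_modules(seed_genes, interactions):
    """Group candidate genes into connected components.

    seed_genes:   iterable of candidate gene names.
    interactions: iterable of (gene, gene) links, e.g. protein-protein
                  or transcriptional links (hypothetical input format).
    Only links with both ends among the seeds are kept.
    Returns a list of gene sets, largest first.
    """
    seeds = set(seed_genes)
    neighbors = defaultdict(set)
    for u, v in interactions:
        if u in seeds and v in seeds:
            neighbors[u].add(v)
            neighbors[v].add(u)

    modules = []
    unvisited = set(seeds)
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        # Breadth-first search over links between seed genes.
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        unvisited -= component
        modules.append(component)
    return sorted(modules, key=len, reverse=True)

# Placeholder genes and links, for illustration only.
seeds = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
links = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_C", "GENE_X")]
print(disease_modules(seeds, links))
```

In this toy input, the link to GENE_X is ignored because GENE_X is not a candidate gene, and GENE_D ends up alone because it has no links to the other candidates.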

Data

I currently have access to anonymized Medicare data on 13,039,018 patients with over 34,000,000 hospital visits over the course of 4 years. The patients are all older than 65. This age range can be suitable or limiting, depending on the disease of interest.

Hospitals and insurance companies have large amounts of patient medical record data that could be invaluable for this type of study. Three factors are extremely important for extracting the best available information from the cumulative disease patterns that emerge in the population: length of the records (preferably from birth to death), coverage (records of every condition, disease, and medication), and number of patients (more patients means less noise). The recent efforts to organize health data (Google Health, Microsoft HealthVault) are very promising if the data is made available for research purposes.



For comments, please contact me via email at: natali_dot_gulbahce_at_gmail_dot_com