Know Thy Patient: A Novel Approach and Method for Patient Segmentation and Clustering Using Machine Learning to Develop Holistic, Patient-Centered Programs and Treatment Plans | Catalyst non-issue content – nejm.org
By integrating and analyzing metrics associated with barriers to health care access — social vulnerabilities, transportation barriers, lack of insurance coverage — within the clinical context, Parkland Health leaders will be able to better understand the community and patient population they serve in the Dallas–Fort Worth area.
Traditional disease-based clinical programs have been effective in managing and treating specific medical conditions, but often fail to holistically treat the whole person. This can result in negative patient experiences and suboptimal (particularly long-term) clinical outcomes. In part, this failure is because traditional programs may not align with or concurrently address the complex needs of patients, particularly with respect to barriers to health care access (e.g., social vulnerabilities, transportation barriers, lack of insurance coverage). These traditional programs may also fail to incorporate and facilitate stronger provider-to-patient and patient-to-patient connections and support. Instead of grouping patients by their primary disease or diagnosis (diabetes, hypertension, etc.), grouping patients into cohorts with other patients who have high degrees of similarity across clinical, personal, and behavioral characteristics can better facilitate the creation and successful implementation of programs aiming to improve health outcomes. Programs adopting cohort-similarity approaches can more readily incorporate a wide variety of patient-centered, whole-person approaches to care, such as integrated practice units, targeted digital programs, virtual and in-person support groups, and focused outreach and communication. As part of its new 5-year strategic plan to achieve a healthier community, the Dallas County Hospital District — f/k/a Parkland Health & Hospital System, now known as Parkland Health (Parkland) — partnered with the Parkland Center for Clinical Innovation to develop Know Thy Patient, a novel, advanced analytics process to group patients (by factors other than disease groups) to better understand the community and patient population Parkland serves. 
These advanced analytics provide insights that can improve health care access by supporting and informing a better design of clinical programs enabling new community partnerships, enhanced models for patient engagement, and expanded pathways for treatment via evolving digital strategies.
A critical step toward creating a holistic health care intervention plan is to understand and eliminate access barriers. Access for individuals and groups in health care is defined as “the timely use of personal health services to achieve the best possible health outcomes.”1 We define access here as a combination of three components:
Time to care facility. Socioeconomically vulnerable populations (including those in rural or peripheral areas) often fail to receive proper frequency of care due to longer travel times (and costs) to adequately equipped health care facilities (which often cluster around densely populated urban areas) and face access challenges to public transport and motorized vehicles.2–5
Individual and community characteristics. Disparities in accessing health care facilities for different ethnic and racial communities are well reported.2 Language and education, economic stability, and social and community context affect this access.1,3 Combining personal characteristics (e.g., age, gender, race) and social determinants (e.g., family structure, transportation, income) with utilization of clinics, the ED, and inpatient resources can provide a wider lens to understand access barriers.
Health insurance coverage. Health insurance is a crucial component for health care access and improved outcomes. For example, insured (non-elderly) adults and children have a higher likelihood of receiving proper frequency visits than their uninsured counterparts.6
Of note, we use the terms cluster (a noun) and clustering (a verb) throughout, as described:
Cluster: Refers to a group of individuals with similar patterns of health care utilization and access. Clusters are created by an unsupervised machine learning method called clustering. Clusters are not just physical or geographic groupings. Clusters are created using multiple factors that are listed in the Appendix. Demographic background, clinical interaction and utilization, and social determinants of their neighborhoods are confounding factors for the clusters.
Clustering: Refers to a method of unsupervised machine learning. We chose a clustering algorithm that is deterministic and hence repeatable. In other words, patients with similar access and utilization patterns are consistently clustered together. We provide a list of 53 metrics (Appendix) to a clustering algorithm, which outputs clusters of people with similar patterns.
To achieve a more access-centric patient population segmentation — that incorporates non–disease-specific patient information — we developed an approach with three distinct steps: (1) creating a holistic patient record and identifying key metrics for clustering, (2) clustering the patient population, and (3) characterizing the clusters and extracting insights to enhance access and quality of care.
We first created a clean (all metrics and their data types are well defined, missing values are imputed, and duplicates are removed), complete (no exclusion criteria), and holistic patient record (includes metrics beyond patient-specific clinical information, i.e., social determinants). Data identification, ingestion, integration, validation, and staging are crucial steps. Clinicians and data scientists with subject-matter expertise on social vulnerabilities of the community determined which metrics to include.
We initially included all relevant access-related metrics (approximately 300 metrics, Appendix). These metrics include, but are not limited to, patient demographics, clinic utilization, and social determinants. For example, we included multiple data elements to understand transportation-related access barriers (e.g., transit times to closest outpatient clinic/hospital, proximity to public transport). We aggregated 6 years of medical information from Parkland’s electronic medical record (more than 50 million encounters or records) to the patient level (i.e., one record per patient, for 630,289 unique patients). Then we enhanced the data with social determinants (i.e., neighborhood-level attributes) based on the most recent patient address. Zip code–level community attributes can relate to one’s health in subtle and important ways (e.g., health care facility transit times,5 housing stability,7 air quality8). The resulting, enriched patient records contained demographics, utilization, and insurance coverage information, along with social determinants of health (SDOH) data characterizing the effects of the neighborhood on personal health and health care access (Figure 1).
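The aggregation and enrichment described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the encounter rows, field names, zip codes, and SDOH values below are all hypothetical, and the real process covered roughly 300 metrics rather than the handful shown here.

```python
from collections import defaultdict

# Hypothetical encounter rows: (patient_id, encounter_type, zip_code, ISO date).
encounters = [
    ("p1", "outpatient", "75201", "2019-03-02"),
    ("p1", "ed", "75216", "2021-07-15"),
    ("p2", "inpatient", "75216", "2020-01-09"),
    ("p1", "outpatient", "75216", "2021-11-30"),
]

# Hypothetical zip-code-level social determinants (illustrative values only).
sdoh_by_zip = {
    "75201": {"median_income": 68000, "transit_min_to_clinic": 12},
    "75216": {"median_income": 31000, "transit_min_to_clinic": 41},
}


def build_patient_records(encounters, sdoh_by_zip):
    """Collapse encounter rows to one record per patient, then add SDOH."""
    counts = defaultdict(lambda: defaultdict(int))
    latest = {}  # patient_id -> (date, zip_code) of the most recent encounter
    for pid, etype, zip_code, date in encounters:
        counts[pid][etype] += 1
        # ISO-formatted dates sort correctly as plain strings.
        if pid not in latest or date > latest[pid][0]:
            latest[pid] = (date, zip_code)

    records = {}
    for pid, by_type in counts.items():
        zip_code = latest[pid][1]  # most recent address drives the SDOH join
        record = {"zip_code": zip_code}
        for etype in ("outpatient", "ed", "inpatient"):
            record[f"n_{etype}"] = by_type.get(etype, 0)
        record.update(sdoh_by_zip.get(zip_code, {}))
        records[pid] = record
    return records


patient_records = build_patient_records(encounters, sdoh_by_zip)
print(patient_records["p1"])
```

Joining on the most recent address mirrors the rule stated in the text; a patient whose zip code has no SDOH entry simply keeps the utilization fields, which is one of several reasonable ways to handle that gap.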
Next, we identified key metrics that brought unique information ― unrelated to diagnoses ― to the clustering step. From the 300 metrics, we applied feature reduction by (1) removing medical diagnosis–related metrics so the clustering is not impacted by any medical diagnosis directly, and (2) evaluating the multicollinearity relationship between metrics and removing highly multicollinear metrics (e.g., neighborhood unemployment rate versus prevalence of uninsured) to decrease information redundancy. This feature-reduction process resulted in the defining metrics ― 53 metrics used for clustering, which is the process of establishing cohorts of patients with similar demographic, utilization, SDOH, and insurance coverage metrics, but absent medical diagnosis information (Appendix). We later used the remaining 247 metrics (the descriptor metrics) to profile the cluster characteristics (described later in Step 3). We purposefully excluded medical diagnoses in the clustering to ensure that the segmentation was not guided by clinical diagnoses, but rather by access patterns and personal/community characteristics representing access barriers.
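One common way to carry out the second reduction step is to drop any metric that is highly correlated with a metric already kept. The sketch below assumes pairwise Pearson correlation and a cutoff of 0.9; the article does not state which multicollinearity measure or threshold was used, and the metric names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / (sx * sy)


def drop_collinear(columns, threshold=0.9):
    """Greedy filter: keep a metric only if |r| <= threshold with all kept ones."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical neighborhood-level metrics for five zip codes.
metrics = {
    "neighborhood_unemployment_rate": [4.0, 6.5, 9.0, 12.0, 15.5],
    "prevalence_uninsured": [10.0, 15.0, 21.0, 26.0, 34.0],
    "transit_min_to_clinic": [35.0, 12.0, 40.0, 8.0, 22.0],
}

kept = drop_collinear(metrics)
print(kept)
```

With these made-up values, the uninsured prevalence tracks the unemployment rate almost linearly and is dropped, while transit time is retained. The greedy order matters: whichever of two redundant metrics appears first is the one that survives.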
We first used the Uniform Manifold Approximation and Projection (UMAP) algorithm9 to reduce the dimensionality (53 dimensions projected to 2) to improve data visualization and interpretation. We next used HDBSCAN, a hierarchical, density-based clustering algorithm that is deterministic (hence reproducible). Use of both algorithms allowed us to assign a patient not included in the training data into a cohort; i.e., we could place a new Parkland patient into a cluster based on the patient’s demographics, SDOH, and initial interaction with Parkland.
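To show what density-based clustering on a 2-dimensional embedding looks like, the sketch below implements plain DBSCAN, a simpler relative of HDBSCAN, using only the standard library. It is a stand-in, not the method used in the article: HDBSCAN additionally builds a hierarchy over density levels and does not need a fixed radius. The points, `eps`, and `min_samples` are hypothetical.

```python
import math

NOISE = -1  # label for points that belong to no cluster (outliers)


def dbscan(points, eps, min_samples):
    """Plain DBSCAN. Returns one label per point; NOISE marks outliers."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Indices within eps of point i (the point itself is included).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, so keep expanding
        cluster_id += 1
    return labels


# Hypothetical 2-D embedding: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (9.0, 0.0),
]
labels = dbscan(points, eps=0.3, min_samples=3)
print(labels)
```

Two properties carry over to the article's setting: the same input in the same order always yields the same labels, and points in sparse regions are left unassigned rather than forced into a cluster. For real data, the open-source `umap-learn` and `hdbscan` Python packages provide `UMAP.transform` and `hdbscan.approximate_predict` for placing new points; whether the authors used those packages is not stated.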
After clustering, we reintroduced all medical diagnosis data and used both the defining and descriptor metrics to perform descriptive analyses to understand cluster profiles. We mapped out clusters to identify neighborhoods of focus, clinic utilization, and travel times to the closest clinic/primary hospital. We also profiled clusters for demographics, cost of care, medical complexity, health care engagement (whether virtual or in-person and including Covid-19 vaccination), and SDOH. Finally, we provided the data-driven insights revealed through clustering and characterization to Parkland executives to help their strategic decision-making and program/intervention planning. We discuss three examples of these exercises in the Applications section. As of July 2022, this work has not yet been implemented with actual patients.
Unique cluster sizes varied from 5.0% (G8) to 17.7% (G1) of the patient population (Figure 2).
Of note, G1, the biggest cluster with about one-fifth of the population, accounted for more than half of the visits (outpatient, inpatient, ED) (Appendix). We created a sample persona based on the characteristics for the G1 cluster (Figure 3); of course, many other personas could be developed, but this offers an example of how the data could be presented to assist the providers and patients in care delivery.
G8 was another high-utilizer group (5% of the population but the highest per-person visit volumes and 17.7% of in-person visits). Both G1 (Appendix) and G8 patients had a higher prevalence of hypertension and diabetes, approximately 40% of all mental health–related encounters, and a lower likelihood of ED or inpatient visits (Table 1).
This compares the clinical utilization, prevalence of chronic conditions, and cost of care characteristics of the Know Thy Patient (KTP) clusters. The top two highest numbers for each metric are highlighted in bold. Notes: Percentages are calculated from the total patient population, which adds up to 630,289 patients including Outliers. The total of the G1–G8 clusters is 512,425. In the Parkland Population Average column, an en dash (–) indicates not applicable. Source: Parkland Center for Clinical Innovation
Of note, G5 patients had a different access pattern (compared to G1 and G8), with a higher likelihood of accessing the system through the ED. G5 patients also had the lowest clinical complexity among all clusters when the Charlson Comorbidity Index (CCI) was used as a proxy (Appendix). Cost per visit for G5 patients was higher than the cost per patient per year because 74% of these patients had fewer than one visit per year. G6 and G7 clusters had similar encounter volumes to G5 but differed in their demographics (Table 1).
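The relationship between the two G5 cost figures follows from simple arithmetic: when a cluster averages fewer than one visit per patient per year, its cost per visit must exceed its cost per patient per year. The sketch below uses hypothetical totals and assumes both metrics are computed from the same total cost; the article does not give the exact definitions.

```python
# Hypothetical cluster totals over a 6-year window (illustrative numbers only).
patients = 1000
years = 6
total_visits = 3000        # i.e., 0.5 visits per patient per year on average
total_cost = 1_500_000.0   # dollars

visits_per_patient_year = total_visits / (patients * years)
cost_per_visit = total_cost / total_visits
cost_per_patient_year = total_cost / (patients * years)

# Identity: cost per patient-year = cost per visit x visits per patient-year.
print(visits_per_patient_year, cost_per_visit, cost_per_patient_year)
```

Here the cluster averages 0.5 visits per patient per year, so each visit costs $500 while the annual cost per patient is only $250.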
Parkland reported 10,227 nursery discharges in 2021; among the G2 patients, 82% were female and had the highest likelihood of an inpatient visit (Appendix). An average childbearing-age female in G2 had three times more OB/GYN visits than the same age group in other clusters. G2 patients also had the third highest per-patient visit count (after G8 and G1), the highest dollar amount for the average cost per visit, and one of the highest for average yearly cost. Table 1 provides a detailed comparison of all clusters.
Clusters G1 and G8 made up 22.7% of patients but accounted for 69.9% of all visits. Both groups were clinically complex (based on CCI), had a very high prevalence of hypertension (for G1 and G8 combined, 2.7 times higher than the Parkland population average) and diabetes (for G1 and G8 combined, 3 times higher than the Parkland population average), and utilized mental-behavioral health services 3 to 4 times more than the average Parkland patient. Their cost per patient per year was also 2 times higher. They were mostly married, Hispanic females ages 18–64 (38.4% between 18–40; 56.7% between 40–64) who lived in a concentrated southern Dallas corridor. Eighty-seven percent were obese, 39% had a history of smoking, and 15% already had a chronic kidney disease diagnosis. Enrollment of these patients into multiple disease-specific programs (diabetes or hypertension), while individually effective, resulted in significant outpatient utilization due to their complex, multidimensional clinical and personal needs, taxing both the health system and the patients.
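The concentration of utilization in G1 and G8 can be restated as a per-patient rate using only the two shares quoted above. This is a back-of-the-envelope calculation; "the rest" here means every patient outside G1 and G8, including the Outliers group, since the percentages are taken over the total population.

```python
patient_share = 0.227  # G1 + G8 share of patients (17.7% + 5.0%)
visit_share = 0.699    # G1 + G8 share of all visits

# Visits per patient in G1 + G8 relative to the overall average patient.
relative_to_average = visit_share / patient_share

# Visits per patient in G1 + G8 relative to patients outside those clusters.
rest_rate = (1 - visit_share) / (1 - patient_share)
relative_to_rest = relative_to_average / rest_rate

print(round(relative_to_average, 2), round(relative_to_rest, 2))
```

In other words, a G1 or G8 patient generated roughly three times the visits of an average patient and close to eight times the visits of a patient outside these two clusters.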
These insights drove a decision to explore the individuals/clusters who were cardio-metabolically high risk (CMHR) and individuals with both diabetes and hypertension diagnoses to design access sites and programs consolidating clinical expertise and diagnostics to meet these patients’ complex needs and better manage their health. We determined that an integrated practice unit (IPU) designed around CMHR could serve a cohort of 46,253 patients already receiving care from Parkland (Figure 4).
The CMHR IPU will be unique in design as (1) through the cluster profiles, we understand the patients’ access and utilization patterns, and (2) it is based on patients with a combination of two diseases, uniquely inferred from the clusters. Based on cohort characteristics (Figure 4), when planning this IPU, administrators are considering translation services (4 of 10 patients prefer Spanish), integrating smoking cessation programs and dietitians or nutritionists to address high prevalence of obesity and smoking, and designing on-demand in-person or telehealth services for mental-behavioral health. Certain outpatient clinics (black circles on Figure 4 map) receive more than 1 million encounters from this population, thus serving as natural candidates for possible CMHR IPU locations.
Like the clustering-based CMHR IPU, patient populations for single disease-based IPUs (e.g., colorectal and lung cancer, chronic heart failure, multiple sclerosis) currently undergo a detailed KTP analytics profiling to better understand their utilization patterns, clinical needs, demographics, location relative to clinical access points, and SDOH factors.
Design of new clinical pathways and access points often requires a balance of physical locations and services complemented by digital access options. To better understand patient populations who are well-positioned to benefit from (and participate in) virtual care, Parkland identified two distinct populations regarding care-model preferences during the height of the pandemic (March–September 2020): (1) those who received care only via telemedicine (48,246 unique individuals), and (2) those who insisted on continuing in-person visits (50,763 unique individuals). The purpose was to leverage unique KTP cohort profiling and look for insights into population differences to better identify and address inequities and access barriers to digital solutions. Looking at this data during Covid-19 (when arguably more individuals should be using telehealth) provided Parkland with important insights on potential implicit biases with respect to its telehealth offerings (Table 2).
This compares two user-segmented cohorts, in-person care versus virtual care, including patient-level and neighborhood-level information. Demographic, clinical utilization, and social determinants features are compared. Note: oo indicates higher prevalence than o. Abbreviations: CCI = Charlson Comorbidity Index; PHHS = Parkland Health; COPC = Community Oriented Primary Care. Source: Parkland Center for Clinical Innovation
A cohort demographics comparison showed that the in-person care cohort had more female patients (largely due to child deliveries, which also decreased the cohort’s median age), a higher percentage of Hispanic patients, and a lower percentage of Black patients. Contrary to popular belief, when comparing adults in both cohorts, we were surprised to find that virtual care patients had more complex chronic diseases (higher average CCI). Perhaps less surprisingly, the virtual care cohort had higher Covid-19 vaccination rates. Likewise, more virtual care patients had commercial insurance and Medicare coverage, and fewer had Medicaid and charity-compensated coverage. When we compared SDOH, the in-person care cohort lived in higher-vulnerability neighborhoods, had lower English literacy rates and education levels, and had median household incomes $2,000 less than the virtual care cohort (with more living below the poverty level). The virtual care cohort lived in neighborhoods where transit times to Parkland Hospital/clinics were on average longer than for the in-person cohort.
While Internet access was not significantly different between these two cohorts, we identified several considerations for the more equitable design/expansion of virtual engagement options: (1) transit time to the nearest physical clinical location is a major driver of telehealth adoption, and (2) age and clinical complexity are not barriers to digital adoption, i.e., complex, older patients engage equally (or more) in telehealth services. Also, personal preferences based on ethnic backgrounds and supportive translation services are critical in anticipating adoption of digital options and facilitating successful visits.
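Whether a characteristic differs significantly between two cohorts can be checked with a standard two-proportion z-test. The article does not say which tests were used, so the sketch below is one conventional choice. The cohort sizes come from the text; the counts of patients with the characteristic are hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


n_virtual, n_in_person = 48_246, 50_763  # cohort sizes reported in the article

# Hypothetical counts of patients with some attribute (about 30% vs. 22%).
x_virtual, x_in_person = 14_474, 11_168

z, p = two_proportion_z(x_virtual, n_virtual, x_in_person, n_in_person)
print(f"z = {z:.1f}, p = {p:.3g}")
```

With cohorts of this size, even small differences reach statistical significance, so the size of the difference matters more than the p-value when deciding whether a gap is operationally meaningful.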
The primary value from our analysis came from identifying that we had unintentionally offered virtual care less often to those with less favorable insurance coverage and more social needs. We are working to address these items with staff education. Parkland has also operationalized the capture of both patient and provider preferences for care visits as part of expanding its telemedicine offerings to better meet patient needs and capabilities.
A holistic health care database containing encounter-, patient-, and community-level information can provide — through multiple cohort comparison and hypothesis testing — data-driven insights to inform new clinical interventions that treat the whole person. Through incorporating clustering by non-disease states (e.g., individual/community characteristics) and integrating the clustering insights (e.g., access patterns) into medical records, health care systems can gain a more complete understanding of their patients to make care better, personalized, and more accessible. Patient engagement can also improve (e.g., new “Quick Clinics” in targeted locations or medically supervised virtual support groups ― potentially less costly and catered for each cluster’s needs and access patterns). Social networks have demonstrated that individuals who have a lot in common beyond medical conditions are exponentially more likely to create stronger bonds and sustained engagement with each other.10
The potential use and value of this novel Know Thy Patient advanced analytics and clustering approach is not limited to health care systems. For example, claims-based clustering can be applied to understand the access and utilization behaviors of insurance plan members (“Know Thy Member”) and inform targeted programs to eliminate their access barriers.
Know Thy Patient Data Clusters
We extend special thanks to our Senior Director of Grants Management, Elizabeth Powell; clinical director George “Holt” Oliver, MD; our executive directors Russell F. Lewis and Leslie Wainwright, PhD; and our stakeholders from Parkland Health: Brett Moran, MD, Joseph Longo, MBA, Roberto De La Cruz, MD, and Francesco Mainetti, MS.
Yusuf Talha Tamer, Albert Karam, Thomas Roderick, and Steve Miff have nothing to disclose.