When evaluating any RWD provider, I assess five dimensions. No single provider scores perfectly on all five. The goal is to find the best match for your specific question, patient population, and governance requirements.
1
Population Coverage and Representativeness
The first question is not how many patients are in the database — it is whether your patients are in the database.
Key questions to ask any provider: What is the payer mix of your covered population (commercial, Medicare, Medicaid, uninsured)? What is the geographic distribution, and are certain states or regions underrepresented? What is the time lag between care delivery and data availability? How are specialty drug patients captured? Many commercial claims databases have significant gaps in specialty pharmacy data.
For rare disease, this dimension is often disqualifying. A condition affecting one in 50,000 people implies roughly 6,000 patients among 300 million commercially insured lives, and once you require a confirmed diagnosis, a database of that size may contain only a few hundred usable patients, which is not enough for meaningful analysis. In these cases, disease-specific registries, EHR networks with deep specialty coverage, or diagnostic lab data are often the only viable options.
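The arithmetic behind the rare disease example is worth running before any provider conversation. The sketch below is illustrative only: the 300 million lives and one-in-50,000 prevalence come from the example above, while the retention fractions at each attrition step are hypothetical placeholders, not measured rates.

```python
def expected_patients(covered_lives: int, prevalence_one_in: int) -> float:
    """Expected number of prevalent patients in a covered population."""
    return covered_lives / prevalence_one_in


def analyzable_cohort(expected: float, retention_rates: list[float]) -> float:
    """Apply successive attrition steps, each given as the fraction retained."""
    remaining = expected
    for rate in retention_rates:
        remaining *= rate
    return remaining


# Figures from the example in the text
expected = expected_patients(covered_lives=300_000_000, prevalence_one_in=50_000)

# Hypothetical retention fractions (assumptions for illustration only):
# has a coded diagnosis, diagnosis confirmed, enough follow-up for analysis
cohort = analyzable_cohort(expected, [0.5, 0.4, 0.3])

print(f"Expected prevalent patients: {expected:,.0f}")
print(f"Analyzable cohort after attrition: {cohort:,.0f}")
```

Replace the retention fractions with whatever the provider can document for your indication; the point is that the usable cohort is typically a small fraction of the theoretical one.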
2
Data Type and Clinical Depth
Different questions require different data. The major categories:
Administrative claims capture billing events — diagnoses, procedures, pharmacy fills, and provider encounters. Claims are longitudinal, comprehensive for covered services, and strong for treatment pattern analysis, market share measurement, and adherence tracking. Their limitation is clinical thinness: claims tell you what happened but rarely why, and they do not capture lab values, imaging findings, clinical notes, or disease severity.
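Adherence tracking from pharmacy claims is usually expressed as proportion of days covered (PDC). The following is a minimal sketch, assuming fills arrive as (fill date, days supply) pairs; the function name and input shape are my own, and this simple version counts each calendar day once rather than shifting overlapping supply forward, which some PDC definitions do.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, period_start, period_end):
    """Return the share of days in the period with drug on hand.

    fills: list of (fill_date, days_supply) tuples from pharmacy claims.
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            # Only count days that fall inside the measurement period
            if period_start <= day <= period_end:
                covered.add(day)
    total_days = (period_end - period_start).days + 1
    return len(covered) / total_days


# Hypothetical patient: three 30-day fills with gaps between them
fills = [
    (date(2023, 1, 1), 30),
    (date(2023, 2, 5), 30),
    (date(2023, 3, 20), 30),
]
pdc = proportion_of_days_covered(fills, date(2023, 1, 1), date(2023, 3, 31))
print(f"PDC: {pdc:.2f}")
```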
Electronic health records (EHR) provide richer clinical detail — lab values, vital signs, physician notes, biomarkers, disease severity measures. This depth is essential for outcomes research and studies requiring granular patient characterization. The limitation is fragmentation: a patient treated across multiple health systems appears as multiple incomplete records unless those records are linked.
Specialty pharmacy data captures dispensing and refill records for specialty drugs — the most accurate source for specialty drug adherence, persistence, and real-world dosing patterns. This data is proprietary to the manufacturer and typically requires a hub or specialty pharmacy aggregation arrangement.
Patient registries are purpose-built for specific diseases, collecting standardized clinical data prospectively. Registry data has the highest clinical quality but the smallest population size. For rare diseases, well-curated registries (the Immune Deficiency Foundation registry, the NHF Bleeding Disorders registry, HAE patient databases) are often the only credible source for outcomes research.
Diagnostic and laboratory data is the most underappreciated data type in pharma RWD strategy — and the one that diagnostic companies like Quest Diagnostics and LabCorp have quietly built into significant data businesses. Lab data contains something neither claims nor EHR reliably provides: actual measured biomarker values. Creatinine trends, HbA1c trajectories, eGFR progression, PSA levels, CBC results, genetic panel findings — this is the clinical signal that drives treatment decisions and defines patient eligibility for many therapies. Lab data is also a powerful patient identification engine: an abnormal result often precedes an ICD-10 diagnosis code by months or years.
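To make the patient identification point concrete, here is a rough sketch of flagging patients whose lab values suggest chronic kidney disease before any diagnosis code appears. The record layout, function name, and sample data are hypothetical. The rule (eGFR below 60 on results at least 90 days apart, with no ICD-10 N18 code) is a simplification for illustration and ignores intervening normal results.

```python
from datetime import date


def flag_undiagnosed_ckd(lab_results, diagnosis_codes, threshold=60.0, min_gap_days=90):
    """Return patient IDs with sustained low eGFR and no CKD diagnosis code.

    lab_results: list of (patient_id, result_date, egfr_value)
    diagnosis_codes: dict mapping patient_id -> set of ICD-10 codes
    """
    low_dates = {}
    for patient_id, result_date, egfr in lab_results:
        if egfr < threshold:
            low_dates.setdefault(patient_id, []).append(result_date)

    flagged = []
    for patient_id, dates in low_dates.items():
        # Sustained: earliest and latest low results are far enough apart
        sustained = (max(dates) - min(dates)).days >= min_gap_days
        codes = diagnosis_codes.get(patient_id, set())
        has_ckd_code = any(code.startswith("N18") for code in codes)
        if sustained and not has_ckd_code:
            flagged.append(patient_id)
    return sorted(flagged)


# Hypothetical sample data
labs = [
    ("A", date(2023, 1, 10), 52.0),
    ("A", date(2023, 5, 10), 49.0),   # sustained low, no CKD code
    ("B", date(2023, 1, 10), 45.0),
    ("B", date(2023, 6, 1), 44.0),    # sustained low, already coded
    ("C", date(2023, 3, 1), 55.0),    # single low result only
]
diagnoses = {"A": {"E11.9"}, "B": {"N18.3"}, "C": set()}

candidates = flag_undiagnosed_ckd(labs, diagnoses)
print(candidates)
```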
Linked datasets combine multiple data types — claims linked to EHR, EHR linked to genomics, claims linked to specialty pharmacy — to produce a more complete longitudinal record. HealthVerity and Datavant specialize in this linkage infrastructure. Linked data is more powerful but more expensive, and the linkage methodology requires careful scrutiny.
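When scrutinizing a linkage methodology, it helps to have a mental model of how token-based matching works in general. The sketch below is a generic illustration using a keyed hash over normalized identifiers. It is not the method used by HealthVerity, Datavant, or any other vendor, and the key, field choices, and function names are assumptions.

```python
import hashlib
import hmac

# Hypothetical shared secret; real deployments manage keys far more carefully
SECRET_KEY = b"hypothetical-shared-secret"


def make_token(first_name, last_name, dob, sex, key=SECRET_KEY):
    """Build a de-identified token from normalized identifying fields."""
    normalized = "|".join([
        first_name.strip().lower(),
        last_name.strip().lower(),
        dob,                      # assumed already in YYYY-MM-DD form
        sex.strip().upper(),
    ])
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link(claims_rows, ehr_rows):
    """Join two de-identified datasets on their shared token."""
    ehr_by_token = {row["token"]: row for row in ehr_rows}
    return [
        {**claim, **ehr_by_token[claim["token"]]}
        for claim in claims_rows
        if claim["token"] in ehr_by_token
    ]


# Each source tokenizes before sharing; only tokens and clinical fields move
claims = [{"token": make_token(" Ana", "Lopez", "1980-02-03", "f"), "drug": "X"}]
ehr = [
    {"token": make_token("ana", "LOPEZ ", "1980-02-03", "F"), "egfr": 52.0},
    {"token": make_token("Ben", "Ito", "1975-07-21", "M"), "egfr": 88.0},
]
linked = link(claims, ehr)
print(len(linked))
```

Exact-match tokens like this miss patients whose names or birth dates are recorded inconsistently, which is one reason to ask a vendor about match rates and false-match rates.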
3
Therapeutic Area and Indication Fit
Every major RWD provider has therapeutic area strengths that are not widely advertised — and weaknesses that are even less so.
Flatiron Health has unmatched depth in structured oncology EHR data, built from a network of community oncology practices. For oncology outcomes research — real-world progression-free survival, treatment sequencing, biomarker-outcome associations — Flatiron is the standard. For any non-oncology indication, it is the wrong choice.
Komodo Health has invested heavily in rare disease patient identification and longitudinal journey mapping, using its Healthcare Map to trace fragmented patient records across care settings. For rare disease market sizing, patient finding, and undiagnosed patient identification, Komodo is one of the strongest options.
Optum Life Sciences brings a unique payer perspective — its data derives from UnitedHealth Group's claims, giving it exceptional visibility into cost, utilization, and access dynamics. For market access analytics, HEOR studies, and budget impact modeling, Optum's payer-linked data is a significant advantage.
IQVIA has the broadest geographic coverage globally, the deepest prescription data (NPA and DDD datasets cover virtually all dispensed prescriptions in the US), and the most comprehensive commercial analytics suite. For market sizing, competitive share analysis, and prescription trend monitoring, IQVIA is the most complete single-source solution — at a price to match.
Symphony Health (ICON) offers strong prescription and sales data with broad channel coverage — retail, institutional, and integrated delivery networks. For commercial analytics teams tracking market share and competitive dynamics, Symphony is a credible IQVIA alternative.
HealthVerity specializes in data linkage — connecting claims, EHR, specialty pharmacy, lab, and other sources using privacy-preserving identity resolution. When no single source is sufficient and you need a linked longitudinal view of the patient journey, HealthVerity is the platform of choice.
TriNetX provides a federated network of real-time EHR data from academic medical centers and health systems, with a particular strength in clinical trial feasibility — identifying eligible patient populations at specific sites before a trial begins.
Definitive Healthcare focuses on provider and facility intelligence — who practices where, what procedures they perform, what their referral networks look like. For site-of-care analytics, physician targeting, and commercial field force optimization, Definitive is the specialist.
PurpleLab is one of the fastest-growing mid-tier claims platforms, built around its HealthNexus no-code analytics environment. It aggregates over 50 billion medical and pharmaceutical claims covering more than 330 million patient lives and 98% of US payers. PurpleLab's strongest differentiator is accessibility — its no-code interface allows teams to move from question to insight in minutes. A 2025 acquisition of KAID Health added AI-powered NLP on clinical notes to its structured claims foundation. Best fit for commercial pharma teams that need rapid, self-service analytics without deep data engineering resources.
Truveta is purpose-built for regulatory-grade RWE, founded as a consortium of major US health systems contributing daily-updated EHR data to a shared research platform. Truveta Data covers EHR data from more than 120 million patients linked with closed claims for more than 200 million patients across 100+ payers. Its Truveta Language Model normalizes unstructured clinical notes, lab values, and imaging data at scale. For regulatory-grade HEOR, safety monitoring, and comparative effectiveness work that requires daily-updated data with provenance documented for FDA review, Truveta is the right choice.
Forian occupies a distinctive position through its CHRONOS hybrid ecosystem, which combines open claims, closed claims, EHR, remittance data, and social determinants of health (SDoH) covering over 350 million de-identified patients since 2015. Open claims offer longer patient follow-up, less data lag, and more frequent refresh than closed claims alone — making Forian especially valuable for tracking patients across payer transitions. Its SDoH variables (race/ethnicity, income, education, occupation) expand the analytical frame beyond what claims or EHR alone can provide.
Quest Diagnostics has built a systematic data business — Quest Data Insights — on one of the largest clinical laboratory networks in the United States. Its de-identified lab data covers hundreds of millions of tests annually and can be merged with data from Datavant, HealthVerity, Komodo, and Symphony Health. Quest's geographic breadth captures results from patients seen at independent physician offices and retail clinics that do not appear in hospital-centric EHR networks. Best for oncology biomarker research, rare disease patient identification, and studies requiring actual measured lab values rather than diagnosis codes as proxies.
LabCorp is Quest's closest peer — a national laboratory network with over 45 billion annual test results across 6,500 diagnostic assays and 160 million patient encounters. LabCorp's acquisition of Invitae adds genetic and variant data for studies that connect genotype to phenotype to treatment outcome. Its strategic investment in HealthVerity enables lab data to be linked with a broader RWD ecosystem. For pharma teams in precision medicine, oncology, cardiovascular, and rare genetic disease, LabCorp's combination of diagnostic depth, genomics, and external linkage capability is difficult to match from any other single provider.
4
Data Governance, Compliance, and Provenance
This dimension is non-negotiable for organizations in regulated environments, yet it is the one most frequently skipped.
HIPAA Safe Harbor and Expert Determination are the two accepted de-identification methods. Understand which one your provider uses and what the re-identification risk assessment shows, especially when combining datasets. Ask about data provenance: some providers are transparent about their source networks; others keep sources proprietary. Opaque provenance makes it impossible to assess representativeness or systematic bias. Review contractual use restrictions carefully: most licenses prohibit individual patient identification, and most also prohibit use in regulatory submissions without additional validation. For regulatory-grade RWD, the FDA's fit-for-purpose assessment framework is explicit about what is required, and not all commercial providers have validated their data to that standard.
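One way to see why combining datasets raises re-identification risk is to count how many records share each combination of quasi-identifiers before and after a new field is added. This is a simplified, k-anonymity-style check with hypothetical records and field names. It is one input a reviewer might look at, not a substitute for a formal Expert Determination.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical linked records; the threshold k=2 is only for this tiny sample
records = [
    {"zip3": "021", "birth_year": 1980, "dx": "E11"},
    {"zip3": "021", "birth_year": 1980, "dx": "E11"},
    {"zip3": "021", "birth_year": 1980, "dx": "D84.1"},
    {"zip3": "100", "birth_year": 1975, "dx": "E11"},
    {"zip3": "100", "birth_year": 1975, "dx": "E11"},
]

# Demographics alone: every combination is shared by at least two records
before = small_cells(records, ["zip3", "birth_year"], k=2)

# Adding a rare diagnosis from a linked source isolates a single record
after = small_cells(records, ["zip3", "birth_year", "dx"], k=2)

print(before)
print(after)
```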
5
Access Model and Total Cost of Ownership
Annual enterprise subscriptions can run from $200,000 to several million dollars and are frequently underutilized. For defined, project-specific questions, a targeted cohort extraction license is usually more cost-effective. Beyond the data fee, account for analytical infrastructure: raw RWD requires substantial processing before it is analytically useful, and some providers deliver analytics-ready extracts while others deliver raw files that take significant data engineering capability to use. Finally, understand actual delivery timelines: for time-pressured decisions, the difference between two weeks and six weeks matters.
The Symetrique Perspective
No single dimension should dominate the evaluation in isolation. We have seen organizations over-index on cost and discover governance gaps after procurement — and others over-invest in clinical depth for a question that claims data alone could have answered in weeks. Assess all five in sequence, weighted to the specific analytical question at hand.
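A simple weighted scorecard is one way to put this into practice. The sketch below is illustrative: the provider names, 1-to-5 scores, and weights are all made up, and treating governance as a minimum floor rather than just another weight is my own modeling choice, reflecting the point that it is non-negotiable in regulated environments.

```python
DIMENSIONS = ["coverage", "data_depth", "therapeutic_fit", "governance", "access_cost"]


def weighted_score(scores, weights):
    """Weighted average of the five dimension scores."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def rank_providers(provider_scores, weights, governance_floor=3):
    """Drop providers below the governance floor, then rank the rest."""
    eligible = {
        name: scores
        for name, scores in provider_scores.items()
        if scores["governance"] >= governance_floor
    }
    ranked = [(name, round(weighted_score(s, weights), 2)) for name, s in eligible.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical weights for a rare disease patient-finding question
weights = {"coverage": 3, "data_depth": 2, "therapeutic_fit": 3, "governance": 1, "access_cost": 1}

# Hypothetical scores on a 1-to-5 scale
providers = {
    "Provider A": {"coverage": 4, "data_depth": 3, "therapeutic_fit": 5, "governance": 4, "access_cost": 2},
    "Provider B": {"coverage": 5, "data_depth": 4, "therapeutic_fit": 3, "governance": 2, "access_cost": 4},
    "Provider C": {"coverage": 3, "data_depth": 5, "therapeutic_fit": 3, "governance": 5, "access_cost": 3},
}

ranking = rank_providers(providers, weights)
print(ranking)
```

Changing the weights for a different analytical question will reorder the list, which is the point: the ranking belongs to the question, not to the provider.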
Madhav Kathikar — Founder, Symetrique
#RWDEvaluation #DataGovernance #ClinicalData
#HIPAA #PharmaRWD #Symetrique