By David Dastvar and Andrew Goldberg
This article describes a central challenge in creating the National Public Health Data System (NPHDS) and recommends replacing Personally Identifiable Information (PII) with a unique, persistent universal patient enumerator before the data is transferred to the central NPHDS repository. This approach enables the integration of health data on a fully interoperable but privacy-protected basis.
Introduction
During the pandemic, U.S. health data was exposed as residing in siloed, fragmented databases that hindered research, policymaking, and public guidance. The data exists in myriad formats and cannot readily be analyzed as unified, longitudinal patient journeys through the healthcare system, which impedes decision-making about pressing public health issues.
As a result, there is a newfound resolve among Federal decision-makers to integrate the nation’s health data, starting with Federal databases. The proposed National Public Health Data System (NPHDS) will be invaluable in providing a repository of connected, consistently formatted data sets. This strategy will equip policymakers and public health researchers with the representative and reliable data sets needed to make impactful public health decisions.
Challenge 1: Linking Datasets with High Accuracy
Linking siloed data sets typically involves a high error rate. In fact, most private sector data vendors offer data sets with estimated 3–5% false-positive and 9–42% false-negative match rates. Such poor accuracy is highly problematic. A high false-positive rate (records incorrectly matched) may lead policymakers and researchers to draw incorrect inferences from the data. Worse yet, a high false-negative rate (records incorrectly left unmatched) results in fragmented patient journeys and suboptimal knowledge generation.
Challenge 2: Protecting Patients’ Personal Information
Meeting the need for extensive longitudinal linking of personal data in a way that preserves privacy is critical. The patient identity resolution issue is especially acute for the health data in Federal databases, almost none of which contain unique patient identifiers such as Social Security numbers and therefore present a plethora of linkage challenges.
The Solution: Data Linkage with High Accuracy while Protecting PII
The technologies and methods recommended here for the NPHDS are characterized by false-positive and false-negative rates that are lower than typical industry rates by a factor of 10. At the same time, protection of patients’ PII and protected health information (PHI) is at the forefront of the process, and protection-related procedures and mechanisms are built into data acquisition, processing, storage, analysis, and sharing.
Protecting PII
One of the principal features of the recommended system is that the NPHDS does not directly receive any PII. To achieve this, the system uses a lightweight de-identification engine installed locally on each data owner’s server. Behind the data owner’s firewall, the engine removes all identifying information and replaces it with an encrypted hash of the original PII. The hash function is one-way, meaning that it is computationally infeasible to reverse the computation and recover the original PII values. Additionally, the solution relies on privacy and governance techniques that are consistent with a master HIPAA certification. Consequently, all health data leaving each data owner’s site would be considered de-identified under HIPAA prior to transmission. This method is certified as being in full accordance with HIPAA expert determination provisions.
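The local de-identification step can be sketched as follows. This is a minimal illustration rather than the engine's actual algorithm, which the article does not specify: the choice of HMAC-SHA256 as the keyed one-way hash, the normalization rules, the `SECRET_KEY` value, and the function names `normalize`, `tokenize_field`, and `deidentify` are all assumptions made for this example.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical secret key (assumption). A real deployment would load it from
# a key management system, never from source code.
SECRET_KEY = b"example-key-not-for-production"


def normalize(value: str) -> str:
    """Fold case, strip accents, and drop non-alphanumeric characters so that
    trivial formatting differences produce the same token."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def tokenize_field(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed one-way hash (HMAC-SHA256) of a normalized PII value."""
    digest = hmac.new(key, normalize(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def deidentify(record: dict, pii_fields: tuple) -> dict:
    """Replace each PII field with its token and leave other fields untouched."""
    return {
        name: tokenize_field(value) if name in pii_fields else value
        for name, value in record.items()
    }


record = {"first_name": "José", "last_name": "Smith", "diagnosis": "J45"}
safe_record = deidentify(record, pii_fields=("first_name", "last_name"))
```

Only `safe_record` would leave the data owner's site in this sketch. Using a keyed hash rather than a plain hash is a design choice assumed here: without the key, an outside party cannot rebuild tokens by hashing lists of common names.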
To further minimize risk during data acquisition, all patient hashes should be encrypted and sent to a centralized server for automated matching. This procedure protects each input system both from other input systems and from the NPHDS—enabling each data owner system to continue maintaining its own PII independently.
Stage 1 of Matching
We recommend using probabilistic matching. To begin, the initial encrypted PII data should be used to resolve a patient’s identity by assigning the correct patient identity from a central master database of identities. In most cases, the patient’s encrypted identifiers from a particular data set will be assigned an existing ID from the national database. However, a new ID may be assigned in the rare case that the system determines the individual is likely a new patient. If the system were to rely on purely deterministic matching, new patient identities would be created whenever there are slight variations in a patient’s identifying information (e.g., misspelled names or a new address), even though the data represents a single, existing patient.
The probabilistic matching employed by this solution, however, accounts for variations in patient identity by comparing several field values to determine an accurate match, resulting in the lowest false-positive and false-negative rates attainable.
To complete the first stage of matching, the probabilistic algorithm draws on multiple personal identifiers and accumulates the probability contributed by each piece of evidence, accounting for the uncertainty introduced by missing data, address changes, and ambiguities in the data.
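One common way to accumulate evidence across fields is a Fellegi-Sunter style log-likelihood weight, sketched below. The article does not name the specific algorithm used, so this is an illustrative assumption: the field names, the m and u probabilities in `FIELD_PARAMS`, the `prior` value, and the function names are all hypothetical. Field values would be hashed tokens in practice, which is why agreement is tested by exact equality.

```python
import math

# Hypothetical per-field parameters (assumptions for illustration):
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_PARAMS = {
    "first_name": (0.95, 0.01),
    "last_name": (0.97, 0.005),
    "dob": (0.99, 0.001),
    "zip3": (0.85, 0.02),
}


def match_weight(rec_a: dict, rec_b: dict, params: dict = FIELD_PARAMS) -> float:
    """Sum the log2 likelihood ratios of agreement or disagreement per field."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def match_probability(weight: float, prior: float = 1e-6) -> float:
    """Convert a weight into a posterior match probability using prior odds."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


patient = {"first_name": "t1", "last_name": "t2", "dob": "t3", "zip3": "t4"}
moved = dict(patient, zip3="t9")  # same person after an address change
score = match_probability(match_weight(patient, moved))
```

In this sketch a changed ZIP prefix lowers the total weight but does not by itself force a non-match, whereas a deterministic rule requiring equality on every field would create a new identity.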
Figure 1 depicts major NPHDS components, including engines for data de-identification and transfer, as well as for resolving patient identities through the described stages of matching.
Stage 2 of Matching
In the final stage of matching, machine learning should be used to account for additional patterns in the data. These models can adjust matching probabilities and match cases based on further information that can be used to build the model, such as:
- The frequency of each name, location, and phone number
- Relocation patterns (e.g., patients moving between partially masked, three-digit zip codes such as from Long Island, NY, to West Palm Beach, FL)
- Rates of typos, swaps, and other errors in data entry
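As a simplified stand-in for the first item in the list above, the sketch below weights agreement on a rare value more heavily than agreement on a common one. It is a frequency-adjusted weight, not a machine learning model, and it is not the article's actual Stage 2 method. The token values, the `m` and `floor` defaults, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def value_frequencies(values: list) -> dict:
    """Estimate how often each (hashed) value occurs in the identity database."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def frequency_adjusted_weight(
    value: str, freqs: dict, m: float = 0.95, floor: float = 1e-6
) -> float:
    """Agreement weight in which the chance-agreement probability u is the
    observed frequency of the value, so rarer values carry more evidence."""
    u = max(freqs.get(value, 0.0), floor)  # floor avoids division by zero
    return math.log2(m / u)


# Hypothetical last-name tokens: one very common value and one rare value.
tokens = ["tok_common"] * 50 + ["tok_rare"] + ["tok_other"] * 49
freqs = value_frequencies(tokens)
common_weight = frequency_adjusted_weight("tok_common", freqs)
rare_weight = frequency_adjusted_weight("tok_rare", freqs)
```

Here `rare_weight` exceeds `common_weight`, reflecting that two records sharing an uncommon name are more likely to describe the same person. A trained model could combine such features with relocation patterns and data-entry error rates.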
Moving Forward with the NPHDS Data Linkage System
The COVID-19 pandemic underscored the need for a centralized public health data repository—the NPHDS—that can be used for fast and accurate analyses to support public health decision-making. However, to comply with HIPAA and other privacy protections, patients’ PII should not be transferred across data systems. Instead, patient data can be encoded and encrypted before transfer to the central system and matching can proceed on the hashed data.
We recommend using probabilistic rather than deterministic matching when attempting to link individuals across data sets. A confidence score can be computed to indicate whether two slightly different records represent the same person. Final categorization with machine learning models can use the confidence score and other patterns in the data to assign matching status. These are best practices for linking individuals’ data across disparate data systems, even beyond the healthcare realm, when unique identifiers are not available.
About the Authors
David Dastvar serves as Chief Growth Officer at Eagle Technologies. He has 29 years of experience developing and managing enterprise-level professional services and solutions for Government agencies, health IT, public sector, and Fortune 1000 companies. He can be reached at (202) 497-8848 or david.dastvar@eagletechva.com.
Andrew Goldberg is Co-Founder and COO at HealthVerity, Inc. He has more than 30 years of cross-sector experience focused on creating network effects through data connectivity. In addition, Mr. Goldberg has led corporate growth and development across various IT communications and cloud solutions companies.
About the Companies
Eagle Technologies, Inc., is a Technical and Systems Integrator that delivers effective health IT and grants management solutions transcending the complex requirements of Government and business clients nationwide. Eagle leverages its depth of experience delivering end-to-end solutions that rely on advanced cybersecurity and privacy systems, enterprise architecture, cloud-based services, mobile computing, and business intelligence services.
HealthVerity, Inc.: Pharmaceutical manufacturers, payers, and Government organizations have partnered with HealthVerity to solve some of their most complicated use cases through transformative technologies and real-world data (RWD) infrastructure. The HealthVerity IPGE platform, based on the foundational elements of Identity, Privacy, Governance, and Exchange, enables the discovery of RWD across the broadest healthcare data ecosystem, the building of more complete and accurate patient journeys, and the ability to power best-in-class analytics and applications with flexibility and ease.