Partially Synthetic Text Generation for Sensitive Clinical Notes

Privacy, Utility and Readability Oriented Clinical Notes De-identification Framework


  • Working on the project – Partially Synthetic Text Generation for Sensitive Clinical Notes.
  • Implemented the DataSifter Text algorithm based on pre-trained language models (e.g., BERT) to obfuscate patient-identifying information in doctors’ notes within the large healthcare dataset MIMIC-III while maintaining utility and human readability.

Electronic health records (EHRs) are a valuable resource that could potentially be used in large-scale medical research. In addition to structured medical data, EHRs contain free-text patient notes that are a rich source of information. However, due to privacy and data protection laws, medical records can only be shared and used for research if they are sanitized so that they do not include information potentially identifying patients. The protected health information (PHI) that may not be shared includes potentially identifying information such as names, geographic identifiers, dates, and account numbers; the Health Insurance Portability and Accountability Act (HIPAA, 1996) defines 18 categories of PHI.

De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a classifier that works well across many types of medical text poses a challenge as privacy laws prohibit the sharing of raw medical records.

In this work, we propose a clinical notes de-identification framework focusing on the detailed utility we want to preserve for the data user, the privacy we want to protect on behalf of the data provider, and the readability required by both sides. Specifically, we argue that utility and privacy are properties attached to the clinical notes, and the framework balances them by adversarially detecting the properties we protect and classifying the properties we preserve.
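As a minimal sketch of how such a balance might be expressed, the training objective could combine a utility-classification loss with an adversarial privacy-detection penalty. The function names, trade-off weight, and toy probabilities below are illustrative assumptions, not the framework's actual implementation:

```python
import math

def cross_entropy(prob_true_class):
    """Negative log-likelihood of the correct class."""
    return -math.log(prob_true_class)

def combined_loss(utility_prob, privacy_adv_prob, lam=0.5):
    """Toy objective: keep utility labels predictable while making
    protected (privacy) properties hard for an adversary to detect.

    utility_prob     -- classifier's probability for the true utility label
    privacy_adv_prob -- adversary's probability for the true protected attribute
    lam              -- trade-off weight (illustrative assumption)
    """
    utility_loss = cross_entropy(utility_prob)       # minimize: preserve utility
    privacy_loss = cross_entropy(privacy_adv_prob)   # maximize: hide protected info
    return utility_loss - lam * privacy_loss

# A note whose utility label stays predictable while the adversary is left
# guessing about the protected attribute scores a lower combined loss.
good = combined_loss(utility_prob=0.9, privacy_adv_prob=0.5)
bad = combined_loss(utility_prob=0.6, privacy_adv_prob=0.95)
```

In this toy formulation, `good < bad`: the framework prefers notes that remain useful for the downstream task while confusing the privacy adversary.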

Overall, the main contributions of our work are two-fold:

  • We propose a new clinical notes de-identification framework to boost sensitive clinical data sharing among agencies.
  • We explore the possibilities of specifying different properties to protect privacy, preserve utility and improve readability as a generalization of de-identification tasks.

In this work, we are using two datasets:

  • CDC

    • In 2019, the Centers for Disease Control and Prevention (CDC) National Institute for Occupational Safety and Health (NIOSH) launched a text classification challenge to automatically classify injury records based on text description.
  • MIMIC-III

    • The Medical Information Mart for Intensive Care III (MIMIC-III) database contains electronic health records from a large tertiary care hospital, collected between 2001 and 2012.

The overall framework structure is shown below and is still a work in progress:


  • Original text is masked based on critical-word lists and probability tables generated via transfer learning.
  • Masked text is imputed by pre-trained language models (e.g., BERT and RoBERTa) in the Generative Adversarial Network (GAN) structure.
    • In future work, we consider using Generative Pre-trained Transformer 3 (GPT-3), which has shown impressive results in the synthetic generation of human-like text.
    • Recent work shows that for domains with sizeable unlabeled text, such as biomedicine, pretraining language models from scratch yields substantial gains over continual pretraining of general-domain language models. We therefore plan to replace the previous language models with PubMedBERT, which is pretrained from scratch on abstracts from PubMed and full-text articles from PubMed Central.
    • The GAN structure is used to generate better text and improve human readability: the discriminator is a language model asked to distinguish imputed from actual clinical notes.
  • Different losses are used to update the GAN structure so that the generated text better balances privacy and utility.
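The masking step above can be sketched as follows. The word list, masking probabilities, and mask token here are toy assumptions for demonstration; in the framework they would come from the critical-word lists and probability tables produced by transfer learning:

```python
import random

# Hypothetical probability table: critical word -> probability of masking.
# These values are illustrative, not the framework's learned tables.
MASK_PROBS = {"john": 1.0, "boston": 1.0, "diabetes": 0.2}
MASK_TOKEN = "[MASK]"  # the token a BERT-style model would later impute

def mask_text(text, mask_probs=MASK_PROBS, rng=None):
    """Replace critical words with MASK_TOKEN according to the probability table.

    Masked positions would later be imputed by a pre-trained masked language
    model (e.g., BERT or RoBERTa) acting as the GAN generator.
    """
    rng = rng or random.Random(0)  # seeded for a reproducible demo
    out = []
    for tok in text.split():
        key = tok.lower().strip(".,;")
        if key in mask_probs and rng.random() < mask_probs[key]:
            out.append(MASK_TOKEN)
        else:
            out.append(tok)
    return " ".join(out)

masked = mask_text("John was admitted in Boston with diabetes.")
```

With the seeded RNG, the identifying names are always masked (probability 1.0), while the clinically useful term "diabetes" is usually kept (probability 0.2), reflecting the privacy-versus-utility trade-off described above.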

Under construction…🚧