PyLogik for De-Id’ing Medical Image Data | by Adrienne Kline | Jun, 2023


An open-source state-of-the-art medical image de-identification tool

Adrienne Kline

Towards Data Science

Image by author

Repositories of data are now one of our most valuable commodities. Information as a commodity is not a new concept, but our 21st-century world looks much different than it did previously. The AI race is on, and in lockstep is the development of the tools and resources needed to facilitate it. Pooling information to create inferences that are robust, generalizable, and useful in day-to-day practice is far more difficult than it sounds. This is particularly true of medical data, which suffers from a myriad of difficulties when it comes to extracting value in aggregate and developing algorithms. In addition to being noisy, medical data is tightly regulated by the institutions and care centers that act as data stewards (and for good reason), as it contains personal health information (PHI).

PHI, if leaked, could have harmful effects on the privacy of individuals. These could range from simple embarrassment to discrimination in the workplace or by insurance companies to identity theft. Therefore, when researchers and institutions agree to pool information to create repositories, there must be a data usage agreement and tools to de-identify the data as much as possible. PHI can take several forms. Direct identifiers include healthcare numbers, SSNs, and birthdates; names, although not technically unique, are treated as direct identifiers too. There are also quasi-identifiers, such as the date the image was collected. Further, when training a machine learning algorithm, these identifiers are nuisance information that we do not, in fact, want the model to learn. Thus, removing them is necessary for multiple reasons.

There have been numerous attempts to perform medical image de-identification. Unfortunately, the proposed solutions have had limited success, are operating-system specific, or are available only for a fee through proprietary vendors or researchers [1–4]. The other issue I noticed when reading through these is that researchers sought to mask or remove ONLY identifying text while keeping helpful scientific text intact. While this is a noble goal, it is a REALLY difficult problem, because PHI can take varied formats depending on the equipment vendor and the hospital system. We want to keep scientific information, but in an image stack this information is repeated in every frame, where it consumes redundant space. If we can extract it once, we can save storage space! We can therefore think of these as our design constraints. Let’s solve the problem with all of them in mind simultaneously.

We employ machine learning in the form of a convolutional recurrent neural network (CRNN). The first task in the pipeline is to identify, extract, and mask text-based data found in the image arrays. This process uses PyTorch as the framework for text detection. The text recognition model is a CRNN trained on the IC13, IIIT5k, and SVT datasets. The model comprises three key components:

a) Feature extraction, achieved through a combination of ResNet (a convolutional neural network) and the visual geometry group (VGG) neural network. This is responsible for detecting features that look like letters.

b) Sequence labeling, accomplished using a long short-term memory (LSTM) network. The recurrent part is important because it ensures that adjacent text-like features in the image are grouped together, meaning they form coherent word(s).

c) Decoding, carried out by connectionist temporal classification (CTC), which performs the optical character recognition (OCR) step. It transcribes the detected word(s) into letters based on an English lexicon.
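
To make the decoding step in (c) more concrete, here is a minimal sketch of greedy CTC decoding in NumPy. It is not PyLogik's code: the alphabet, the blank index, and the function names are assumptions chosen for illustration. The idea is that the network emits one prediction per timestep, and decoding collapses repeated symbols and drops the blank symbol.

```python
import numpy as np

# Assumption for this sketch: index 0 is the CTC "blank", shown here as "-".
BLANK = 0
ALPHABET = "-abcdefghijklmnopqrstuvwxyz0123456789"


def ctc_greedy_decode(scores: np.ndarray, alphabet: str = ALPHABET) -> str:
    """Collapse per-timestep scores of shape (timesteps, classes) into text."""
    best = scores.argmax(axis=1)  # most likely symbol at each timestep
    chars = []
    previous = BLANK
    for idx in best:
        # Emit a character only if it is not blank and not a repeat
        # of the symbol predicted at the previous timestep.
        if idx != BLANK and idx != previous:
            chars.append(alphabet[idx])
        previous = idx
    return "".join(chars)


def one_hot_path(symbols: str, alphabet: str = ALPHABET) -> np.ndarray:
    """Hypothetical helper: build toy scores from a string of per-frame symbols."""
    indices = [alphabet.index(s) for s in symbols]
    scores = np.zeros((len(indices), len(alphabet)))
    scores[np.arange(len(indices)), indices] = 1.0
    return scores


# Nine timesteps predicting "h h - r r - - 7 7" collapse to three characters.
decoded = ctc_greedy_decode(one_hot_path("hh-rr--77"))
print(decoded)
```

A blank between two identical symbols is what lets CTC output doubled letters: the frames "l-l" decode to "ll", whereas "ll" alone decodes to "l".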

Data cleaning and anonymization begin by loading the data, which may be a DICOM file, a JPEG stack, or a video file, and subjecting it to text detection and removal through an OCR and masking procedure. Any identified text is extracted and saved to a separate .csv file (below). If the user decides to obfuscate the file names, each file is renamed with a random series of alphanumeric characters and the original filename is added to the .csv file, which then doubles as a crosswalk file. Additionally, the images are cleaned by eliminating extraneous elements. The region of interest (ROI) is then isolated using a sequence of filtering, morphological, and geometric operations (several versions of which are explained below).
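
The renaming and crosswalk idea can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not PyLogik's implementation: the column names, the helper names, and the choice to keep the original file extension are all hypothetical.

```python
import csv
import secrets
import string
import tempfile
from pathlib import Path

ALPHANUMERICS = string.ascii_letters + string.digits


def random_name(length: int = 10) -> str:
    """Return a random alphanumeric string (10 characters by default)."""
    return "".join(secrets.choice(ALPHANUMERICS) for _ in range(length))


def build_crosswalk(filenames, csv_path):
    """Map each original filename to a unique random name and save the pairs."""
    mapping = {}
    used = set()
    for original in filenames:
        new_id = random_name()
        while new_id in used:  # regenerate in the unlikely event of a clash
            new_id = random_name()
        used.add(new_id)
        # Assumption: the original extension is kept on the new name.
        mapping[original] = new_id + Path(original).suffix
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["original_filename", "new_filename"])  # hypothetical headers
        writer.writerows(mapping.items())
    return mapping


# Usage: write a crosswalk for two made-up filenames and read it back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "crosswalk.csv"
    mapping = build_crosswalk(["patient_a.dcm", "patient_b.dcm"], csv_path)
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
```

Because the crosswalk file links the new names back to the originals, it should be stored with the same protections as the source data.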

Image by author

In this work, we showcase the usefulness of our available algorithms and provide guidance on how end-users can integrate them into their respective applications. Our software, PyLogik, can be installed from the terminal with the command pip install PyLogik. We have designed several functions that can be used either jointly or separately in the pipeline. The software supports various image types, including 2D (grayscale), 3D (grayscale with multiple frames, or 3-channel RGB), and 4D (multiple frames with RGB information), and can read DICOM, .png, .jpg, .jpeg, and .nii (NIfTI) files. Any skipped files and their processing details are recorded in log files in the destination folder.

The general workflow of my program is as follows:

Image by author

Our functions can be imported and utilized in the following manner. Install the library using the terminal:

$ pip install pylogik

Import libraries:

from pylogik import deid
from pylogik import im_analysis

There are various functions available to the user:

Image by author

This methodology was initially designed for de-id’ing ultrasound images, but was then expanded to cover other imaging modalities where users may not want the ‘cleaning’ portion to be as restrictive. We’ll briefly touch on the various options.

1. This is for de-id ONLY. It removes burned-in text from the image, writes the text to a .csv file along with the file name (for crosswalk purposes), and writes the image frame(s) to lossless JPEG(s).

deid.deid(input_path = "path_to_files", output_path="path_to_save_files",
rename_files = False, threshold = 0)
  • input_path: path to image files (DICOM, JPEG, or video)
  • output_path: path to save new image files and .csv text files
  • rename_files: False (default); when True, changes the filename to a series of 10 randomly selected alphanumerics
  • threshold: 0 (default); integer value of the threshold in the image. If unsure, use the default or use a color select tool to capture the background intensity from a sample image
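
For readers without a color select tool at hand, the sketch below shows one way to pick a background intensity from a sample frame, and what masking a detected text region amounts to. Both helpers are hypothetical and are not part of PyLogik; the box format (row_min, row_max, col_min, col_max) and the use of the most common pixel value as the background are assumptions of this example.

```python
import numpy as np


def estimate_background(frame: np.ndarray) -> int:
    """Return the most common intensity in a grayscale uint8 frame."""
    counts = np.bincount(frame.ravel(), minlength=256)
    return int(counts.argmax())


def mask_boxes(frame: np.ndarray, boxes, fill: int) -> np.ndarray:
    """Overwrite each (row_min, row_max, col_min, col_max) box with `fill`."""
    cleaned = frame.copy()  # leave the input frame untouched
    for r0, r1, c0, c1 in boxes:
        cleaned[r0:r1, c0:c1] = fill
    return cleaned


# Toy frame: black background, a bright block standing in for burned-in
# text, and a single gray pixel standing in for anatomy.
frame = np.zeros((6, 8), dtype=np.uint8)
frame[1:3, 1:5] = 255
frame[4, 6] = 90

background = estimate_background(frame)
cleaned = mask_boxes(frame, [(1, 3, 1, 5)], fill=background)
```

The estimated value could then be tried as the threshold argument, keeping in mind that a frame dominated by bright content would need a manually chosen value instead.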

2. This is for de-id and cleaning specific to ultrasound data. This was actually the first one I created, and it has some cool geometric comparisons that run to just keep a very clean ROI. This removes burned-in text from the image, writes it to a .csv file (with the file name for crosswalk purposes), processes and compresses images according to methods outlined in the associated paper, and writes image frame(s) to lossless JPEG(s).

deid.deid_us(input_path = "path_to_files", output_path="path_to_file_save",
rename_files=False, threshold=0)
  • input_path: path to image files (DICOM, JPEG, or video)
  • output_path: path to save new image files and .csv text files
  • rename_files: False (default) changes the filename to a series of 10 randomly selected alphanumerics
  • threshold : 0 (default) integer value of the threshold in the image (default = 0). If unclear to the user, can use default or use color select tool to capture background intensity from sample image

3. This is for de-id and cleaning, where only the largest salient item in the image is saved. It removes burned-in text from the image, writes it to a .csv file (with the file name for crosswalk purposes), keeps the single most salient item in the picture, compresses accordingly, and writes image frame(s) to lossless JPEG(s).

deid.deid_one(input_path = "path_to_files", output_path="path_to_file_save",
rename_files=False, threshold = 0)
  • input_path: path to image files (DICOM, JPEG, or video)
  • output_path: path to save new image files and .csv text files
  • threshold : 0 (default) integer value of the threshold in the image (default = 0). If unclear to the user, can use default or use color select tool to capture background intensity from the sample image
  • rename_files: False (default) changes the filename to a series of 10 randomly selected alphanumerics

4. This is for de-id and cleaning, where only small objects are filtered out and multiple large entities remain in the image. It removes burned-in text from the image, writes it to a .csv file with the file name (for crosswalk purposes), removes small-scale features, and writes image frame(s) to lossless JPEG(s).

deid.deid_clean(input_path = "path_to_files", output_path="path_to_save_files",
rename_files=False, threshold = 0)
  • input_path: path to image files (DICOM, JPEG, or video)
  • output_path: path to save new image files and .csv text files
  • rename_files: False (default) change the filename to a series of 10 randomly selected alphanumerics
  • threshold : 0 (default) integer value of the threshold in the image (default = 0). If unclear to the user, can use default or use color select tool to capture background intensity from sample image
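
The difference between options 3 and 4 (keep only the largest item versus drop only the small ones) can be illustrated with a simple connected-component pass over a binary mask. This is a sketch, not PyLogik's method: the 4-connectivity, the `min_size` parameter, and all function names are assumptions made for the example.

```python
from collections import deque

import numpy as np


def label_components(mask: np.ndarray):
    """Label 4-connected True regions; sizes[i] is the size of label i + 1."""
    labels = np.zeros(mask.shape, dtype=int)
    sizes = []
    rows, cols = mask.shape
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or labels[r, c]:
                continue  # background pixel, or already labeled
            current = len(sizes) + 1
            labels[r, c] = current
            queue = deque([(r, c)])
            size = 0
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny, nx] and not labels[ny, nx]:
                        labels[ny, nx] = current
                        queue.append((ny, nx))
            sizes.append(size)
    return labels, sizes


def keep_largest(mask: np.ndarray) -> np.ndarray:
    """Keep only the biggest region (the spirit of option 3)."""
    labels, sizes = label_components(mask)
    if not sizes:
        return np.zeros(mask.shape, dtype=bool)
    return labels == (int(np.argmax(sizes)) + 1)


def drop_small(mask: np.ndarray, min_size: int) -> np.ndarray:
    """Remove regions smaller than min_size (the spirit of option 4)."""
    labels, sizes = label_components(mask)
    keep = [i + 1 for i, size in enumerate(sizes) if size >= min_size]
    return np.isin(labels, keep)


# Toy mask: a 9-pixel block, an 8-pixel block, and a 1-pixel speck.
mask = np.zeros((6, 8), dtype=bool)
mask[0:3, 0:3] = True
mask[4:6, 4:8] = True
mask[0, 7] = True

largest = keep_largest(mask)
filtered = drop_small(mask, min_size=4)
```

On the toy mask, `keep_largest` retains only the 9-pixel block, while `drop_small` keeps both blocks and removes the speck.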

5. This is for when you simply want to detect the text in your image(s) and read it out to CSV, without outputting any images. It only finds text in the image and writes it to a series of CSV files in the specified output folder; it does not write images.

deid.find_txt(input_path = "path_to_files", output_path="path_to_save_files")
  • input_path: path to image files (DICOM, JPEG, or video)
  • output_path: path to save the .csv text files
  • thresh: integer value of the threshold in the image (default = 0). If unsure, use the default or use a color select tool to capture the background intensity from a sample image

6. These are some additional functions contained in the package that may be of use for calculating and presenting Dice scores.

A) Dice score calculation:

im_analysis.dice_score(pred_array, true_array, k=1)
  • pred_array: array of the predicted segmentation
  • true_array: array of the ground truth segmentation
  • k: value to perform matching on (default = 1)
  • Returns: dice score (float)
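
For reference, the Dice score for a label k is twice the number of pixels where both arrays equal k, divided by the total number of pixels equal to k in each array. The NumPy sketch below shows that calculation; it is an independent illustration rather than the package's source, and the function name and the choice to return 1.0 when both arrays lack the label are assumptions.

```python
import numpy as np


def dice_sketch(pred_array, true_array, k=1) -> float:
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for the pixels labeled k."""
    pred_k = np.asarray(pred_array) == k
    true_k = np.asarray(true_array) == k
    total = pred_k.sum() + true_k.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    overlap = np.logical_and(pred_k, true_k).sum()
    return float(2.0 * overlap / total)


pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
true = np.array([[1, 0, 0],
                 [0, 1, 1]])

# Each mask has 3 labeled pixels and they share 2, so Dice = 4 / 6.
score = dice_sketch(pred, true, k=1)
```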

B) Visualization of dice calculation

im_analysis.imshowpair(pred_array, true_array, color1 = (124,252,0),
color2 = (255,0,252), show_fig=True)
  • pred_array: array of the predicted segmentation
  • true_array: array of the ground truth segmentation
  • color1: first color to show unique values from the first image
  • color2: second color to show unique values from the second image
  • Returns: array and graphical plot
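
A rough idea of the kind of overlay such a comparison produces can be sketched in NumPy alone. This is not the package's implementation: the function name is hypothetical, and showing the overlap in white is an assumption of this example.

```python
import numpy as np


def overlay_pair(pred_array, true_array,
                 color1=(124, 252, 0), color2=(255, 0, 252)) -> np.ndarray:
    """Build an RGB image highlighting where two binary masks differ."""
    pred_b = np.asarray(pred_array).astype(bool)
    true_b = np.asarray(true_array).astype(bool)
    rgb = np.zeros(pred_b.shape + (3,), dtype=np.uint8)
    rgb[pred_b & ~true_b] = color1          # only in the first mask
    rgb[true_b & ~pred_b] = color2          # only in the second mask
    rgb[pred_b & true_b] = (255, 255, 255)  # assumption: overlap shown in white
    return rgb


pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
true = np.array([[1, 0, 0],
                 [0, 1, 1]])
rgb = overlay_pair(pred, true)
```

The resulting array can be passed to any image viewer or plotting library that accepts an (height, width, 3) uint8 image.
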
Image by author
Image sourced from [1] (this is using the ‘deid_us’ function)

Upon revisiting the aims set forth in the introduction, we have successfully developed a robust protocol for de-identification, sequestering of relevant patient information, ROI identification, and file compression of medical images. Previous work has focused on training CNNs to detect and remove solely PHI-related information contained within the image, with varying levels of success, ranging from 65–89% [1–3]. Moreover, some of these techniques are operating system-specific or only available at a cost [4]. The PyLogik package addresses these problems by removing direct patient identifiers, writing the extracted text to a .csv file, correctly identifying the ROI, and compressing the result. By simplifying the deep learning problem and removing all text, PyLogik avoids the risk of confusing similar-looking character strings such as “Bg” and “B9”, “B1” and “Bl”, or “B0” and “Bo”. This allows individual sites to apply the necessary context-specific filtering to their .csv files and ensures a higher efficacy of PHI removal. PyLogik runs on any OS and is free for researchers to download, so it can run on servers behind institutional firewalls.

Because all text is extracted and then masked, the .csv files output by the pipeline allow end-users to query, include, or destroy information for their specific uses. Our strategy also facilitates better multimodal integration of data. For example, in echocardiographic images, the heart rate is often displayed as text in each view; PyLogik retains this information and makes it available to end-users for information fusion (early, joint, and late) during algorithm development [5]. Images are saved and output as JPEG stacks to decrease the number of specialized libraries and coding platforms needed to re-import the images for processing [6]. By truncating the image to contain only the ROI(s), we retain only salient information, thus facilitating compression on secondary non-PACS servers.
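
To see why truncating to the ROI saves space, consider cropping a frame to the bounding box of everything brighter than the background. The sketch below is a simplified stand-in for the ROI step, not PyLogik's geometric procedure, and the function name and `background` parameter are assumptions. The pixel-count saving it reports is only illustrative and is not the file-size compression figure quoted in this article.

```python
import numpy as np


def crop_to_roi(frame: np.ndarray, background: int = 0) -> np.ndarray:
    """Crop a 2D frame to the bounding box of pixels above `background`."""
    foreground = frame > background
    rows = np.flatnonzero(foreground.any(axis=1))
    cols = np.flatnonzero(foreground.any(axis=0))
    if rows.size == 0:
        return frame[:0, :0]  # nothing above background: empty crop
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Toy frame: a 4 x 5 bright region inside a 10 x 10 black frame.
frame = np.zeros((10, 10), dtype=np.uint8)
frame[2:6, 3:8] = 200

cropped = crop_to_roi(frame)
pixel_saving = 1 - cropped.size / frame.size  # fraction of pixels dropped
```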

Image by author

In addition to providing an efficient de-identification and image-cleaning protocol that facilitates leveraging ultrasound images in aggregate for algorithm development, our proposed method offers up to 72% compression compared with the original DICOM files. Not only does this have implications for the long-term, cloud-based storage of these large files, but it also frees up short-term storage and memory for machine learning applications (e.g., batch processing), a benefit that is not discussed in other publications of this nature. The images are saved as lossless JPEGs, where ‘lossless’ means that the saved ROI has the same spatial resolution as in the original image format. The package is designed to be modular, with a separate class for those who want only the de-identification portion of the pipeline. This part of the pipeline may be easily extended to other imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT), and other radiographs.

In short, PyLogik is an open-source Python library that provides a state-of-the-art (SOTA) de-identification algorithm applicable to multiple medical imaging modalities while simultaneously compressing images and preparing data for machine learning experiments. It is easy to install, operating system agnostic, can run behind institutional firewalls, makes use of GPU computing if available, and performs batch processing. We make this tool freely available to researchers as an alternative to the expensive fee-for-service or less efficacious free options currently available.

While automated data cleaning is desirable, rarely is an automated de-identification effort perfect. The risk of PHI leakage is of utmost concern due to the legal and ethical ramifications. We urge the research community to test the protocol on their respective systems for image de-identification. Future work includes updates to the software package and incorporating feedback to make it more generalizable (including changing output formats) as adoption grows.

For more info on the function calls available and other documentation, read the paper [7].

Image by author

References

[1] E. Monteiro, C. Costa, J. L. Oliveira, A de-identification pipeline for ultrasound medical images in dicom format, Journal of medical systems, 41 (5) (2017) 1–16.
[2] L. Fezai, T. Urruty, P. Bourdon, C. Fernandez-Maloigne, Deep anonymization of medical imaging, Multimedia Tools and Applications (2022) 1–15.
[3] L.-C. Huang, H.-C. Chu, C.-Y. Lien, C.-H. Hsiao, T. Kao, Privacy preservation and information security protection for patients’ portable electronic health records, Computers in Biology and Medicine 39 (9) (2009) 743–750.
[4] D. Rodriguez Gonzalez, T. Carpenter, J. I. van Hemert, J. Wardlaw, An open-source toolkit for medical imaging de-identification, European radiology 20 (8) (2010) 1896–1904.
[5] A. Kline, H. Wang, Y. Li, S. Dennis, M. Hutch, Z. Xu, F. Wang, F. Cheng, Y. Luo, Multimodal machine learning in precision health: A scoping review, npj Digital Medicine 5 (1) (2022) 1–14.
[6] B. Liu, M. Zhu, Z. Zhang, C. Yin, Z. Liu, J. Gu, Medical image conversion with dicom, in: 2007 Canadian Conference on Electrical and Computer Engineering, IEEE, 2007, pp. 36–39.
[7] A. Kline, V. Appadurai, Y. Luo, S. Sanjiv, Medical Image Deidentification, Cleaning and Compression Using Pylogik, https://arxiv.org/abs/2304.12322


