Medical imaging applications and systems that manage large collections of DICOM images usually need some kind of database for browsing and selecting images or image collections, for curation and control of ML training tasks, for batch analysis, etc. The goal of this project is to investigate existing and new approaches to handling the metadata of large image collections for different purposes, to create experimental setups, and to report on the results.
- DICOM objects contain rich metadata.
- Depending on the use case, record linkage to non-imaging data might be an additional requirement.
- Extracted metadata can be represented in different JSON styles and stored in document databases like CouchDB, Apache OpenSearch, etc.
- There is a FHIR ImagingStudy resource (https://www.hl7.org/fhir/imagingstudy.html); FHIR data could be stored in FHIR stores or in regular SQL databases.
- Custom approaches exist, such as the CTK DICOM database or IDC's representation in BigQuery; flattened FHIR in SQL databases combined with object stores has also been observed.
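To illustrate the FHIR route, a minimal ImagingStudy resource could look roughly like the sketch below (R4-style, built as a Python dict; all UIDs and references are made up for illustration and are not from this project):

```python
# Hypothetical sketch of a minimal FHIR R4 ImagingStudy resource as a Python
# dict; all UIDs and the patient reference are invented for illustration.
import json

imaging_study = {
    "resourceType": "ImagingStudy",
    "status": "available",
    "subject": {"reference": "Patient/example"},
    # DICOM Study Instance UID, expressed as an OID-style identifier
    "identifier": [{
        "system": "urn:dicom:uid",
        "value": "urn:oid:1.2.840.99999.1",
    }],
    "series": [{
        "uid": "1.2.840.99999.1.1",
        "modality": {
            "system": "http://dicom.nema.org/resources/ontology/DCM",
            "code": "CT",
        },
        "instance": [{
            "uid": "1.2.840.99999.1.1.1",
            "sopClass": {
                "system": "urn:ietf:rfc:3986",
                "code": "urn:oid:1.2.840.10008.5.1.4.1.1.2",  # CT Image Storage
            },
        }],
    }],
}
print(json.dumps(imaging_study, indent=2))
```

Such resources could be posted to a FHIR store as-is, or flattened into SQL tables as mentioned above.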
DICOM-to-JSON conversion can follow the DICOM JSON model (https://dicom.nema.org/medical/dicom/current/output/chtml/part18/chapter_F.html), e.g. using DCMTK or custom approaches, but generic metadata extractors like Apache Tika could also be an option.
> [!CAUTION]
> SQLite's arrow operator `->` seems to behave differently if the key consists solely of digits. This can lead to surprises when working with DICOM tags, whose hexadecimal representation sometimes contains letters, but sometimes only digits.
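A sketch of a robust alternative, as used in the queries below: pass an explicit JSON path to `json_extract`, optionally double-quoting the all-digit key inside the path. The table layout mirrors the snippets in this document, and the stored row is made up:

```python
# Sketch: extracting an all-digit DICOM tag key from JSON stored in SQLite
# via json_extract with an explicit path, avoiding the `->` operator's
# ambiguity for digit-only keys. Table name and content are illustrative.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# A minimal DICOM JSON model fragment: tag (0008,0060) Modality = "CT"
doc = json.dumps({"00080060": {"vr": "CS", "Value": ["CT"]}})
conn.execute("CREATE TABLE dicom_files (jsonb_data TEXT)")
conn.execute("INSERT INTO dicom_files VALUES (?)", (doc,))

# Double-quoting the digit-only key in the path makes the intent explicit:
row = conn.execute(
    """SELECT json_extract(jsonb_data, '$."00080060".Value[0]')
       FROM dicom_files"""
).fetchone()
print(row[0])  # -> CT
conn.close()
```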
```sql
-- Count records per modality (0008,0060)
SELECT
  COUNT(*) AS [Number of records],
  json_extract(jsonb_data, '$.00080060.Value') AS modality
FROM dicom_files
GROUP BY json_extract(jsonb_data, '$.00080060.Value');
```
It's also possible to create an index on extracted fields, e.g. on the modality:

```sql
CREATE INDEX modality ON dicom_files (json_extract(jsonb_data, '$.00080060.Value'));
```
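Whether SQLite actually uses such an expression index can be checked with `EXPLAIN QUERY PLAN`; note that the query has to repeat the indexed expression exactly. A small sketch (the schema mirrors this document; the queried value is made up):

```python
# Sketch: verifying that an expression index on the extracted modality is
# usable. Schema follows the document; no real DICOM data is involved.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dicom_files (filename TEXT PRIMARY KEY, jsonb_data TEXT)"
)
conn.execute(
    "CREATE INDEX modality ON dicom_files "
    "(json_extract(jsonb_data, '$.00080060.Value'))"
)

# The WHERE clause repeats the exact indexed expression:
plan = conn.execute(
    """EXPLAIN QUERY PLAN
       SELECT COUNT(*) FROM dicom_files
       WHERE json_extract(jsonb_data, '$.00080060.Value') = '["CT"]'"""
).fetchall()
print(plan)  # the detail column should mention the index "modality"
conn.close()
```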
```sql
-- Count number of series per modality
-- SQLite
-- Note: (0020,000E) is SeriesInstanceUID; (0020,000D) would be StudyInstanceUID
SELECT
  json_extract(jsonb_data, '$.00080060.Value') AS modality,
  COUNT(DISTINCT json_extract(jsonb_data, '$.0020000E.Value')) AS num_series
FROM
  dicom_files
GROUP BY
  modality
ORDER BY
  num_series DESC;
```
```sql
-- Count number of series per modality
-- IDC BigQuery
SELECT
  Modality,
  COUNT(DISTINCT SeriesInstanceUID) AS num_series
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  Modality
ORDER BY
  num_series DESC;
```
```sql
-- Find all series which contain a LOCALIZER ImageType
-- SQLite JSON
-- (0020,000E) SeriesInstanceUID; (0008,0008) ImageType is a multi-valued attribute
SELECT DISTINCT
  json_extract(jsonb_data, '$.0020000E.Value') AS series_uid
FROM
  dicom_files,
  json_each(json_extract(jsonb_data, '$.00080008.Value'))
WHERE
  json_each.value = 'LOCALIZER';
```
```sql
-- Find all series which contain a LOCALIZER ImageType
-- IDC BigQuery
WITH
  ImageTypeAgg AS (
    SELECT
      ARRAY_TO_STRING(ImageType, '/') AS image_type_str,
      SeriesInstanceUID
    FROM
      `bigquery-public-data.idc_current.dicom_all`
  )
SELECT
  SeriesInstanceUID,
  image_type_str
FROM
  ImageTypeAgg
WHERE
  image_type_str LIKE '%LOCALIZER%';
```
Ideas for future experiments:
Sample code to populate a SQLite database with DICOM metadata could look like this:
```python
#!/usr/bin/env python
"""Populate SQLite with DICOM metadata.

This script recursively crawls a directory and populates a SQLite database
with JSON representations of the DICOM files found.

Warning: highly experimental code, use at your own risk!

Usage: populate_db.py <db_path> <crawl_directory>
"""
import json
import os
import sqlite3
import sys

import pydicom


def bulk_data_handler(data_element):
    # We are not interested in any bulk data
    return None


db_path = sys.argv[1]
crawl_directory = sys.argv[2]

# Connect to the database
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

# Create the table if it doesn't exist
cursor.execute('''
    CREATE TABLE IF NOT EXISTS dicom_files (
        filename TEXT PRIMARY KEY NOT NULL,
        jsonb_data BLOB
    )
''')
conn.commit()

# Iterate over the crawl directory and insert JSON dumps of all files into the database
for root, dirs, files in os.walk(crawl_directory):
    for file_name in files:
        dicom_file_path = os.path.join(root, file_name)
        # Skip files that are already in the database
        cursor.execute("SELECT filename FROM dicom_files WHERE filename = ?",
                       (dicom_file_path,))
        if cursor.fetchone():
            continue
        try:
            ds = pydicom.dcmread(dicom_file_path)
            json_data = ds.to_json_dict(bulk_data_element_handler=bulk_data_handler)
        except pydicom.errors.InvalidDicomError:
            print(f"Skipped {dicom_file_path}")
            continue
        cursor.execute('''
            INSERT INTO dicom_files (filename, jsonb_data)
            VALUES (?, jsonb(?))
        ''', (dicom_file_path, json.dumps(json_data)))
        conn.commit()
        print(f"Inserted DICOM file: {dicom_file_path}")

conn.close()
```
https://github.com/nolden/namic-pw24/