Medical imaging applications and systems that manage large collections of DICOM images usually need some kind of database for browsing and selecting images or image collections, for curation and control of ML training tasks, for batch analysis, etc. The goal of this project is to investigate existing and new approaches to handling the metadata of large image collections for different purposes, to create experimental setups, and to report on the results.
- DICOM objects contain rich metadata.
- Depending on the use case, record linkage to non-imaging data might be an additional requirement.
- Extracted metadata can be represented in different JSON styles and stored in document databases like CouchDB, Apache OpenSearch, etc.
- There is a FHIR ImagingStudy resource (https://www.hl7.org/fhir/imagingstudy.html); FHIR data could be stored in FHIR stores or in regular SQL databases.
- Custom approaches exist, such as the CTK DICOM database or IDC's representation in BigQuery; flattened FHIR in SQL databases combined with object stores has also been observed.
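To illustrate the FHIR route, a minimal ImagingStudy resource could look roughly like the sketch below (R4-style, built as a Python dict; all UIDs and references are made up for illustration and are not from this project):

```python
# Hypothetical sketch of a minimal FHIR R4 ImagingStudy resource as a Python
# dict; all UIDs and the patient reference are invented for illustration.
import json

imaging_study = {
    "resourceType": "ImagingStudy",
    "status": "available",
    "subject": {"reference": "Patient/example"},
    # DICOM Study Instance UID, expressed as an OID-style identifier
    "identifier": [{
        "system": "urn:dicom:uid",
        "value": "urn:oid:1.2.840.99999.1",
    }],
    "series": [{
        "uid": "1.2.840.99999.1.1",
        "modality": {
            "system": "http://dicom.nema.org/resources/ontology/DCM",
            "code": "CT",
        },
        "instance": [{
            "uid": "1.2.840.99999.1.1.1",
            "sopClass": {
                "system": "urn:ietf:rfc:3986",
                "code": "urn:oid:1.2.840.10008.5.1.4.1.1.2",  # CT Image Storage
            },
        }],
    }],
}
print(json.dumps(imaging_study, indent=2))
```

Such resources could be posted to a FHIR store as-is, or flattened into SQL tables as mentioned above.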
DICOM-to-JSON conversion can follow the DICOM JSON model (https://dicom.nema.org/medical/dicom/current/output/chtml/part18/chapter_F.html), e.g. using DCMTK or custom approaches, but generic metadata extractors like Apache Tika could also be an option.
> [!CAUTION]
> SQLite's arrow operator `->` seems to behave differently if the key consists solely of digits. This can lead to surprises when working with DICOM tags, whose hexadecimal representation sometimes contains letters, but sometimes only digits.
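A sketch of a robust alternative, as used in the queries below: pass an explicit JSON path to `json_extract`, optionally double-quoting the all-digit key inside the path. The table layout mirrors the snippets in this document, and the stored row is made up:

```python
# Sketch: extracting an all-digit DICOM tag key from JSON stored in SQLite
# via json_extract with an explicit path, avoiding the `->` operator's
# ambiguity for digit-only keys. Table name and content are illustrative.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# A minimal DICOM JSON model fragment: tag (0008,0060) Modality = "CT"
doc = json.dumps({"00080060": {"vr": "CS", "Value": ["CT"]}})
conn.execute("CREATE TABLE dicom_files (jsonb_data TEXT)")
conn.execute("INSERT INTO dicom_files VALUES (?)", (doc,))

# Double-quoting the digit-only key in the path makes the intent explicit:
row = conn.execute(
    """SELECT json_extract(jsonb_data, '$."00080060".Value[0]')
       FROM dicom_files"""
).fetchone()
print(row[0])  # -> CT
conn.close()
```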
```sql
-- Count records per modality (0008,0060)
SELECT
  COUNT(*) AS [Number of records],
  json_extract(jsonb_data, '$.00080060.Value') AS modality
FROM dicom_files
GROUP BY json_extract(jsonb_data, '$.00080060.Value');
```
It's also possible to create an index on extracted fields, e.g. on the modality:

```sql
CREATE INDEX modality ON dicom_files (json_extract(jsonb_data, '$.00080060.Value'));
```
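Whether SQLite actually uses such an expression index can be checked with `EXPLAIN QUERY PLAN`; note that the query has to repeat the indexed expression exactly. A small sketch (the schema mirrors this document; the queried value is made up):

```python
# Sketch: verifying that an expression index on the extracted modality is
# usable. Schema follows the document; no real DICOM data is involved.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dicom_files (filename TEXT PRIMARY KEY, jsonb_data TEXT)"
)
conn.execute(
    "CREATE INDEX modality ON dicom_files "
    "(json_extract(jsonb_data, '$.00080060.Value'))"
)

# The WHERE clause repeats the exact indexed expression:
plan = conn.execute(
    """EXPLAIN QUERY PLAN
       SELECT COUNT(*) FROM dicom_files
       WHERE json_extract(jsonb_data, '$.00080060.Value') = '["CT"]'"""
).fetchall()
print(plan)  # the detail column should mention the index "modality"
conn.close()
```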
```sql
-- Count number of series per modality
-- SQLite
-- Note: (0020,000E) is SeriesInstanceUID; (0020,000D) would be StudyInstanceUID
SELECT
  json_extract(jsonb_data, '$.00080060.Value') AS modality,
  COUNT(DISTINCT json_extract(jsonb_data, '$.0020000E.Value')) AS num_series
FROM
  dicom_files
GROUP BY
  modality
ORDER BY
  num_series DESC;
```
```sql
-- Count number of series per modality
-- IDC BigQuery
SELECT
  Modality,
  COUNT(DISTINCT SeriesInstanceUID) AS num_series
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  Modality
ORDER BY
  num_series DESC;
```
```sql
-- Find all series which contain a LOCALIZER ImageType
-- SQLite JSON
-- (0020,000E) SeriesInstanceUID; (0008,0008) ImageType is a multi-valued attribute
SELECT DISTINCT
  json_extract(jsonb_data, '$.0020000E.Value') AS series_uid
FROM
  dicom_files,
  json_each(json_extract(jsonb_data, '$.00080008.Value'))
WHERE
  json_each.value = 'LOCALIZER';
```
```sql
-- Find all series which contain a LOCALIZER ImageType
-- IDC BigQuery
WITH
  ImageTypeAgg AS (
    SELECT
      ARRAY_TO_STRING(ImageType, '/') AS image_type_str,
      SeriesInstanceUID
    FROM
      `bigquery-public-data.idc_current.dicom_all`
  )
SELECT
  SeriesInstanceUID,
  image_type_str
FROM
  ImageTypeAgg
WHERE
  image_type_str LIKE '%LOCALIZER%';
```
Ideas for future experiments:
Sample code to populate a SQLite database with DICOM metadata could look like this:
```python
#!/usr/bin/env python
"""Populate SQLite with DICOM metadata.

This script recursively crawls a directory and populates a SQLite database
with JSON representations of the DICOM files found.

Warning: highly experimental code, use at your own risk!

Usage: populate_db.py <db_path> <crawl_directory>
"""
import json
import os
import sqlite3
import sys

import pydicom


def bulk_data_handler(data_element):
    # We are not interested in any bulk data
    return None


db_path = sys.argv[1]
crawl_directory = sys.argv[2]

# Connect to the database
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

# Create the table if it doesn't exist
cursor.execute('''
    CREATE TABLE IF NOT EXISTS dicom_files (
        filename TEXT PRIMARY KEY NOT NULL,
        jsonb_data BLOB
    )
''')
conn.commit()

# Iterate over the crawl directory and insert JSON dumps of all files into the database
for root, dirs, files in os.walk(crawl_directory):
    for file_name in files:
        dicom_file_path = os.path.join(root, file_name)
        # Skip files that are already in the database
        cursor.execute("SELECT filename FROM dicom_files WHERE filename = ?",
                       (dicom_file_path,))
        if cursor.fetchone():
            continue
        try:
            ds = pydicom.dcmread(dicom_file_path)
            json_data = ds.to_json_dict(bulk_data_element_handler=bulk_data_handler)
        except pydicom.errors.InvalidDicomError:
            print(f"Skipped {dicom_file_path}")
            continue
        cursor.execute('''
            INSERT INTO dicom_files (filename, jsonb_data)
            VALUES (?, jsonb(?))
        ''', (dicom_file_path, json.dumps(json_data)))
        conn.commit()
        print(f"Inserted DICOM file: {dicom_file_path}")

conn.close()
```
https://github.com/nolden/namic-pw24/