Towards a Scalable Clinical Data Annotation and Processing Pipeline to Support Cancer Surveillance

Presented By

Paul A. Fearn, Ph.D., MBA, Chief, Surveillance Informatics Branch, Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute (NCI)


Central cancer registries funded by the NCI’s Surveillance, Epidemiology, and End Results (SEER) program collect cancer diagnosis, treatment, and survival data for about 30% of the US population. This national resource supports thousands of cancer research projects and underpins national cancer statistics. The SEER program is investing in tools, processes, and pilot projects to advance, standardize, and scale the application of computational methods (e.g., natural language processing, machine learning) to information extraction, de-identification, and data quality improvement. The LabKey Server NLP Pipeline is an integral part of this work.

This presentation will cover the background and goals of the SEER program, work in progress to create and scale processes for clinical annotation and automation, and the role of LabKey, including enhancements made to it, in supporting pilot projects.

