LabKey Kicks Off Abstraction and NLP Pipeline Project for NCI SEER

In support of the National Cancer Institute (NCI) and the Department of Energy (DoE) initiative to use large-scale computing to influence cancer science, NCI’s Surveillance, Epidemiology, and End Results (SEER) Program has partnered with LabKey to develop an abstraction workflow and Natural Language Processing (NLP) pipeline that will automate the annotation and review of free-text pathology reports.

The NCI SEER Program works to provide information on cancer statistics in an effort to reduce the burden of cancer among the U.S. population. SEER currently collects and publishes cancer incidence and survival data from population-based cancer registries covering approximately 30 percent of the U.S. population. The registries receive at least one unstructured pathology report for the more than 450,000 cases reported annually, and these reports are used in conjunction with other sources to abstract relevant information on the cases.

The initial version of the application will allow SEER to identify and select pathology reports of interest using a Linguamatics-based text-mining tool and make them available in a LabKey Server portal for manual annotation and abstraction of key elements. An annotation and task management pipeline will manage the stages of the abstraction, annotation, and review process, ensuring consistency by standardizing tasks and automating the workflow. To ensure the security of data being processed and generated, the application will utilize LabKey Server’s role-based security model and facilities that ensure regulatory compliance.

Future phases of development will introduce the use of Natural Language Processing engines to further accelerate the annotation process. Manually abstracted data from the initial phase of the project will be used by DoE laboratories to help develop and train NLP algorithms that will then be used to automate the abstraction of large numbers of pathology reports.

