The Road to Better Care – and Outcomes

The widespread use of free-form narrative text and nonstandardized data in electronic health record (EHR) systems and subsystems is the single greatest impediment to interoperability and the development of smoothly running and genuinely beneficial data-exchange platforms.

Why is free-form text a problem for interoperability?

Unstructured data hinder the development of data warehouses, analytics, and decision-support tools, thwarting our ability to use available information to decrease adverse events and medical costs and to develop better treatments. They also diminish our ability to aggregate, parse, analyze, and consolidate data from disparate medical and ancillary sources so that concise, correct, and complete patient summaries and care plans can be generated and shared. Ultimately, the prevalence of unstructured data impedes the development of solutions to improve the quality of care and outcomes for patients.

How does the California Cancer Registry (CCR) plan to support interoperability?

CCR plans to establish a data-exchange platform that employs standardized, structured cancer incidence data. CCR’s goals include enabling EHR systems to use these data and, in doing so, furthering interoperability beyond cancer. To build this platform and achieve these goals, however, significant changes will have to be made.

The reporting of cancer incidence data will need to be instituted using structured data, supported by standards that are maintained across the entire state. Structured data allow for the automation of linkage and consolidation processes that contribute to interoperability. An official vocabulary for cancer terms also will have to be adopted, so that every hospital and doctor can speak the same official cancer language and not a cancer “dialect” when, for example, describing a tumor.

How is a tumor diagnosed?

A cancer diagnosis starts with a tissue analysis and a pathology report. Until a pathology report identifies cancer, it is still “maybe,” “likely” or “suspicious for” cancer. The tissue diagnosis is the gold standard. The pathology reports, though, are generally seminarrative or semistructured, and the language used is the dialect of the local community in which the pathologist works.

Importantly, the report has a diagnosis section in which the most vital information–the diagnosis itself–is found. The content within that diagnostic information must be sufficiently complete to drive treatment decisions. For example, after tumors are measured, the measurement becomes part of their “stage,” essentially marking how far along they are in their potential to spread.

A grade is also assigned, which measures how indolently or aggressively the tumor is expected to behave. Different tumor types have other measurements attached that affect how a patient is treated, such as estrogen receptor and progesterone receptor status. With the progress of precision medicine, the list of required content can grow even longer. These different pieces of information are the individual “data elements” that combine to form the content of the diagnosis. Without this content, the patient cannot be started on a therapeutic regimen.

How do “data” address the diagnosis?

To ensure that pathology reports are sufficiently complete to start therapy, the College of American Pathologists (CAP), the leading professional organization of pathologists for setting standards in reporting and testing, has developed a set of “cancer checklists” that define required content. This content is made up of a series of data elements, which are typically related in hierarchical fashion within a schema, or an ontological tree.

In addition, each data element has a set of allowable values from which the appropriate “data value” can be selected. Data elements describe the content of the data unit, and the data values are the list of possibilities that can be used to define the content for a unique patient. As an example, the tumor type tree has “histological type” as a member data element and “invasive lobular carcinoma” as an allowable data value for that data element.
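
The relationship between data elements and their allowable data values can be sketched in a few lines of Python. This is only an illustration, not CAP's actual schema: the identifier spellings, the second histologic type, and the grade values are assumptions chosen for the example.

```python
# Hypothetical, simplified schema. Element names and value lists are
# illustrative assumptions, not copied from an actual CAP checklist.
ALLOWABLE_VALUES = {
    "histologic_type": {
        "invasive lobular carcinoma",  # the allowable value named in the text
        "invasive ductal carcinoma",   # assumed second value, for illustration
    },
    "nottingham_grade": {"1", "2", "3"},  # assumed value list
}


def validate(record):
    """Return a list of problems in a {data element: data value} record."""
    problems = []
    for element, value in record.items():
        allowed = ALLOWABLE_VALUES.get(element)
        if allowed is None:
            problems.append(f"unknown data element: {element}")
        elif value not in allowed:
            problems.append(f"value not allowed for {element}: {value}")
    return problems


# A record that uses an allowable value passes; a local abbreviation does not.
print(validate({"histologic_type": "invasive lobular carcinoma"}))
print(validate({"histologic_type": "ILC"}))
```

Because every value must come from a fixed list, a receiving system can check a record mechanically instead of interpreting free text.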

CAP has taken this schema a step further, using it to provide a standard means of transferring discretized data out of the pathology reporting system. CAP has worked with software vendors to have these data captured by direct input by the pathologist. These bundles of discretized data are called the eCCs, for electronic Cancer Checklists. The pathologist can complete the seminarrative diagnosis section and then select the appropriate data values for each data element, generally in a worksheet. This is called discretized data capture (DDC).

How are data collected right now?

An alternative to structured data is natural language processing (NLP), a combination of artificial intelligence and computational linguistics that attempts to recognize first which data element is being defined, and then to translate the individual pathologist’s “dialect” into the allowable data values. As an example, a breast cancer diagnosis line may read: “INVASIVE LOBULAR CARCINOMA, CLASSIC TYPE, NOTTINGHAM GRADE 2, 1.2 CM”

The NLP algorithm has to break these up as histologic diagnosis, histologic subclass, tumor grading system, tumor grade, and tumor size. Note that there is a comma between all of these items, except for tumor grading system and tumor grade, even though they are two different data elements. This complicates the situation, as a simple rule of breaking down data elements using the comma as a delimiter would not work. Also consider that other pathologists may utilize a different sequence of data elements.
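
The comma problem is easy to reproduce. The sketch below applies the naive comma rule to the diagnosis line quoted above; the data element identifiers are hypothetical spellings of the five elements just listed.

```python
# The diagnosis line quoted in the text.
line = "INVASIVE LOBULAR CARCINOMA, CLASSIC TYPE, NOTTINGHAM GRADE 2, 1.2 CM"

# The five data elements the line contains (identifier spellings are
# hypothetical).
EXPECTED_ELEMENTS = [
    "histologic_diagnosis",
    "histologic_subclass",
    "tumor_grading_system",
    "tumor_grade",
    "tumor_size",
]

# Naive rule: treat every comma as a boundary between data elements.
parts = [part.strip() for part in line.split(",")]

# The split yields fewer fragments than there are data elements, because
# the grading system and the grade share a single fragment.
print(len(parts), "fragments for", len(EXPECTED_ELEMENTS), "data elements")
print(parts[2])
```

The third fragment still holds two data elements, so a further rule would be needed to separate them, and that rule would break again for a pathologist who orders the elements differently.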

In addition, terminologies may vary. For some pathologists, “invasive mammary carcinoma” means a mixture of several histologic types of breast cancer. Other pathologists use it to mean that the tumor is not of a special type. It is difficult for an NLP algorithm to determine which “dialect” the pathologist is using. There are additional complexities, too. For example, using single-term recognition to identify cancer cases, such as flagging any diagnosis line that contains the word “cancer,” means that “no cancer present” would be identified as a cancer case.
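
A minimal sketch of single-term recognition shows the false positive. The function name is hypothetical and the rule is deliberately simplistic; it is not meant to represent how any particular NLP product works.

```python
def naive_is_cancer_case(diagnosis_line):
    """Flag a line as a cancer case if it contains the word 'cancer'."""
    return "cancer" in diagnosis_line.lower()


# A true positive, followed by the false positive described in the text.
print(naive_is_cancer_case("INVASIVE BREAST CANCER"))
print(naive_is_cancer_case("NO CANCER PRESENT"))
```

Both lines are flagged, although the second one explicitly rules cancer out. Handling negation correctly is one of the things that makes free-text processing harder than it first appears.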

It turns out that within a closed clinical community, a single dialect with minimal variability is generally adopted. As such, an NLP algorithm could be tuned to that community’s terminology with a fairly low error rate. However, if the algorithm has to be tuned to multiple communities, or recalibrated regularly because of terminology “drift” or version changes in the eCCs, then the problem remains and the error rate can be high.

Do the advantages of using DDC outweigh the disadvantages? Yes!

DDC does have disadvantages. The pathologist has to take the time to manually select the data elements and assign their values while generating a report. The defined standards need widespread acceptance, and pathologists must be willing to invest time in filling out worksheets. Some of these disadvantages can be mitigated by using templates that are structured similarly across multiple tumor types and that are designed to create hierarchies that reduce “clicks” (computer selections).

A major advantage of DDC is that electronically captured data can be electronically deposited directly into a repository with little or no manual processing required. Electronic capture provides for efficient, detailed, timely, and accurate data collection. This extends the life cycle of the pathology report and allows for technology to be created or deployed in support of data use.


State and national cancer registry programs, which historically have not been widely viewed as part of the healthcare continuum, can play an important role within a meaningful data-exchange platform for cancer diagnosis and treatment, sitting atop a strong information foundation of interoperable data. The new vision of real-time cancer surveillance includes real-time communication with cancer-data reporting facilities. Real-time communication must include interoperable data.

CCR is actively working toward a reality in which cancer incidence data are received, aggregated, and stored automatically. Real-time data storage provides multiple advantages for CCR and reporting facilities, including descriptive and predictive analytics derived from structured and interoperable data. Through dissemination of descriptive and predictive analytics, CCR has prioritized working with California’s health systems with the bottom-line objective of improving patient outcomes.