About the project
E3C aims to collect and annotate a multilingual corpus of clinical narratives, with the ambition of becoming a European reference resource. A clinical narrative is a statement of a clinical practice, presenting the reason for a clinical visit, the description of physical exams, and the assessment of the patient’s situation. We focus on published clinical narratives because they are often de-identified, which overcomes privacy issues, and because they are rich in clinical entities and temporal information, both of which are almost absent in other clinical documents (e.g. radiological reports). E3C will deal with three types of clinical narratives: discharge summaries, clinical cases published in journals, and clinical cases from medical training resources.
E3C will build a five-language (Italian, English, Spanish, French and Basque) clinical narrative corpus to allow linguistic analysis, benchmarking, and training of information extraction systems. The project will build upon available resources (distributed under open access licenses) and collect new data when necessary. The goal is to harmonise current annotations, introduce new annotation layers, and provide baselines for information extraction tasks.
We foresee three types of annotations:
clinical entities: pathologies, symptoms, procedures, and body parts, according to standard clinical taxonomies (e.g. ICD-10 and SNOMED-CT);
temporal information: events, time expressions and temporal relations, according to the THYME TimeML standard;
factuality: event factuality values and assessment of the effect of negation, uncertainty and hedge expressions on those values.
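To make the three annotation types concrete, here is a minimal sketch of how one annotated sentence might be represented in memory. It is only an illustration: the class names, field names, the example sentence, and the labels ("negated", "BEFORE") are assumptions made for this sketch, not the project's actual schema or value sets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Character offsets of a mention in the source text."""
    start: int
    end: int
    text: str


@dataclass
class ClinicalEntity:
    span: Span
    category: str               # pathology, symptom, procedure or body part
    code: Optional[str] = None  # taxonomy code (e.g. ICD-10); left empty here


@dataclass
class Event:
    span: Span
    factuality: str  # illustrative label, not the project's value set


@dataclass
class TimeExpression:
    span: Span


@dataclass
class TemporalRelation:
    source: Event
    target: Event
    relation: str  # illustrative label


@dataclass
class AnnotatedNarrative:
    text: str
    entities: List[ClinicalEntity] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)
    timexes: List[TimeExpression] = field(default_factory=list)
    relations: List[TemporalRelation] = field(default_factory=list)


def find_span(text: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase and return its offsets."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), phrase)


# Hypothetical example sentence, invented for this sketch.
text = "The patient developed a cough two days before admission, without fever."
cough = Event(find_span(text, "cough"), factuality="factual")
fever = Event(find_span(text, "fever"), factuality="negated")
admission = Event(find_span(text, "admission"), factuality="factual")

doc = AnnotatedNarrative(
    text=text,
    entities=[
        ClinicalEntity(find_span(text, "cough"), category="symptom"),
        ClinicalEntity(find_span(text, "fever"), category="symptom"),
    ],
    events=[cough, fever, admission],
    timexes=[TimeExpression(find_span(text, "two days"))],
    relations=[TemporalRelation(cough, admission, relation="BEFORE")],
)
```

Keeping character offsets alongside each mention lets the three annotation types refer to the same stretch of text without duplicating it.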
E3C is organised into three layers, with different purposes:
The first layer consists of full manual annotations of clinical entities, temporal information and factuality, for benchmarking and linguistic analysis;
The second layer consists of semi-automatic annotations of clinical entities, to be used to train baseline systems;
The third layer consists of non-annotated medical documents (not necessarily clinical narratives) to be exploited by semi-supervised approaches.
The E3C project is organised in the following main activities:
Guideline definition. The definition of the guidelines for the annotation of clinical entities, temporal information, and factuality in the five languages will be based on the harmonisation and extension of existing annotation schemas.
Data collection. We will collect clinical cases for the different languages exploiting publicly available bibliographic repositories and online medical-encyclopedic resources. All the documents will be automatically pre-processed with tokenisation, lemmatisation and part-of-speech tagging.
Data annotation. Temporal information and factuality will be annotated by expert linguistic annotators, while we will benefit from the advice of medical experts for clinical entity annotation. Clinical entities will be annotated according to standard clinical taxonomies (e.g. SNOMED-CT and ICD-10), while the annotation of temporal information and factuality will be based on the THYME TimeML standard.
Quality assessment. The annotation of the first layer will be manual (or manually checked) and we will perform Inter-Annotator Agreement experiments to assess the quality of both the annotations and the annotation guidelines. The second layer will contain semi-automatic annotations, where a subset of the data will be checked manually to assess annotation quality.
Baselines. We will develop baselines for clinical entity, temporal information, and factuality extraction, using the semi-automatic annotations of the second layer as training data. The baselines serve a twofold purpose: they will enable us to show the usefulness of our manual annotations in practice, and they will foster research in the field.
Integration. The multilingual corpus and the baseline information extraction system will be made available to the community through the ELG platform.
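As an illustration of the inter-annotator agreement experiments mentioned under quality assessment, the sketch below computes Cohen's kappa for two annotators who labelled the same items. The project description does not name the agreement metric it will use, so the choice of Cohen's kappa is an assumption, and the function name and the label values are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical factuality labels assigned by two annotators to six events.
annotator_1 = ["factual", "negated", "factual", "uncertain", "factual", "negated"]
annotator_2 = ["factual", "negated", "uncertain", "uncertain", "factual", "factual"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (such as a "factual" value) dominates the data.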