TESTLINK@IberLEF 2023

TESTLINK at IberLEF 2023 is a relation extraction task based on clinical cases taken from the E3C corpus, i.e. on documents written in Spanish and Basque that report on clinical practice (including, for example, the reasons for a clinical visit, the physical exams undertaken, the assessment of the patient’s diagnosis and subsequent treatments).

The task consists of identifying test results and measurements and linking them to the textual mentions of the laboratory tests and measurements from which they were obtained.

The Codalab competition for TESTLINK is already set up.

Submit your annotated files for evaluation through it.

Organization

Bernardo Magnini - FBK
Begoña Altuna - HiTZ Center - Ixa, UPV/EHU
Alberto Lavelli - FBK
Manuela Speranza - FBK
Roberto Zanoli - FBK
Rodrigo Agerri - HiTZ Center - Ixa, UPV/EHU
José Javier Sáiz - Barcelona Supercomputing Centre
Lidia Salas - University of the Basque Country
Goutham Venkatesh Karunakaran - University of Trento

Important dates

27-02-2023: Development data available
17-04 to 08-05-2023: Evaluation window
19-05-2023: System assessment deadline
02-06-2023: Final reports due
16-06-2023: Final report review deadline
30-06-2023: Camera ready version deadline

For participants

Check the Codalab competition for TESTLINK for submissions.

Task Description

Laboratory tests are a common step in the diagnosis of diseases and disorders and, as part of that process, they are documented in clinical narratives. The main goal of the TESTLINK task is to find test results and measurements and link them to the textual mentions of the laboratory tests and measurements from which they were obtained. To this end, both elements need to be identified in the text and the relation between them then needs to be created.

Laboratory tests and measurements and their results provide useful information on the patient’s status at a given point in the development of the disorder, but they have received little attention in recent years. The task also brings a new data treatment perspective, since the extraction of lab values and vital signs cannot be handled as a named entity recognition task: it requires interpreting numeric values and ranges.

In the following example, the test event is marked in blue and the result is marked in red:

Both the test events and the results are defined as text strings. All test events are single-token strings (i.e. only the syntactic head is considered, following the THYME framework), while results may be represented by multi-token strings (i.e. a whole syntactic chunk). There are no discontinuous spans or nested entities.

All relations are of the type PERTAINS (see the red arrows in the examples) and can be one-to-one, one-to-many and many-to-one.

Data

Training datasets

Spanish

Documents: 81

Tokens: 28,815

Annotated Relations: 597

Source: E3C corpus

Download

Basque

Documents: 90

Tokens: 34,052

Annotated Relations: 1291

Source: E3C corpus

Download

Test datasets

Spanish

Documents: 80

Source: E3C corpus

Download

Basque

Documents: 80

Source: E3C corpus

Download

Data format

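All annotation data are made available to participants in a format closely modelled on the PubTator format. The annotated data consists of a straightforward tab-delimited text file with two kinds of lines:

DOCID|t|TEXT
DOCID	REL	RML_START-RML_END	EVENT_START-EVENT_END	RML_TEXT	EVENT_TEXT

Where: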

Each document in the dataset starts on a new line, and a blank line is used as a document separator.
- DOCID: document id
- t: marker to identify the lines that contain the text of the documents
- TEXT: text of the document
Each annotated relation is on a separate line and is represented as an ordered pair of entity mentions (i.e. RML, EVENT). Each entity mention in the relation is expressed by its start and end character offsets. The mention text span may be included but is not mandatory.
- DOCID: document id
- REL: marker to identify the lines that contain the relations of the given document
- RML_START: start character offset of the RML entity mention in the document
- RML_END: end character offset of the RML entity mention in the document
- EVENT_START: start character offset of the EVENT entity mention in the document
- EVENT_END: end character offset of the EVENT entity mention in the document
- RML_TEXT [optional]: text span of the RML entity mention
- EVENT_TEXT [optional]: text span of the EVENT entity mention

For example:

100001|t|Paciente de 65 a. de edad, que presentaba una elevación progresiva de las cifras de PSA desde 6 ng/ml a 12 ng/ml en el último año. Dicho paciente había sido sometido un año antes a una biopsia transrectal de próstata ecodirigida por sextantes que fue negativa. Se decide, ante la elevación del PSA, realizar una E-RME previa a la 2ª biopsia transrectal, en la que se objetiva una lesión hipointensa que abarca zona central i periférica del ápex del lóbulo D prostático. El estudio espectroscópico de ésta lesión mostró una curva de colina discretamente más elevada que la curva de citrato, con un índice de Ch-Cr/Ci > 0,80, que sugería la presencia de lesión neoplásica, por lo que se biopsia dicha zona por ecografía transrectal. La AP de la biopsia confirmó la presencia de un ADK próstata Gleason 6.
100001	REL	94-101	84-87	6 ng/ml	PSA
100001	REL	104-112	84-87	12 ng/ml	PSA
100001	REL	251-259	185-192	negativa	biopsia
100001	REL	619-623	598-604	0,80	índice
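
As an illustration, the format described above can be read with a few lines of Python. This is only a minimal sketch, not part of the official task tooling: the names `Relation` and `parse_testlink` are hypothetical, and it assumes that fields in REL lines are separated by tabs and that offsets are 0-based with an exclusive end, which is what the example above suggests (characters 94 to 101 of the text are "6 ng/ml").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One REL line: offsets of the result (RML) and of the test (EVENT)."""
    rml_start: int
    rml_end: int
    event_start: int
    event_end: int


def parse_testlink(content):
    """Return {docid: {"text": str, "relations": [Relation, ...]}}."""
    docs = {}
    for line in content.splitlines():
        if not line.strip():
            continue  # blank lines only separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[1] == "t":
            # Text line: DOCID|t|TEXT
            doc = docs.setdefault(head[0], {"text": "", "relations": []})
            doc["text"] = head[2]
            continue
        fields = line.split("\t")  # assumption: tab-separated fields
        if len(fields) >= 4 and fields[1] == "REL":
            rml_start, rml_end = map(int, fields[2].split("-"))
            event_start, event_end = map(int, fields[3].split("-"))
            doc = docs.setdefault(fields[0], {"text": "", "relations": []})
            doc["relations"].append(
                Relation(rml_start, rml_end, event_start, event_end)
            )
    return docs


# Shortened version of the example document (first sentence only).
sample = (
    "100001|t|Paciente de 65 a. de edad, que presentaba una elevación progresiva "
    "de las cifras de PSA desde 6 ng/ml a 12 ng/ml en el último año.\n"
    "100001\tREL\t94-101\t84-87\t6 ng/ml\tPSA\n"
    "100001\tREL\t104-112\t84-87\t12 ng/ml\tPSA\n"
)

docs = parse_testlink(sample)
text = docs["100001"]["text"]
for rel in docs["100001"]["relations"]:
    print(text[rel.rml_start:rel.rml_end], "->", text[rel.event_start:rel.event_end])
# 6 ng/ml -> PSA
# 12 ng/ml -> PSA
```

The optional text spans at the end of each REL line are ignored here, since they can always be recovered from the offsets.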

Data usage considerations

The TESTLINK data has been extracted from the E3C corpus, which is released under a CC-BY-NC-4.0 licence.

There is no data usage restriction for model training. The training data released for the CLinkaRT twin task can be used in this task, as well as any other dataset the participants consider relevant. Nonetheless, all the data used for training must be specified in the system report.

Evaluation

Evaluation scorer

Download

Evaluation procedure

The task has been defined as a relation extraction (RE) task in which the elements taking part in the relation as well as the directionality of the relation are considered. Participating systems are provided with raw text from clinical cases as input and asked to return a list of entity mention pairs for which a relationship exists in the text. For example, given the document reported below:

100001|t|Paciente de 65 a. de edad, que presentaba una elevación progresiva de las cifras de PSA desde 6 ng/ml a 12 ng/ml en el último año. Dicho paciente había sido sometido un año antes a una biopsia transrectal de próstata ecodirigida por sextantes que fue negativa. Se decide, ante la elevación del PSA, realizar una E-RME previa a la 2ª biopsia transrectal, en la que se objetiva una lesión hipointensa que abarca zona central i periférica del ápex del lóbulo D prostático. El estudio espectroscópico de ésta lesión mostró una curva de colina discretamente más elevada que la curva de citrato, con un índice de Ch-Cr/Ci > 0,80, que sugería la presencia de lesión neoplásica, por lo que se biopsia dicha zona por ecografía transrectal. La AP de la biopsia confirmó la presencia de un ADK próstata Gleason 6.

The system's output includes both the input document and the annotated relations, as follows:

100001|t|Paciente de 65 a. de edad, que presentaba una elevación progresiva de las cifras de PSA desde 6 ng/ml a 12 ng/ml en el último año. Dicho paciente había sido sometido un año antes a una biopsia transrectal de próstata ecodirigida por sextantes que fue negativa. Se decide, ante la elevación del PSA, realizar una E-RME previa a la 2ª biopsia transrectal, en la que se objetiva una lesión hipointensa que abarca zona central i periférica del ápex del lóbulo D prostático. El estudio espectroscópico de ésta lesión mostró una curva de colina discretamente más elevada que la curva de citrato, con un índice de Ch-Cr/Ci > 0,80, que sugería la presencia de lesión neoplásica, por lo que se biopsia dicha zona por ecografía transrectal. La AP de la biopsia confirmó la presencia de un ADK próstata Gleason 6.
100001	REL	94-101	84-87	6 ng/ml	PSA
100001	REL	104-112	84-87	12 ng/ml	PSA
100001	REL	251-259	185-192	negativa	biopsia
100001	REL	619-623	598-604	0,80	índice

In the annotated relations, the mention text span (e.g., 6 ng/ml, PSA) can be set but it is not used for evaluation.
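
A system's predictions could be serialized into this output format along the following lines. This is a hedged sketch rather than an official writer: `format_document` is a hypothetical helper, the tab separator follows the "tab-delimited" description of the data format, and offsets are assumed to be 0-based with an exclusive end. The optional text spans are derived from the offsets, which also gives a quick visual check that the offsets point at the intended strings.

```python
def format_document(docid, text, relations):
    """Serialize one document and its predicted relations.

    `relations` is an iterable of (rml_start, rml_end, event_start, event_end)
    tuples, with the result (RML) first and the test (EVENT) second.
    """
    lines = [f"{docid}|t|{text}"]
    for rml_start, rml_end, event_start, event_end in relations:
        lines.append("\t".join([
            docid,
            "REL",
            f"{rml_start}-{rml_end}",
            f"{event_start}-{event_end}",
            text[rml_start:rml_end],      # optional RML_TEXT
            text[event_start:event_end],  # optional EVENT_TEXT
        ]))
    return "\n".join(lines) + "\n"


# Shortened version of the example document (first sentence only).
example_text = (
    "Paciente de 65 a. de edad, que presentaba una elevación progresiva "
    "de las cifras de PSA desde 6 ng/ml a 12 ng/ml en el último año."
)
predicted = [(94, 101, 84, 87), (104, 112, 84, 87)]

output = format_document("100001", example_text, predicted)
print(output)
```

When writing several documents to one file, the blocks returned by `format_document` would be joined with a blank line, which the data format uses as the document separator.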

We measure RE performance by standard Precision, Recall and F1, where a predicted relation is considered correct if the start and end character offsets of the two related entity mentions and the order of the mentions in the relation are all correct. To perform this evaluation, the scorer of the BioCreative V CDR task is used, i.e.

> eval_relation.sh PubTator gold_file prediction_file

Where:

PubTator: input files format

gold_file: contains the gold standard annotations

prediction_file: contains the predicted annotations
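
The matching criterion can be illustrated with a small Python sketch. This is not the official BioCreative V CDR scorer, and how that scorer aggregates counts across documents is not specified here: the function `score` is hypothetical and simply pools all relations into one set, treating each relation as an exact tuple (document id, RML offsets, EVENT offsets), so any offset mismatch or swapped order counts as an error.

```python
def score(gold, predicted):
    """Exact-match precision, recall and F1 over sets of relation tuples.

    Each relation is (docid, rml_start, rml_end, event_start, event_end).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold relations of the example document.
gold = {
    ("100001", 94, 101, 84, 87),
    ("100001", 104, 112, 84, 87),
    ("100001", 251, 259, 185, 192),
    ("100001", 619, 623, 598, 604),
}

# Three exact matches plus one prediction whose RML start offset is off by one.
predicted = {
    ("100001", 94, 101, 84, 87),
    ("100001", 104, 112, 84, 87),
    ("100001", 251, 259, 185, 192),
    ("100001", 620, 623, 598, 604),
}

precision, recall, f1 = score(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
# P=0.75 R=0.75 F1=0.75
```

Because the comparison is exact, the near-miss prediction counts both as a false positive and as a missed gold relation.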

System papers

System description papers should be formatted according to the uniform 1-column CEURART style. LaTeX and Word templates can be found at: https://ceur-ws.org/Vol-XXX/CEURART.zip.

The minimum length of a regular paper should be 5 pages. There is no maximum page limit.

Papers must be written in English.

The copyright year command must be changed to \copyrightyear{2023}.

The conference command must be changed to \conference{IberLEF 2023, September 2023, Jaén, Spain}

Remove any page numbering from the paper, and make sure that there are no headers or footnotes, except the mandatory copyright as a footnote on the first page.

Authors should be described with their name and their full affiliation (university and country). Names must be complete (no initials), e.g. “María García” instead of “M. García”.

Titles of papers should be in emphatic capital English notation, i.e., "Filling an Author Agreement by Autocompletion" rather than "Filling an author agreement by autocompletion".

At least one author of each paper must sign the CEUR copyright agreement. The signed form must be sent along with the paper to the task organizers. Important: it must be physically signed with pen on paper. There are two agreement variants; select the one that fits your case.

AUTHOR-AGREEMENT (NTP): Authors shall use this form if they included no copyrighted third party material in their paper text (or accompanying sources, datasets). This is the right variant in most cases.
AUTHOR-AGREEMENT (TP): Authors shall use this form if they did include copyrighted third party material in their paper or accompanying material. They must then also attach a copy of the permission by the third party to use this material in the signed author agreement!

In the field "Name and year of the event" of the CEUR agreement, write: IberLEF 2023. In the field "Editors of the proceedings (editors)", the following names must appear: Manuel Montes-y-Gómez, Francisco Rangel, Salud María Jiménez-Zafra, Marco Casavantes, Begoña Altuna, Miguel Ángel Álvarez Carmona, Gemma Bel-Enguix, Luis Chiruzzo, Iker de la Iglesia, Hugo Jair Escalante, Miguel Ángel García-Cumbreras, José Antonio García-Díaz, José Ángel Gónzalez Barba, Roberto Labadie Tamayo, Salvador Lima, Pablo Moral, Flor Miriam Plaza del Arco, Rafael Valencia-García.

The reference to cite the overview paper is the following:

APA style: COMING SOON!
Bibtex: COMING SOON!

The deadline to submit the working notes is: 02-06-2023