IAA experiments were conducted by three annotators on a subset of six documents (two short, two medium, and two long), totalling around 2,000 words; results are reported by annotator pair.
For IAA on EVENTs, TIMEX3s, ACTORs, and RMLs, we used f-measure and considered all annotations. For IAA on attributes we used Cohen's kappa and the simple percentage of agreement (note that these values only take into account the span-based annotations identified by both annotators in a pair).
For Clinical entities we also used f-measure and considered all annotations; for IAA on attributes (CUI selection for entityID), on the other hand, we used accuracy (note that, again, the reported values only take into account the span-based annotations identified by both annotators in a pair).
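The two agreement measures used above for span-based annotations and their attributes can be sketched as follows. This is a minimal illustration, not the scoring code used in the study: the span representation (character-offset pairs) and the attribute labels are hypothetical, and attribute agreement is computed only over spans found by both annotators, as described in the text.

```python
from collections import Counter

def pairwise_f1(spans_a, spans_b):
    """F-measure between two annotators' span sets (exact span match)."""
    if not spans_a or not spans_b:
        return 0.0
    tp = len(spans_a & spans_b)          # spans identified by both annotators
    if tp == 0:
        return 0.0
    precision = tp / len(spans_a)
    recall = tp / len(spans_b)
    return 2 * precision * recall / (precision + recall)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa over attribute labels assigned to shared spans."""
    n = len(labels_a)
    # observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected (chance) agreement from each annotator's label distribution
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[l] * count_b[l]
              for l in set(labels_a) | set(labels_b)) / n ** 2
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Hypothetical example: three spans each, two in common
f1 = pairwise_f1({(0, 5), (10, 15), (20, 25)},
                 {(0, 5), (10, 15), (30, 35)})
# Hypothetical attribute labels on four shared spans
kappa = cohens_kappa(["POS", "NEG", "POS", "POS"],
                     ["POS", "NEG", "NEG", "POS"])
```

The percentage of agreement reported alongside kappa is simply `p_o` in the sketch above, i.e. observed agreement without the chance correction.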
IAA for the different types of TLINK and ALINK annotations is reported below in terms of accuracy between annotators (accuracy considers only the relations identified by both annotators).
As far as PERTAIN-TO relations are concerned, IAA has been assessed between two annotators. On average, each annotator identified 111 relations. The resulting Dice's coefficient is 0.87, which is quite high given that annotators are counted as agreeing only when the spans of both the source and the target overlap completely (exact match). More specifically, inter-annotator agreement is particularly high when numerical values are present in the RMLs (reaching 0.92 in terms of Dice's coefficient), while it is slightly lower (Dice=0.84) for RMLs without numerical values.
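The Dice's coefficient used for PERTAIN-TO relations can be sketched as follows. This is an assumed formulation consistent with the exact-match criterion described above: each relation is represented as a (source span, target span) pair, and a relation counts as agreed only when both spans match exactly.

```python
def dice_coefficient(relations_a, relations_b):
    """Dice's coefficient between two annotators' relation sets.

    Each relation is a hypothetical (source_span, target_span) tuple;
    agreement requires an exact match on both spans.
    """
    if not relations_a and not relations_b:
        return 1.0
    agreed = len(relations_a & relations_b)
    return 2 * agreed / (len(relations_a) + len(relations_b))

# Hypothetical example: three relations each, two agreed upon
rels_a = {((0, 4), (10, 14)), ((20, 24), (30, 34)), ((40, 44), (50, 54))}
rels_b = {((0, 4), (10, 14)), ((20, 24), (30, 34)), ((60, 64), (70, 74))}
score = dice_coefficient(rels_a, rels_b)
```

Under this formulation, Dice reduces to the harmonic-mean style 2|A∩B| / (|A| + |B|), so it coincides with the pairwise f-measure when both annotators' relation sets are compared symmetrically.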