Towards a reliable annotation framework for crisis MT evaluation: Addressing error taxonomies and annotator agreement
Staiano, Maria Carmen; Monti, Johanna; Chiusaroli, Francesca
2025-01-01
Abstract
Accurate human annotations are essential for evaluating the quality of machine translation (MT) outputs, particularly in sensitive contexts such as crisis communication. This paper presents a detailed analysis of the annotation process used in the ITALERT (Italian Emergency Response Text) corpus, specifically designed to evaluate the performance of neural machine translation (NMT) systems and large language models (LLMs) in translating high-stakes messages from Italian to English. The study is guided by the following research questions (RQs): RQ1: Do existing MT error taxonomies adequately reflect the key features of crisis communication? RQ2: Can we design an improved annotation framework that integrates decision-support tools and promotes consistency and reliability in crisis-related MT evaluation? The methodology involved corpus compilation, automatic translation with Google Translate and ChatGPT-4, and human quality assessment through manual annotation. An initial draft of the annotation guidelines was produced to ensure a shared understanding and consistent application of the error categories. Following a preliminary annotation phase, ambiguous cases were collected and resolved through structured discussions. To validate annotation reliability, we measured inter-annotator agreement (IAA) using established metrics from computational linguistics (Artstein and Poesio, 2008): Cohen’s Kappa (Cohen, 1960) for pairwise agreement, and Fleiss’ Kappa (Fleiss, 1971) and Krippendorff’s Alpha (Krippendorff, 2011) for multi-annotator agreement. Results showed strong agreement overall, with slightly higher consistency for Google Translate (Fleiss’ κ = 0.82, Krippendorff’s α = 0.83) than for ChatGPT (Fleiss’ κ = 0.78, Krippendorff’s α = 0.79). Pairwise analysis highlighted variations in agreement across annotators and MT systems, revealing system-specific annotation challenges. These findings emphasize the importance of rigorous annotation procedures and demonstrate that clear guidelines and structured decision-support tools significantly improve inter-annotator reliability. The outcomes offer valuable methodological insights for corpus annotation in crisis contexts, underscoring the need for domain-sensitive training and robust, well-defined annotation protocols.
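For readers who want to reproduce this kind of agreement analysis, the sketch below shows how the three reported coefficients can be computed from a matrix of per-segment error labels. It is a minimal illustration only: the label set, the data layout, and the use of scikit-learn, statsmodels, and the krippendorff package are assumptions for demonstration, not the authors' actual tooling or the ITALERT taxonomy.

```python
# Minimal sketch of the IAA computation described in the abstract,
# assuming one nominal error-category label per segment per annotator
# (the ITALERT data layout and category set are hypothetical here).
import numpy as np
import krippendorff                                   # pip install krippendorff
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical labels: rows = segments, columns = annotators.
labels = np.array([
    ["accuracy", "accuracy",    "accuracy"],
    ["fluency",  "fluency",     "terminology"],
    ["no_error", "no_error",    "no_error"],
    ["accuracy", "terminology", "accuracy"],
])

# Pairwise Cohen's kappa between annotators 1 and 2.
kappa_12 = cohen_kappa_score(labels[:, 0], labels[:, 1])

# Fleiss' kappa over all annotators: first convert raw labels into a
# segments x categories count table.
counts, _ = aggregate_raters(labels)
fleiss = fleiss_kappa(counts, method="fleiss")

# Krippendorff's alpha for nominal data: the library expects annotators
# as rows, so transpose; encode categories as integers to be safe.
categories = {c: i for i, c in enumerate(np.unique(labels))}
coded = np.vectorize(categories.get)(labels)
alpha = krippendorff.alpha(reliability_data=coded.T,
                           level_of_measurement="nominal")

print(f"Cohen's kappa (A1 vs A2): {kappa_12:.2f}")
print(f"Fleiss' kappa:            {fleiss:.2f}")
print(f"Krippendorff's alpha:     {alpha:.2f}")
```

Note that the level of measurement matters for Krippendorff's alpha: an ordinal or interval metric would weight disagreements differently, so the choice should match how the error taxonomy is defined.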
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| CL2025 Book Of Abstracts_24th June.pdf | Open access | Other attached material (e.g. cover, table of contents, supplementary material, abstract, spin-off/start-up patents, etc.) | Creative Commons | 5.31 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


