Cross-lingual Textual Entailment (CLTE) Corpus
The Cross-lingual Textual Entailment (CLTE) Corpus is the result of a joint effort between FBK and CELCT.
Four CLTE corpora are available:
- Spanish/English (SPA-ENG)
- German/English (DEU-ENG)
- Italian/English (ITA-ENG)
- French/English (FRA-ENG)
Each cross-lingual dataset consists of 1,000 CLTE pairs (500 for training and 500 for testing), balanced across the following four entailment judgments:
- Bidirectional (T1 --> T2 & T1 <-- T2): the two fragments entail each other (semantic equivalence)
- Forward (T1 --> T2 & T1 !<-- T2): unidirectional entailment from T1 to T2
- Backward (T1 !--> T2 & T1 <-- T2): unidirectional entailment from T2 to T1
- No Entailment (T1 !--> T2 & T1 !<-- T2): there is no entailment between T1 and T2
Both T1 and T2 are assumed to be TRUE statements; hence in the dataset there are no contradictory pairs.
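Each of the four judgments is a combination of two independent directional decisions (does T1 entail T2, and does T2 entail T1). The following is a minimal Python sketch of that mapping, not part of the corpus distribution. The names Judgment and combine, and the label strings, are hypothetical and chosen only for illustration; the label strings used in the corpus files may differ.

```python
from enum import Enum


class Judgment(Enum):
    # Label strings are illustrative assumptions, not the corpus's own values.
    BIDIRECTIONAL = "bidirectional"
    FORWARD = "forward"
    BACKWARD = "backward"
    NO_ENTAILMENT = "no_entailment"


def combine(t1_entails_t2: bool, t2_entails_t1: bool) -> Judgment:
    """Map two directional entailment decisions to one of the four judgments."""
    if t1_entails_t2 and t2_entails_t1:
        return Judgment.BIDIRECTIONAL  # the fragments entail each other
    if t1_entails_t2:
        return Judgment.FORWARD        # entailment from T1 to T2 only
    if t2_entails_t1:
        return Judgment.BACKWARD       # entailment from T2 to T1 only
    return Judgment.NO_ENTAILMENT      # neither direction holds
```

Under this view, a system could train two binary directional classifiers and combine their outputs, or treat the task directly as four-way classification; the corpus description does not prescribe either approach.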
Additionally, a monolingual English corpus was created as a by-product of the data collection methodology, consisting of 1,000 TE pairs balanced with respect to the four entailment judgments.
The CLTE corpus was created and used in the SemEval 2012 Cross-lingual Textual Entailment for Content Synchronization (CLTE) task; it also serves as the training set for the CLTE task at SemEval 2013.
How to obtain it
The corpus is freely available for research purposes upon acceptance of a license agreement.
Contacts: Alessandro Marchetti, amarchetti[at]celct.it; Luisa Bentivogli, bentivo[at]fbk.it