English dataset

We provide two datasets of naturally occurring examples of clausal embedding in English and an automated extraction tool. Please see the associated paper (Carslaw et al. 2025) for details of the methods and specific use cases.

Golden Embedded Clause Set (GECS) [Link to the dataset]: A dataset of clausal embedding examples extracted from the Dolma corpus (Soldaini et al. 2023) with fine-grained gold-standard annotation. The dataset contains 147 declarative embedded clauses, 138 polar interrogative embedded clauses, 84 alternative interrogative embedded clauses, and 158 constituent interrogative embedded clauses. In addition, we provide a set of 111 adversarial examples.

Large-scale dataset [Link to the dataset]: A large-scale set of English embedded clauses extracted from the Dolma subset v1_6-sample. It currently contains 28,968,073 cases detected by our automated extraction tool.

Extraction tool [Link]: An automated extraction tool based on a constituency parser and additional heuristics. The tool performs the following tasks:

  • Detection: detecting embedded clause(s) in a sentence
  • Predicate Identification: identifying each embedding predicate
  • Clause Identification: identifying the span of each embedded clause
  • Typing: identifying the type of each embedded clause
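
To make the four tasks concrete, the sketch below shows one way the result for a single sentence could be represented. It is an illustration only: the class names, field names, and type labels (EmbeddedClause, SentenceResult, CLAUSE_TYPES) are assumptions made for this example and do not describe the tool's actual output format, which is documented in the tool's repository and the associated paper.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for the four clause types mentioned in this README.
CLAUSE_TYPES = ("declarative", "polar", "alternative", "constituent")


@dataclass
class EmbeddedClause:
    """One embedded clause found in a sentence (hypothetical schema)."""
    predicate: str    # Predicate Identification: the embedding predicate
    clause: str       # Clause Identification: text span of the embedded clause
    clause_type: str  # Typing: one of CLAUSE_TYPES

    def __post_init__(self) -> None:
        # Reject labels outside the assumed inventory.
        if self.clause_type not in CLAUSE_TYPES:
            raise ValueError(f"unknown clause type: {self.clause_type!r}")


@dataclass
class SentenceResult:
    """All embedded clauses found in one sentence (hypothetical schema)."""
    sentence: str
    clauses: List[EmbeddedClause]

    @property
    def has_embedding(self) -> bool:
        # Detection: does the sentence contain at least one embedded clause?
        return bool(self.clauses)


# Hand-written example; not actual output from the extraction tool.
example = SentenceResult(
    sentence="I wonder whether she left.",
    clauses=[
        EmbeddedClause(
            predicate="wonder",
            clause="whether she left",
            clause_type="polar",
        )
    ],
)
```

A sentence may contain several embedded clauses, which is why the sketch stores a list of clauses per sentence rather than a single record.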

How to cite: Carslaw, Iona, Sivan Milton, Nicolas Navarre, Ciyang Qing & Wataru Uegaki. 2025. Automatic extraction of clausal embedding based on large-scale English text data. Proceedings of The Society for Computation in Linguistics. https://doi.org/10.48550/arXiv.2506.14064