Chengyu Dataset - Native Speaker Ratings

This dataset provides linguistic information and native speaker ratings for 1,015 chengyu 成语 'Chinese idiomatic expressions'. The data were compiled as part of PhD research (Davey, 2026b) investigating how first-language (L1) Chinese users understand and categorise chengyu.

Citation and access

How to cite this dataset

When using this dataset, please cite:

Davey, J. (2026a). Chengyu dataset: Information and native speaker ratings [Dataset]. OSF. https://doi.org/10.17605/OSF.IO/3TDA7

Primary research

This dataset was compiled as part of the following research:

Davey, J. (2026b). What makes a chengyu? Native speaker intuitions for categorising Chinese idiomatic expressions [Doctoral dissertation submitted for examination, Australian National University].

When reporting findings based on this dataset, please cite the dataset itself (Davey, 2026a). The thesis citation (Davey, 2026b) provides theoretical and methodological context for the research design and data collection.

Citing data subsets

When using specific subsets of data, please acknowledge the original sources in addition to citing this dataset. For example:

Original empirical data: Cite this dataset (Davey, 2026a) and Davey (2026b)
Published descriptive norms: Cite this dataset (Davey, 2026a) and the relevant original publication (Li et al., 2016; Zhang & Ji, 2016; Zheng, 2019; or Zheng et al., 2022)
Linguistic information: Cite this dataset (Davey, 2026a) and the relevant dictionary source (Jiao et al., 2011; Jnqz, 2025; or Zdic, 2025)

Source information for all variables is provided in the documentation.

Data availability

This dataset is made freely available for non-commercial research and educational purposes under a Creative Commons CC-BY-NC 4.0 Licence.

Permanent identifier: https://doi.org/10.17605/OSF.IO/3TDA7

OSF repository: Open Science Framework (OSF)

GitHub repository: https://github.com/chengyu-data/chengyu-dataset

Interactive browser: GitHub Pages

Format: Microsoft Excel (.xlsx) with 67 variables

Documentation: Full variable descriptions, measurement details, source information, and references

Dataset contents

Original empirical data (Davey, 2026b)

Native speaker ratings (n=196 respondents): Recognition, chengyu acceptability, compositionality, and character engagement ratings for 500 common expressions
Experimental acceptability judgements (n=59 participants): Timed binary chengyu acceptability decisions with reaction times for 96 expressions

Published descriptive norms

Psycholinguistic ratings from Li et al. (2016), Zhang & Ji (2016), Zheng (2019), and Zheng et al. (2022)
Familiarity, meaningfulness, compositionality, literality, predictability, and L2 knowledge ratings

Corpus frequency data

Raw and normalised frequencies from four major Chinese corpora:

BCC (9.5 billion characters)
CCL (4.75 billion characters)
zhTenTen17 (16.59 billion characters)
GigaWord 2 (250 million characters)
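
Because these corpora differ greatly in size, raw counts are only comparable once they are put on a common scale. The following is a minimal sketch of one common convention, occurrences per million characters, using the corpus sizes listed above. The per-million convention, the function name `per_million`, and the example count are assumptions for illustration only; refer to the documentation for how the normalised frequencies in this dataset are actually defined.

```python
# Corpus sizes in characters, as listed above.
CORPUS_SIZES = {
    "BCC": 9_500_000_000,
    "CCL": 4_750_000_000,
    "zhTenTen17": 16_590_000_000,
    "GigaWord 2": 250_000_000,
}


def per_million(raw_count: int, corpus: str) -> float:
    """Convert a raw count to occurrences per million characters.

    Per-million scaling is an assumed convention, not necessarily
    the normalisation used in the dataset itself.
    """
    return raw_count * 1_000_000 / CORPUS_SIZES[corpus]


# Hypothetical raw count, not taken from the dataset.
print(per_million(19_000, "BCC"))  # 2.0
```

Under this convention, a count of 19,000 in BCC and a count of 500 in GigaWord 2 both correspond to 2 occurrences per million characters, which is why normalised rather than raw values are the ones to compare across corpora.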

Linguistic information

Semantic information: Chinese explanations, English translations, synonyms, antonyms, sentiment
Structural features: Syntactic structure, structural symmetry, xingshi 形式 patterns (ABCD, AABB, etc.)
Etymology: Historical origins, dynastic periods, source texts
Character-level data: Individual character frequency ranks, stroke counts

Coverage

The 1,015 expressions in this dataset include the 500 expressions in Jiao et al.'s (2011) annotated frequency dictionary, plus 515 expressions from my PhD research and other published sources. Variable coverage ranges from complete (n=1,015) to focused subsets (n=24–500), reflecting the integration of data from multiple independent studies with different research aims.

Repository contents

docs/

index.html - Project overview (you are here)
data-browser.html - Interactive searchable data table
documentation.html - Documentation with variable descriptions

Research findings

Chengyu 成语 (literally 'set phrases') are conventionalised four-character expressions that often derive from classical Chinese texts, historical events, or stories. While Chinese dictionaries include tens of thousands of chengyu, native speakers often disagree about which expressions genuinely belong to this culturally salient category.

This PhD research (Davey, 2026b) provides empirical evidence that Chinese language users conceptualise chengyu differently than authoritative dictionaries suggest. Through three complementary studies with over 200 native Chinese speakers, the research reveals:

Dictionary classifications don't predict native speaker judgements: Dictionary-listed chengyu were rejected 40% of the time
Semantic opacity matters: Expressions with transparent, compositional meanings were less likely to be accepted as genuine chengyu
Structural irregularity is valued: Regular patterns like AABB reduplication (e.g., fengfengyuyu 风风雨雨) were consistently rejected as chengyu, while irregular ABCD patterns were preferred
Individual variation is substantial: Educational background and Chinese language and cultural knowledge significantly influenced the perceived scope of the chengyu category
Intuitive recognition is fast: A mean reaction time of 1,133 ms suggests that chengyu categorisation is predominantly intuitive rather than analytical

The findings challenge assumptions that chengyu can be defined by formal linguistic criteria alone, instead revealing them as a culturally negotiated category with flexible but recognisable boundaries.

Intended uses

This dataset supports research in:

Linguistics: Phraseology, categorisation, semantic opacity, multiword expressions
Psycholinguistics: Native speaker intuitions, processing, acceptability judgements
Chinese pedagogy: Evidence-based approaches to teaching chengyu
Corpus linguistics: Frequency patterns, usage analysis
Computational linguistics: Chengyu identification, semantic analysis

Contact

Compiled by: Janet Davey

Email: main.gem4761@fastmail.com

Institution: Australian National University

For questions about this dataset or to report errors, please contact main.gem4761@fastmail.com