Chengyu dataset: Information and native speaker ratings

Linguistic information and native speaker ratings for 1,015 chengyu 成语

This dataset provides linguistic information and native speaker ratings for 1,015 chengyu 成语 'Chinese idiomatic expressions'. The data were compiled as part of PhD research (Davey, 2026b) investigating how first-language (L1) Chinese users understand and categorise chengyu.

Citation and access

How to cite this dataset

When using this dataset, please cite:

Davey, J. (2026a). Chengyu dataset: Information and native speaker ratings [Dataset]. OSF. https://doi.org/10.17605/OSF.IO/3TDA7

Primary research

This dataset was compiled as part of the following research:

Davey, J. (2026b). What makes a chengyu? Native speaker intuitions for categorising Chinese idiomatic expressions [Doctoral dissertation submitted for examination, Australian National University].

When reporting findings based on this dataset, please cite the dataset itself (Davey, 2026a). The thesis citation (Davey, 2026b) provides theoretical and methodological context for the research design and data collection.

Citing data subsets

When using specific subsets of data, please acknowledge the original sources in addition to citing this dataset. For example:

Source information for all variables is provided in the documentation.

Data availability

This dataset is made freely available for non-commercial research and educational purposes under a Creative Commons CC-BY-NC 4.0 Licence.

Permanent identifier: https://doi.org/10.17605/OSF.IO/3TDA7

OSF repository: Open Science Framework (OSF)

GitHub repository: https://github.com/chengyu-data/chengyu-dataset

Interactive browser: GitHub Pages

Format: Microsoft Excel (.xlsx) with 67 variables

Documentation: Full variable descriptions, measurement details, source information, and references

Dataset contents

Original empirical data (Davey, 2026b)

Published descriptive norms

Corpus frequency data

Linguistic information

Coverage

The 1,015 expressions in this dataset include the 500 expressions in Jiao et al.'s (2011) annotated frequency dictionary, plus 515 expressions from my PhD research and other published sources. Variable coverage ranges from complete (n=1,015) to focused subsets (n=24–500), reflecting integration of data from multiple independent studies with different research aims.
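Because coverage differs so much between variables, it can help to check how many expressions have a value for each variable before running an analysis. The sketch below shows one way to do this with pandas. It assumes that missing values appear as empty cells in the spreadsheet. The column names, the toy values, and the file name in the final comment are hypothetical placeholders, not taken from the dataset documentation.

```python
import pandas as pd


def variable_coverage(df: pd.DataFrame) -> pd.Series:
    """Count non-missing values per column, largest count first."""
    return df.notna().sum().sort_values(ascending=False)


# Toy stand-in for the real spreadsheet. The column names and rating
# values are invented for illustration and are not from the dataset.
toy = pd.DataFrame(
    {
        "chengyu": ["一帆风顺", "画蛇添足", "守株待兔"],
        "familiarity_rating": [6.5, None, 5.8],  # rated for a subset only
    }
)

coverage = variable_coverage(toy)
print(coverage)

# With the real file (placeholder path; reading .xlsx requires openpyxl):
# df = pd.read_excel("chengyu_dataset.xlsx")
# print(variable_coverage(df))
```

Variables with a count of 1,015 would then be the fully covered ones, while smaller counts indicate the focused subsets described above.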

Repository contents

Research findings

Chengyu 成语 (literally 'set phrases') are conventionalised four-character expressions that often derive from classical Chinese texts, historical events, or stories. While Chinese dictionaries include tens of thousands of chengyu, native speakers often disagree about which expressions genuinely belong to this culturally salient category.

This PhD research (Davey, 2026b) provides empirical evidence that Chinese language users conceptualise chengyu differently than authoritative dictionaries suggest. Through three complementary studies with over 200 native Chinese speakers, the research reveals:

The findings challenge assumptions that chengyu can be defined by formal linguistic criteria alone, instead revealing them as a culturally negotiated category with flexible but recognisable boundaries.

Intended uses

This dataset supports research in:

Contact

Compiled by: Janet Davey

Email: main.gem4761@fastmail.com

Institution: Australian National University

For questions about this dataset or to report errors, please contact main.gem4761@fastmail.com.