Linguistic information and native speaker ratings for 1,015 chengyu 成语
This dataset provides linguistic information and native speaker ratings for 1,015 chengyu 成语 'Chinese idiomatic expressions'. The data were compiled as part of PhD research (Davey, 2026b) investigating how first-language (L1) Chinese users understand and categorise chengyu.
When using this dataset, please cite:
Davey, J. (2026a). Chengyu dataset: Information and native speaker ratings [Dataset]. OSF. https://doi.org/10.17605/OSF.IO/3TDA7
This dataset was compiled as part of the following research:
Davey, J. (2026b). What makes a chengyu? Native speaker intuitions for categorising Chinese idiomatic expressions [Doctoral dissertation submitted for examination, Australian National University].
When reporting findings based on this dataset, please cite the dataset itself (Davey, 2026a). The thesis citation (Davey, 2026b) provides theoretical and methodological context for the research design and data collection.
When using specific subsets of data, please acknowledge the original sources in addition to citing this dataset.
Source information for all variables is provided in the documentation.
This dataset is made freely available for non-commercial research and educational purposes under a Creative Commons CC-BY-NC 4.0 Licence.
Permanent identifier: https://doi.org/10.17605/OSF.IO/3TDA7
OSF repository: Open Science Framework (OSF)
GitHub repository: https://github.com/chengyu-data/chengyu-dataset
Interactive browser: GitHub Pages
Format: Microsoft Excel (.xlsx) with 67 variables
Documentation: Full variable descriptions, measurement details, source information, and references
The 1,015 expressions in this dataset include the 500 expressions in Jiao et al.'s (2011) annotated frequency dictionary, plus 515 expressions from my PhD research and other published sources. Variable coverage ranges from complete (n=1,015) to focused subsets (n=24–500), reflecting integration of data from multiple independent studies with different research aims.
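Because coverage differs from variable to variable, it can help to count the non-missing values per variable before analysis. The following is a minimal sketch, not part of the dataset's own tooling. It assumes the rows have already been read into a list of dictionaries. The column name `rating_example` and the three sample rows are invented for illustration and do not reflect the dataset's actual variable names or contents.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None, blank strings, and float NaN as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, float):
        return math.isnan(value)
    return False


def variable_coverage(rows):
    """Count non-missing values per variable across an iterable of dict rows."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            # Adding 0 still registers the variable, so empty columns show up as 0.
            counts[name] += 0 if is_missing(value) else 1
    return dict(counts)


# Invented sample rows (hypothetical column names, not real ratings).
sample = [
    {"chengyu": "一心一意", "rating_example": 6.2},
    {"chengyu": "半途而废", "rating_example": None},
    {"chengyu": "画蛇添足", "rating_example": float("nan")},
]

coverage = variable_coverage(sample)
print(coverage)
```

If pandas and openpyxl are installed, rows in this shape can be obtained with `pandas.read_excel(path).to_dict("records")`, where `path` points to the downloaded .xlsx file. Empty cells then arrive as NaN, which `is_missing` handles.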
docs/
  index.html - Project overview (you are here)
  data-browser.html - Interactive searchable datatable
  documentation.html - Documentation with variable descriptions
Chengyu 成语 (literally 'set phrases') are conventionalised four-character expressions that often derive from classical Chinese texts, historical events, or stories. While Chinese dictionaries include tens of thousands of chengyu, native speakers often disagree about which expressions genuinely belong to this culturally salient category.
This PhD research (Davey, 2026b) provides empirical evidence that Chinese language users conceptualise chengyu differently than authoritative dictionaries suggest. Through three complementary studies with over 200 native Chinese speakers, the research reveals:
The findings challenge assumptions that chengyu can be defined by formal linguistic criteria alone, instead revealing them as a culturally negotiated category with flexible but recognisable boundaries.
This dataset supports research in:
Compiled by: Janet Davey
Email: main.gem4761@fastmail.com
Institution: Australian National University
For questions about this dataset or to report errors, please contact main.gem4761@fastmail.com