Documentation - Chengyu Dataset

Chengyu dataset: Information and native speaker ratings

Overview and citation information

This dataset collates information about 1,015 chengyu 成语 'Chinese idiomatic expressions', including expression-level information and native speaker ratings. The dataset was compiled as part of my PhD research (Davey, 2026b) investigating how first-language (L1) Chinese language users understand chengyu as a category.

Data is collated from multiple sources:

Original native speaker ratings from 196 L1 Chinese respondents: RC 2 in Davey (2026b)
Original experimental data from 59 L1 Chinese participants: RC 3 in Davey (2026b)
Published descriptive norms from Li et al. (2016), Zhang & Ji (2016), Zheng (2019), and Zheng et al. (2022)
Frequency data from four major Chinese corpora (BCC, CCL, zhTenTen17, GigaWord 2)
Structural and semantic information from dictionaries, including Jiao et al. (2011), 成语词典 (Jnqz, 2025) and 汉典 (Zdic, 2025)

How to cite this dataset

When using this dataset, please cite:

Davey, J. (2026a). Chengyu dataset: Information and native speaker ratings [Dataset]. OSF. https://doi.org/10.17605/OSF.IO/3TDA7

Primary research

This dataset was compiled as part of the following research:

Davey, J. (2026b). What makes a chengyu? Native speaker intuitions for categorising Chinese idiomatic expressions [Doctoral dissertation submitted for examination, Australian National University].

When reporting findings based on this dataset, please cite the dataset itself (above). The thesis citation provides theoretical and methodological context for the research design and data collection.

Citing data subsets

When using specific subsets of data from this dataset, please acknowledge the original sources in addition to citing this dataset. For example:

Original empirical data: Cite this dataset (Davey, 2026a) and Davey (2026b)
Published descriptive norms: Cite this dataset (Davey, 2026a) and the relevant original publication (Li et al., 2016; Zhang & Ji, 2016; Zheng, 2019; or Zheng et al., 2022)
Linguistic information: Cite this dataset (Davey, 2026a) and the relevant dictionary source (Jiao et al., 2011; Jnqz, 2025; or Zdic, 2025)

Source information for all variables is provided below.

Data availability : This dataset is freely available for non-commercial research and educational purposes under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licence.

Coverage : The dataset contains information about 1,015 chengyu, including the 500 expressions in Jiao et al. (2011) plus another 515 expressions from my PhD research (Davey, 2026b) and other published sources. Some measures are available for all expressions, while others cover specific subsets as detailed below.

Notes on partial data coverage

The dataset includes variables with varying coverage across the 1,015 expressions. This reflects the integration of data from multiple independent studies with different research aims and sampling strategies. Partial coverage does not indicate missing data but rather the scope of each original study.

Understanding coverage patterns

Complete coverage (n=1,015) : Core identifiers, frequency measures from four corpora, xingshi patterns, constituent characters and their frequency ranks
Near complete coverage (n ≥ 976) : Jnqz dictionary information (explanations, etymology, synonyms, antonyms, examples, usage notes, syntactic structure, sentiment, commonality indicator)
Substantial coverage (n=500) : RC 2 ratings, Jiao et al. (2011) information
Published norms (n=146-350) : Zhang & Ji (2016), Zheng (2019), Li et al. (2016) measures
Focused subsets (n=24-48) : RC 3 experimental stimuli, Zheng et al. (2022) measures

Value of partial data

Partial data serves multiple purposes:

Enrichment : Even when available for only some expressions, additional measures provide valuable information for those items (e.g., etymology, example sentences, L2 familiarity)
Triangulation : Multiple operationalisations of similar constructs (e.g., compositionality from RC 2, Zheng 2019, and Zhang & Ji 2016) allow comparison across methods and populations
Gap identification : Absent measures for certain expressions highlight opportunities for future research
Methodological diversity : Different measurement approaches (rating scales, completion tasks, binary judgements, reaction times) provide converging evidence through complementary methodologies

Using partial data

When using this dataset:

Check coverage for your variables of interest
Document which subset you analyse
Consider whether missing data is random or systematic (e.g., RC 3 stimuli were selected to vary on specific dimensions)
Cite both this dataset and the original sources for any measures you use
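
The coverage check above can be sketched as follows. This is a minimal illustration only: it assumes the dataset has been exported to CSV with the variable names documented below and that unavailable measures appear as empty cells. The helper `coverage` and the three sample rows are invented for the example and are not part of the dataset.

```python
import csv
import io

def coverage(rows, variables):
    """Count the rows that have a non-empty value for each variable."""
    return {
        var: sum(1 for row in rows if (row.get(var) or "").strip())
        for var in variables
    }

# Invented placeholder rows standing in for the real file.
sample = io.StringIO(
    "expression,recognition_mean_rc2,familiarity_zheng2019\n"
    "expr_1,4.5,\n"
    "expr_2,,3.9\n"
    "expr_3,4.8,4.1\n"
)
rows = list(csv.DictReader(sample))
counts = coverage(rows, ["recognition_mean_rc2", "familiarity_zheng2019"])

# Rows usable for an analysis that needs both measures at once.
both = [
    r for r in rows
    if r["recognition_mean_rc2"] and r["familiarity_zheng2019"]
]
print(counts, len(both))
```

Reporting both the per-variable counts and the size of the overlapping subset makes it clear which expressions an analysis actually covers.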

Version information

Dataset version : 1.0

Last updated : 19 January 2026

Compiled by : Janet Davey

Contact : main.gem4761@fastmail.com

For questions about this dataset or to report errors, please contact main.gem4761@fastmail.com.

Expression information variables

Core identifiers

Expression

Variable name : expression

Source : Multiple

Definition : The chengyu in simplified Chinese characters

Coverage : n=1,015

Pinyin

Variable name : pinyin

Source : Multiple

Definition : Romanisation of the expression using Hanyu Pinyin

Coverage : n=1,015

Semantic information

Explanation (Zdic)

Variable name : explanation_zdic

Source : Zdic 汉典 (Zdic, 2025)

Definition : Explanation of the expression's meaning in Chinese

Coverage : n=1,015

Explanation (Jnqz)

Variable name : explanation_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Explanation of the expression's meaning in Chinese

Coverage : n=1,008

Translation (Jiao et al.)

Variable name : translation_jiao

Source : Jiao et al. (2011)

Definition : English translation

Coverage : n=500

Translation (Pleco)

Variable name : translation_pleco

Source : Pleco Chinese Dictionary

Definition : English translation or gloss

Coverage : n=1,015

Near synonyms (Zdic)

Variable name : near_synonyms_zdic

Source : Zdic 汉典 (Zdic, 2025)

Definition : Expressions with similar meanings

Coverage : n=881

Near synonyms (Jiao et al.)

Variable name : near_synonyms_jiao

Source : Jiao et al. (2011)

Definition : Expressions with similar meanings

Coverage : n=461

Near synonyms (Jnqz)

Variable name : near_synonyms_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Expressions with similar meanings

Coverage : n=998

Antonyms (Jnqz)

Variable name : antonyms_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Expressions with opposite meanings

Coverage : n=976

Sentiment (Jnqz)

Variable name : sentiment_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Emotional colouring or sentiment of the expression

Categories : 褒义成语 (commendatory/positive), 中性成语 (neutral), 贬义成语 (derogatory/negative)

Coverage : n=1,004

Structural and syntactic information

Syntactic structure (Zdic)

Variable name : syntactic_structure_zdic

Source : Zdic 汉典 (Zdic, 2025)

Definition : Syntactic structure classification using Chinese grammatical categories

Categories : 联合式 (coordinate structure), 主谓式 (subject-predicate structure), 偏正式 (modifier-modified structure), 动宾式 (verb-object structure), 连动式 (serial verb structure), 补充式 (complement structure), 紧缩式 (abbreviated structure), 复句式 (compound structure), 复杂式 (complex structure)

Coverage : n=901

Syntactic structure (Jnqz)

Variable name : syntactic_structure_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Syntactic structure classification using Chinese grammatical categories

Coverage : n=1,005

Note : The dataset includes syntactic structure classifications from both 成语词典 (Jnqz, 2025) and Zdic 汉典 (Zdic, 2025) because the categorisation of chengyu syntactic structure is not standardised across reference works. Where the sources disagree, users can choose the classification that best suits their own analytical framework.

Structure (Li et al. 2016)

Variable name : structure_li2016

Source : Li et al. (2016)

Definition : Syntactic structure classification following Li et al.'s (2016) categories

Categories : VO (verb-object), SM (structure of modification), SV (subject-predicate), VV (verb-verb), VOVO, SMSM, SVSV (indicating both pairs of characters follow the same pattern)

Coverage : n=350

Structural symmetry

Variable name : symmetric_structure

Source : Coded by researcher following Liu and Xing (2000) and Liu and Cheung (2014)

Definition : Whether the expression exhibits structural symmetry (duichenxing 对称性)

Categories : Binary (symmetric vs. asymmetric)

Note : Categorised as structurally symmetric when the first two characters and last two characters have identical syntactic constructions with corresponding word classes. This structural symmetry criterion is broader than syntactic parallelism (binglieshi 并列式) alone, encompassing expressions with various semantic relationships (temporal, causal, or coordinative) provided the two halves are equally weighted.

Coverage : n=1,015

Xingshi pattern

Variable name : xingshi

Source : 成语词典 (Jnqz, 2025)

Definition : Structural pattern (xingshi 形式) based on character repetition

Categories : ABCD, AABB, AABC, ABAC, ABBC, ABCA, ABCB, ABCC

Coverage : n=1,015
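
The xingshi values in this dataset come from 成语词典 (Jnqz, 2025), but a pattern of this kind can be reproduced from the characters alone. The sketch below is one possible way to do so, not the source's method: the function name `xingshi_pattern` is hypothetical, and the example expressions are common chengyu chosen for illustration that are not necessarily in the dataset.

```python
def xingshi_pattern(expression: str) -> str:
    """Label each character A, B, C, ... in order of first appearance."""
    labels = {}
    pattern = []
    for char in expression:
        if char not in labels:
            # The next unused letter: A for the first new character, then B, ...
            labels[char] = chr(ord("A") + len(labels))
        pattern.append(labels[char])
    return "".join(pattern)

for example in ["一心一意", "欣欣向荣", "不了了之", "得意洋洋"]:
    print(example, xingshi_pattern(example))
```

A check of this kind can be used to confirm that the xingshi column agrees with the expression column after any manual editing.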

Usage (Zdic)

Variable name : usage_zdic

Source : Zdic 汉典 (Zdic, 2025)

Definition : Usage guidance including grammatical function and pragmatic information about the expression

Coverage : n=901

Usage (Jiao et al.)

Variable name : usage_jiao

Source : Jiao et al. (2011)

Definition : Usage guidance including grammatical function and pragmatic information about the expression

Coverage : n=500

Usage (Jnqz)

Variable name : usage_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Usage guidance including grammatical function and pragmatic information about the expression

Coverage : n=1,004

Examples (Zdic)

Variable name : examples_zdic

Source : Zdic 汉典 (Zdic, 2025)

Definition : Example sentences demonstrating usage

Coverage : n=852

Examples (Jnqz)

Variable name : examples_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Example sentences demonstrating usage

Coverage : n=1,005

Etymology

Etymology (Zdic)

Variable name : etymology_zdic

Source : Zdic 汉典 (Zdic, 2025)

Definition : Historical origin or etymological information about the expression, including source texts where applicable

Coverage : n=947

Etymology (Jnqz)

Variable name : etymology_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Historical origin or etymological information about the expression, including source texts where applicable

Coverage : n=1,005

Dynasty

Variable name : dynasty

Source : Coded by researcher based on etymology

Definition : Dynastic period in which the expression is thought to have originated. Where etymology_zdic and etymology_jnqz conflict, the earliest attested period is used. Expressions from Buddhist texts (e.g., sutras) are coded as 'Buddhist', and those from foreign texts (e.g., Aesop's Fables, Arabian Nights) are coded as 'Foreign'. Left blank if unknown.

Coverage : n=1,005

Historicity

Variable name : historicity

Source : Coded by researcher based on dynasty and etymology

Definition : Categorisation of expressions into historical eras

Categories : Ancient (–221 BCE; pre-Qin), Early Imperial (221 BCE–589 CE; Qin, Han, Three Kingdoms, Jin, Northern and Southern Dynasties), Middle Imperial (589–1368 CE; Sui, Tang, Five Dynasties and Ten Kingdoms, Song), Late Imperial (1368–1912 CE; Yuan, Ming, Qing), Modern (1912– )

Coverage : n=1,005

Frequency measures

BCC raw frequency

Variable name : freq_bcc

Source : BCC (Beijing Language and Culture University Chinese Corpus)

Definition : Raw token frequency in BCC

Corpus details : 9.5 billion characters; news, social media, literature, technology, blog (compiled 2019)

Note : Frequency data extracted from BCC Global wordlist (combined subcorpora)

Reference : Xun et al. (2015, 2016)

Coverage : n=1,015

BCC frequency per million characters

Variable name : freq_bcc_per_million

Source : Calculated from freq_bcc

Definition : Normalised frequency per million characters in BCC

Calculation : (freq_bcc / 9,500,000,000) × 1,000,000

Coverage : n=1,015
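
The per-million normalisation used for all four corpora can be sketched as below, using the corpus sizes given in the Calculation fields of this documentation. The function name `per_million`, the dictionary keys and the raw count of 19,000 are illustrative assumptions and do not correspond to any expression in the dataset.

```python
# Corpus sizes as stated in the Calculation fields of this documentation.
CORPUS_SIZES = {
    "bcc": 9_500_000_000,
    "ccl": 4_746_907_429,
    "zhtenten": 16_593_146_196,
    "gigaword2": 250_124_230,
}

def per_million(raw_freq: int, corpus: str) -> float:
    """Normalise a raw frequency to occurrences per million units of corpus."""
    return (raw_freq / CORPUS_SIZES[corpus]) * 1_000_000

# Hypothetical raw count: 19,000 occurrences in BCC is about 2 per million.
example = per_million(19_000, "bcc")
print(round(example, 3))
```

Because the corpora differ in size by almost two orders of magnitude, the same raw count gives very different normalised values in each corpus.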

BCC frequency level

Variable name : freq_level_bcc

Source : Calculated from freq_bcc

Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level

Coverage : n=1,015

CCL raw frequency

Variable name : freq_ccl

Source : CCL (Peking University Center for Chinese Linguistics)

Definition : Raw token frequency in the CCL modern Chinese corpus

Corpus details : 4.75 billion characters (modern Chinese portion of 2024 version); news, internet texts, literature, biographies, TV/movies, translations

Note : Frequency data obtained through batch query of CCL corpus interface

Reference : Zhan et al. (2019)

Coverage : n=1,015

CCL frequency per million characters

Variable name : freq_ccl_per_million

Source : Calculated from freq_ccl

Definition : Normalised frequency per million characters in CCL modern Chinese

Calculation : (freq_ccl / 4,746,907,429) × 1,000,000

Coverage : n=1,015

CCL frequency level

Variable name : freq_level_ccl

Source : Calculated from freq_ccl

Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level

Coverage : n=1,015

zhTenTen17 raw frequency

Variable name : freq_zhtenten

Source : zhTenTen17 (Sketch Engine, 2025b)

Definition : Raw token frequency in the zhTenTen17 corpus

Corpus details : 16.59 billion characters; simplified Chinese web texts covering diverse genres (compiled August and November-December 2017)

Note : Frequency data obtained through Sketch Engine batch query

Coverage : n=1,015

zhTenTen17 frequency per million characters

Variable name : freq_zhtenten_per_million

Source : Calculated from freq_zhtenten

Definition : Normalised frequency per million characters in zhTenTen17

Calculation : (freq_zhtenten / 16,593,146,196) × 1,000,000

Coverage : n=1,015

zhTenTen17 frequency level

Variable name : freq_level_zhtenten

Source : Calculated from freq_zhtenten

Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level

Coverage : n=1,015

GigaWord2 raw frequency

Variable name : freq_gigaword2

Source : GigaWord 2 (Sketch Engine, 2025a)

Definition : Raw token frequency in the GigaWord 2 corpus

Corpus details : 250.12 million characters; newswire texts (compiled 2005)

Note : Frequency data obtained through Sketch Engine batch query

Coverage : n=1,015

GigaWord2 frequency per million characters

Variable name : freq_gigaword2_per_million

Source : Calculated from freq_gigaword2

Definition : Normalised frequency per million characters in GigaWord 2

Calculation : (freq_gigaword2 / 250,124,230) × 1,000,000

Coverage : n=1,015

GigaWord2 frequency level

Variable name : freq_level_gigaword2

Source : Calculated from freq_gigaword2

Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level

Coverage : n=1,015

Overall frequency level

Variable name : freq_level

Source : Coded by researcher based on composite corpus frequencies

Definition : Tripartite classification of expressions as high, medium or low frequency based on overall occurrence across four Chinese corpora: BCC, CCL, zhTenTen17, and GigaWord2

Measurement : For each corpus, tertile thresholds (33rd and 66th percentiles) were calculated using the normalised per-million frequencies of all expressions in the dataset. The tertile values were: BCC (33rd = 1.020, 66th = 2.935), CCL (33rd = 0.254, 66th = 0.791), zhTenTen17 (33rd = 0.255, 66th = 1.064), and GigaWord2 (33rd = 0.088, 66th = 0.793). Each expression was then classified as high, medium, or low frequency within each individual corpus based on these tertiles. The final freq_level value is the modal category across all four corpora (i.e., whichever frequency level occurred most often across the four classifications). Ties were resolved conservatively, always defaulting to medium: this applies to ties between high and medium, ties between medium and low, and ties between high and low (with no medium classifications).

Note : This composite measure aims to account for substantial variation in chengyu prevalence and text types across different corpora while providing an interpretable summary measure. The three-level classification reflects the fact that this dataset is biased towards more recognisable and learner-relevant chengyu suitable for research; a binary high/low classification risked mislabelling genuinely common expressions as 'low frequency' simply because they fell below the median of an already relatively high-frequency set. The tertile-based approach ensures that 'low frequency' is reserved for expressions that are genuinely rare within this collection.

Coverage : n=1,015
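
The classification procedure described under Measurement can be sketched as follows, using the tertile values reported there. This is an illustration under stated assumptions rather than the script used to build the dataset: the function names are hypothetical, the input values are invented, and the treatment of values that fall exactly on a threshold (assigned here to the lower band) is an assumption, since the documentation does not specify it.

```python
from collections import Counter

# Tertile thresholds (33rd, 66th percentiles) as reported under Measurement.
THRESHOLDS = {
    "bcc": (1.020, 2.935),
    "ccl": (0.254, 0.791),
    "zhtenten": (0.255, 1.064),
    "gigaword2": (0.088, 0.793),
}

def level_in_corpus(per_million: float, corpus: str) -> str:
    """Classify one per-million frequency within a single corpus."""
    low_cut, high_cut = THRESHOLDS[corpus]
    if per_million <= low_cut:   # assumption: boundary values go to the lower band
        return "low"
    if per_million <= high_cut:
        return "medium"
    return "high"

def overall_level(per_million_by_corpus: dict) -> str:
    """Return the modal level across corpora; any tie defaults to medium."""
    levels = [level_in_corpus(v, c) for c, v in per_million_by_corpus.items()]
    counts = Counter(levels)
    top = max(counts.values())
    modal = [level for level, n in counts.items() if n == top]
    return modal[0] if len(modal) == 1 else "medium"

# Invented values: high in three corpora and medium in one.
print(overall_level({"bcc": 5.0, "ccl": 1.0, "zhtenten": 2.0, "gigaword2": 0.5}))
```

With four corpora, the only possible tie is a two-against-two split, so a single fallback to medium covers every tie case listed above.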

Commonality (Jnqz)

Variable name : commonality_jnqz

Source : 成语词典 (Jnqz, 2025)

Definition : Commonality or frequency of use indicator

Categories : 常用成语 (common chengyu), 一般成语 (general/ordinary chengyu)

Coverage : n=1,000

Frequency log10 (Zheng et al. 2022)

Variable name : freq_log10_zheng2022

Source : Zheng et al. (2022)

Definition : Log₁₀-transformed page count from Baidu.com

Measurement : "We adopted Libben and Titone's (2008) method using the most popular Chinese website search engine [Baidu] as the dataset, and we employed the log-transformed page count to represent the whole-form frequency" (Zheng et al., 2022, pp. 5-6)

Coverage : n=24

Character-level information

Character 1, 2, 3, 4

Variable name : char1, char2, char3, char4

Source : Derived from expression

Definition : The four individual characters comprising the expression

Coverage : n=1,015

Character frequency ranks

Variable name : char1_freq_rank, char2_freq_rank, char3_freq_rank, char4_freq_rank

Source : Da (2004)

Definition : Frequency rank of each constituent character based on Jun Da's character frequency list

Coverage : n=1,015 (char3_freq_rank: n=1,014 with no frequency information for 俱)

Total stroke number (Zheng et al. 2022)

Variable name : stroke_total_zheng2022

Source : Zheng et al. (2022)

Definition : Total number of strokes across all four characters

Coverage : n=24

Native speaker rating variables

RC 2 ratings (n=196 respondents)

All RC 2 measures used 5-point Likert-type scales administered via online questionnaire. Respondents were L1 Chinese adults (18+ years) who had at least middle school education and were familiar with the term chengyu. Each respondent rated 25 randomly selected expressions from the 500-item pool.

Recognition rating (RC 2)

Variable name : recognition_mean_rc2

Source : RC 2 (original data)

Definition : How well respondents know and understand the expression

Measurement : 5-point scale: 1=不知道,不理解 (don't know, don't understand) to 5=知道,理解 (know, understand)

Coverage : n=500

Chengyu acceptability rating (RC 2)

Variable name : chengyu_acceptability_mean_rc2

Source : RC 2 (original data)

Definition : Whether respondents consider the expression to be a chengyu

Measurement : 5-point Likert scale: 1=肯定不是 (definitely not a chengyu) to 5=肯定是 (definitely a chengyu)

Coverage : n=500

Literal compositionality rating (RC 2)

Variable name : compositionality_literal_mean_rc2

Source : RC 2 (original data)

Definition : The likelihood that someone unfamiliar with the expression but knowing the individual characters could derive its literal meaning (zimian yisi 字面意思)

Measurement : 5-point scale: 1=极不可能 (extremely unlikely) to 5=极有可能 (extremely likely)

Coverage : n=500

Semantic compositionality rating (RC 2)

Variable name : compositionality_semantic_mean_rc2

Source : RC 2 (original data)

Definition : The likelihood that someone unfamiliar with the expression but knowing the individual characters could derive its overall meaning (zhengti hanyi 整体含义)

Measurement : 5-point scale: 1=极不可能 (extremely unlikely) to 5=极有可能 (extremely likely)

Coverage : n=500

Character engagement rating (RC 2)

Variable name : character_engagement_mean_rc2

Source : RC 2 (original data)

Definition : Whether respondents attend to the constituent characters when reading the expression and/or consider the constituent characters useful for meaning recall

Measurement : 5-point scale: 1=不注意,想不起含义 (do not attend to characters, cannot recall meaning) to 5=注意,很有帮助 (attend to characters, very useful for meaning recall)

Coverage : n=500

RC 3 experimental measures (n=59 participants)

RC 3 employed a timed binary judgement task where participants made rapid decisions about whether expressions were chengyu or not. Participants were L1 Chinese adults (18+ years) who had at least middle school education and were familiar with the term chengyu. Each participant judged 96 expressions (48 dictionary-listed chengyu and 48 non-chengyu) using keyboard responses.

Chengyu acceptability proportion (RC 3)

Variable name : chengyu_acceptability_prop_rc3

Source : RC 3 (original data)

Definition : Proportion of participants who judged the expression as a chengyu

Measurement : Keyboard response (A or L key, counterbalanced across participants). Participants with even subject numbers: A=is not chengyu, L=is chengyu; odd subject numbers: reversed mapping

Note : Task instructions emphasised responding as quickly and accurately as possible to capture intuitive judgements

Coverage : n=48 (subset of dictionary-listed chengyu)

Dictionary agreement proportion (RC 3)

Variable name : dictionary_agreement_prop_rc3

Source : RC 3 (derived variable)

Definition : Proportion of participants whose chengyu judgement matched the authoritative dictionary classification

Measurement : Binary (agrees/disagrees) for each participant, calculated post-hoc by comparing participant judgement to dictionary status, then aggregated as proportion

Coverage : n=48
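
The two RC 3 proportions can be sketched from raw keypresses as below. The counterbalancing rule follows the Measurement field for chengyu_acceptability_prop_rc3; the function names, the data layout and the four responses are invented for illustration and are not taken from the experimental files.

```python
def is_chengyu_response(subject_number: int, key: str) -> bool:
    """Decode an A/L keypress using the counterbalanced key mapping."""
    key = key.upper()
    if key not in ("A", "L"):
        raise ValueError(f"unexpected key: {key!r}")
    if subject_number % 2 == 0:
        return key == "L"   # even subjects: A = is not chengyu, L = is chengyu
    return key == "A"       # odd subjects: reversed mapping

def proportion(flags) -> float:
    """Share of True values in a sequence of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)

# Invented responses to one dictionary-listed item: (subject number, key).
responses = [(1, "A"), (2, "L"), (3, "L"), (4, "L")]
dictionary_says_chengyu = True

judged = [is_chengyu_response(s, k) for s, k in responses]
acceptability = proportion(judged)
agreement = proportion(j == dictionary_says_chengyu for j in judged)
print(acceptability, agreement)
```

In this sketch the two proportions coincide because the item is dictionary-listed; they would differ for an item the dictionary does not list.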

Reaction time (RC 3)

Variable name : reaction_time_ms_rc3

Source : RC 3 (original data)

Definition : Mean time from expression presentation to keyboard response

Measurement : Milliseconds. Overall mean RT=1,133ms, median RT=914ms. Outlier trials (>4,000ms or >3 SD from participant mean) excluded

Note : Fast RTs suggest predominantly intuitive rather than deliberative processing

Coverage : n=48 expressions (averaged across 59 participants = 5,385 valid trials after outlier removal)
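
The outlier rule stated under Measurement can be sketched for a single participant as follows. Several details are assumptions because the documentation does not specify them: the 4,000 ms ceiling is applied first, the mean and sample standard deviation are then computed on the remaining trials, and deviations are treated as two-sided. The function name `clean_rts` and the reaction times are invented.

```python
from statistics import mean, stdev

def clean_rts(rts_ms, ceiling_ms=4000, sd_limit=3.0):
    """Drop trials above the ceiling, then trials beyond sd_limit SDs of the mean."""
    kept = [rt for rt in rts_ms if rt <= ceiling_ms]
    if len(kept) < 2:
        return kept  # too few trials to estimate a standard deviation
    centre, spread = mean(kept), stdev(kept)
    return [rt for rt in kept if abs(rt - centre) <= sd_limit * spread]

# Invented trials for one participant: 19 typical responses and two slow ones.
trials = [900] * 19 + [3000, 5000]
cleaned = clean_rts(trials)
print(len(trials), len(cleaned), mean(cleaned))
```

Here the 5,000 ms trial is removed by the ceiling and the 3,000 ms trial by the standard deviation criterion, leaving the 19 typical responses.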

Reaction time log (RC 3)

Variable name : reaction_time_log_rc3

Source : RC 3 (derived)

Definition : Log-transformed reaction times

Coverage : n=48

Compositionality (RC 3)

Variable name : compositionality_rc3

Source : Coded by researcher

Definition : Whether the expression's meaning can be derived from its constituent characters

Categories : Binary (compositional vs noncompositional)

Note : Independent variable (stimulus characteristic) coded by researcher, not rated by participants

Coverage : n=48 (RC 3 stimuli)

Published descriptive norms

Familiarity rating (Zheng 2019)

Variable name : familiarity_zheng2019

Source : Zheng (2019)

Definition : How often speakers encounter an idiom based on personal experience

Measurement : 5-point Likert scale: 1=never heard, read, or produced; 5=heard, read, or produced very often

Coverage : n=237

Meaningfulness rating (Zheng 2019)

Variable name : meaningfulness_zheng2019

Source : Zheng (2019)

Definition : How well speakers believe they know the figurative meaning

Measurement : 5-point scale: 1=absolutely no idea what the idiom means; 5=100% certain of meaning and could explain explicitly

Coverage : n=237

Compositionality rating (Zheng 2019)

Variable name : compositionality_zheng2019

Source : Zheng (2019)

Definition : Extent to which literal meanings of constituent characters relate to overall figurative meaning

Measurement : 5-point scale: 1=absolutely not decomposable; 5=completely decomposable. Participants received a figurative definition alongside each item and rated whether the constituent parts contribute to the expression's meaning

Note : Follows the approach of Bonin, Méot, & Bugaiska (2013)

Coverage : n=237

Compositionality mean (Zhang & Ji 2016)

Variable name : compositionality_mean_zhangji2016

Source : Zhang & Ji (2016)

Definition : Mean rating of degree to which idiomatic meaning can be inferred from literal interpretation

Measurement : 5-point scale: 1=not related; 5=very related. Forty informants assessed semantic relevance between refined literal interpretations and commonly-used idiomatic meanings

Coverage : n=146

Literality rating (Zheng 2019)

Variable name : literality_zheng2019

Source : Zheng (2019)

Definition : Possibility that an idiom could be used literally in the real world

Measurement : 5-point scale: 1=absolutely not plausible; 5=completely plausible

Coverage : n=237

Predictability rating (Zheng 2019)

Variable name : predictability_zheng2019

Source : Zheng (2019)

Definition : Likelihood that speakers complete an idiom fragment idiomatically

Measurement : Proportion of participants who completed a three-character fragment (with the last character missing) with the expected final character to form the idiom

Note : Participants typed the first word that came to mind to make the phrase grammatical and meaningful

Coverage : n=237

Register rating (Zheng 2019)

Variable name : register_zheng2019

Source : Zheng (2019)

Definition : Which language register (written/formal or spoken/informal) the idiom belongs to

Measurement : Binary: 1=written (formal) language; 0=spoken (informal) language

Coverage : n=237

L2 knowledge rating (Zheng et al. 2022)

Variable name : l2_knowledge_rating_zheng2022

Source : Zheng et al. (2022)

Definition : Likelihood that advanced L2 Chinese learners would know the idiom

Measurement : 5-point scale rated by 33 experienced Chinese language teachers

Coverage : n=24

Frequency corpus details

Raw frequency data comes from four major Chinese corpora. Each corpus has different strengths and coverage patterns, which is why normalised frequency measures (per million characters) are provided to enable direct comparison. The composite frequency classification (for RC 3) used multiple sources to ensure robust categorisation.

BCC (BLCU Chinese Corpus)

Size : 9.5 billion characters (Global wordlist combines: news 2 billion, literature 3 billion, comprehensive 1.9 billion, dialogue 600 million, plus classical Chinese and technology subcorpora)
Content : News, social media, literature, technology, blog
Year compiled : 2019
Reference : Xun et al. (2015, 2016)
URL : http://bcc.blcu.edu.cn

CCL (Peking University Center for Chinese Linguistics)

Size : 4.75 billion characters (modern Chinese portion of 2024 version; total corpus including classical Chinese and bilingual texts: 6 billion characters)
Content : News, internet texts, literature, biographies, TV/movies, translations
Year compiled : 2024 version (earlier versions: 2003-2014)
Reference : Zhan et al. (2019)
URL : http://ccl.pku.edu.cn:8080/ccl_corpus

zhTenTen17 (Chinese Web 2017)

Size : 16.59 billion tokens (13.53 billion words segmented, 667 million sentences, 40.23 million documents)
Content : Simplified Chinese web texts covering diverse genres, topics, text types and sources
Year compiled : August and November-December 2017
Source : Jakubíček et al. (2013), Sketch Engine (2025b)
URL : https://www.sketchengine.eu/zhtenten-chinese-corpus/

GigaWord 2 (Chinese Simplified Gigaword 2)

Size : 250.12 million tokens (205 million words segmented, 10.61 million sentences, 817,348 documents)
Content : Newswire texts (96% stories, 4% other document types)
Year compiled : 2005
Source : Graff and Chen (2003), Sketch Engine (2025a)
URL : https://catalog.ldc.upenn.edu/LDC2005T14

Comparison and usage notes

The four corpora differ substantially in:

Size : zhTenTen17 and BCC are very large (>9 billion characters); CCL is moderate (4.75 billion); GigaWord 2 is relatively small (250 million)
Text types : BCC and CCL include diverse genres; zhTenTen17 focuses on web texts; GigaWord 2 is newswire only
Collection period : Spans from 2002 (GigaWord 2) to 2019 (BCC), capturing language change over time
Coverage of chengyu : Larger corpora generally have better coverage of lower-frequency chengyu, but all four corpora have gaps

For these reasons, the normalised frequencies (per million characters) enable meaningful comparison across corpora, and the composite frequency classification used for RC 3 provides a more robust measure than any single corpus.

References

"Chengyu Da Cidian" editorial board 《成语大词典》编委会 (Ed.). (2020). Chengyu da cidian (caise ben) 成语大词典 (彩色本) [ Great dictionary of chengyu (colour version) ] (Xin xiuding ban 新修订版 [Rev. ed.]). Shangwu Yinshuguan 商务印书馆 [Commercial Press].

Da, J. (2004). A corpus-based study of character and bigram frequencies in Chinese e-texts and its implications for Chinese language instruction. In P. Zhang, T. Xie, & J. Xu (Eds.), The studies on the theory and methodology of the digitalized Chinese teaching to foreigners: Proceedings of the fourth international conference on new technologies in teaching and learning Chinese (pp. 501–511). Tsinghua University Press. https://lingua.mtsu.edu/academic/dajun-4thtech.pdf

Davey, J. (2026a). Chengyu dataset: Information and native speaker ratings [Dataset]. OSF. https://doi.org/10.17605/OSF.IO/3TDA7

Davey, J. (2026b). What makes a chengyu? Native speaker intuitions for categorising Chinese idiomatic expressions [Doctoral dissertation submitted for examination, Australian National University].

Dictionary Editing Office, Institute of Linguistics, Chinese Academy of Social Sciences 中国社会科学院语言研究所词典编辑室. (2014). Xiandai Hanyu cidian 现代汉语词典 [Contemporary Chinese dictionary] (6th ed.). Shangwu Yinshuguan 商务印书馆 [Commercial Press].

Graff, D., & Chen, K. (2003). Chinese Gigaword LDC2003T09. Web Download. Linguistic Data Consortium. https://doi.org/10.35111/n069-0642

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125–127). https://www.sketchengine.eu/wp-content/uploads/The_TenTen_Corpus_2013.pdf

Jiao, L., Kubler, C. C., & Zhang, W. (2011). 500 common Chinese idioms: An annotated frequency dictionary. Routledge.

Jnqz. (2025). Chengyu cidian 成语词典 [Chengyu dictionary]. Jinan Di Qi Zhongxue Chengyu Wang 济南第七中学成语网 [Jinan No. 7 Middle School chengyu website]. https://www.jnqz.cn/

Li, D., Zhang, Y., & Wang, X. (2016). Descriptive norms for 350 Chinese idioms with seven syntactic structures. Behavior Research Methods, 48(4), 1678–1693. https://doi.org/10.3758/s13428-015-0692-y

Liu, L., & Cheung, H. T. (2014). Acquisition of Chinese quadra-syllabic idiomatic expressions: Effects of semantic opacity and structural symmetry. First Language, 34(4), 336–353. https://doi.org/10.1177/0142723714544409

Liu Z. 刘振前, & Xing M. 邢梅萍. (2000). Hanyu sizige chengyu yuyi jiegou de duichenxing he renzhi 汉语四字格成语语义结构的对称性和认知 [The semantic symmetrical features of four-character chengyu in Chinese and their effects on cognition]. Shijie Hanyu Jiaoxue 世界汉语教学 [Chinese Teaching in the World], 1, 77–81.

Pleco Inc. (2025). Pleco Chinese dictionary (Version 3.2.76 Mobile app) [Computer software]. Apple App Store. https://apps.apple.com/us/app/pleco-chinese-dictionary/

Sketch Engine. (2025a). Chinese Gigaword corpus. https://www.sketchengine.eu/chinese-gigaword/

Sketch Engine. (2025b). ZhTenTen – Chinese corpus from the web. https://www.sketchengine.eu/zhtenten-chinese-corpus

Xun, E. 荀恩东, Rao, G. 饶高琦, Xiao, X. 肖晓悦, & Zang, J. 臧娇娇. (2016). Dashuju Beijingxia BCC yuliaoku de yanzhi 大数据背景下BCC语料库的研制 [The construction of the BCC corpus in the age of big data]. Yuliaoku Yuyanxue 语料库语言学 [Corpus Linguistics], 3(1), 93–118.

Xun, E. 荀恩东, Rao, G. 饶高琦, Xie, J. 谢佳丽, & Huang, Z. 黄志斌. (2015). Xiandai Hanyu cihui lishi jiansuo xitong de jianshe yu yingyong 现代汉语词汇历时检索系统的建设与应用 [Diachronic retrieval for modern Chinese word: System construction and its application]. Zhongwen Xinxi Xuebao 中文信息学报 [Chinese Journal of Information Processing], 29(3), 169–176.

Zdic 汉典. (2025). https://www.zdic.net/

Zhan, W. 詹卫东, Guo, R. 郭锐, Chang, B. 常宝宝, Chen, Y. 陈怡然, & Chen, L. 陈龙. (2019). Beijing daxue CCL yuliaoku de yanzhi 北京大学CCL语料库的研制 [The building of the CCL corpus: Its design and implementation]. Yuliaoku Yuyanxue 语料库语言学 [Corpus Linguistics], 6(1), 71-86+116.

Zhang, H., & Ji, F. (2016). Compositionality as a prototypical category: Classifying Chinese four-character idioms. Language and Cognitive Science, 2(1), 69–97. https://doi.org/10.35534/LCS201602004

Zheng, H. (2019). The processing of two types of Chinese idioms by L1 and L2 speakers [Doctoral dissertation, University of Illinois at Urbana-Champaign]. https://hdl.handle.net/2142/105181

Zheng, H., Hu, B., & Xu, J. (2022). The development of formulaic knowledge in super-advanced Chinese language learners: Evidence from processing accuracy, speed, and strategies. Frontiers in Psychology, 13, 796784. https://doi.org/10.3389/fpsyg.2022.796784
