Full variable descriptions and data sources
This dataset collates information about 1,015 chengyu 成语 'Chinese idiomatic expressions', including expression-level information and native speaker ratings. The dataset was compiled as part of my PhD research (Davey, 2026b) investigating how first-language (L1) Chinese language users understand chengyu as a category.
Data is collated from multiple sources:
When using this dataset, please cite:
Davey, J. (2026a). Chengyu dataset: Information and native speaker ratings [Dataset]. OSF. https://doi.org/10.17605/OSF.IO/3TDA7
This dataset was compiled as part of the following research:
Davey, J. (2026b). What makes a chengyu? Native speaker intuitions for categorising Chinese idiomatic expressions [Doctoral dissertation submitted for examination, Australian National University].
When reporting findings based on this dataset, please cite the dataset itself (above). The thesis citation provides theoretical and methodological context for the research design and data collection.
When using specific subsets of data from this dataset, please acknowledge the original sources in addition to citing this dataset. For example:
Source information for all variables is provided below.
Data availability : This dataset is made freely available for non-commercial research and educational purposes, licensed under a Creative Commons CC-BY-NC 4.0 Licence.
Coverage : The dataset contains information about 1,015 chengyu, including the 500 expressions in Jiao et al. (2011) plus another 515 expressions from my PhD research (Davey, 2026b) and other published sources. Some measures are available for all expressions, while others cover specific subsets as detailed below.
The dataset includes variables with varying coverage across the 1,015 expressions. This reflects the integration of data from multiple independent studies with different research aims and sampling strategies. Partial coverage does not indicate missing data but rather the scope of each original study.
Partial data serves multiple purposes:
When using this dataset:
Dataset version : 1.0
Last updated : 19/01/2026
Compiled by : Janet Davey
Contact : main.gem4761@fastmail.com
For questions about this dataset or to report errors, please contact main.gem4761@fastmail.com.
Variable name : expression
Source : Multiple
Definition : The chengyu in simplified Chinese characters
Coverage : n=1,015
Variable name : pinyin
Source : Multiple
Definition : Romanisation of the expression using Hanyu Pinyin
Coverage : n=1,015
Variable name : explanation_zdic
Source : Zdic 汉典 (Zdic, 2025)
Definition : Explanation of the expression's meaning in Chinese
Coverage : n=1,015
Variable name : explanation_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Explanation of the expression's meaning in Chinese
Coverage : n=1,008
Variable name : translation_jiao
Source : Jiao et al. (2011)
Definition : English translation
Coverage : n=500
Variable name : translation_pleco
Source : Pleco Chinese Dictionary
Definition : English translation or gloss
Coverage : n=1,015
Variable name : near_synonyms_zdic
Source : Zdic 汉典 (Zdic, 2025)
Definition : Expressions with similar meanings
Coverage : n=881
Variable name : near_synonyms_jiao
Source : Jiao et al. (2011)
Definition : Expressions with similar meanings
Coverage : n=461
Variable name : near_synonyms_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Expressions with similar meanings
Coverage : n=998
Variable name : antonyms_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Expressions with opposite meanings
Coverage : n=976
Variable name : sentiment_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Emotional colouring or sentiment of the expression
Categories : 褒义成语 (commendatory/positive), 中性成语 (neutral), 贬义成语 (derogatory/negative)
Coverage : n=1,004
Variable name : syntactic_structure_zdic
Source : Zdic 汉典 (Zdic, 2025)
Definition : Syntactic structure classification using Chinese grammatical categories
Categories : 联合式 (coordinate structure), 主谓式 (subject-predicate structure), 偏正式 (modifier-modified structure), 动宾式 (verb-object structure), 连动式 (serial verb structure), 补充式 (complement structure), 紧缩式 (abbreviated structure), 复句式 (compound structure), 复杂式 (complex structure)
Coverage : n=901
Variable name : syntactic_structure_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Syntactic structure classification using Chinese grammatical categories
Categories : 联合式 (coordinate structure), 主谓式 (subject-predicate structure), 偏正式 (modifier-modified structure), 动宾式 (verb-object structure), 连动式 (serial verb structure), 补充式 (complement structure), 紧缩式 (abbreviated structure), 复句式 (compound structure), 复杂式 (complex structure)
Coverage : n=1,005
Note : Syntactic structure classifications are included from both 成语词典 (Jnqz, 2025) and Zdic 汉典 (Zdic, 2025) because the categorisation of chengyu syntactic structure is not standardised across reference works. Where the two sources disagree, users can choose the classification that best suits their own analytical framework.
Variable name : structure_li2016
Source : Li et al. (2016)
Definition : Syntactic structure classification following Li et al.'s (2016) categories
Categories : VO (verb-object), SM (structure of modification), SV (subject-predicate), VV (verb-verb), VOVO, SMSM, SVSV (indicating both pairs of characters follow the same pattern)
Coverage : n=350
Variable name : symmetric_structure
Source : Coded by researcher following Liu and Xing (2000) and Liu and Cheung (2014)
Definition : Whether the expression exhibits structural symmetry (duichenxing 对称性)
Categories : Binary (symmetric vs. asymmetric)
Note : Categorised as structurally symmetric when the first two characters and last two characters have identical syntactic constructions with corresponding word classes. This structural symmetry criterion is broader than syntactic parallelism (binglieshi 并列式) alone, encompassing expressions with various semantic relationships (temporal, causal, or coordinative) provided the two halves are equally weighted.
Coverage : n=1,015
Variable name : xingshi
Source : 成语词典 (Jnqz, 2025)
Definition : Structural pattern (xingshi 形式) based on character repetition
Categories : ABCD, AABB, AABC, ABAC, ABBC, ABCA, ABCB, ABCC
Coverage : n=1,015
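The xingshi values in this dataset are taken from 成语词典 (Jnqz, 2025). As a rough cross-check, the same kind of label can be derived from the expression itself by lettering each character in order of first appearance. The sketch below is only an illustration: the function name is hypothetical, and the two sample expressions were chosen for clarity and are not guaranteed to appear in the dataset.

```python
def repetition_pattern(expression: str) -> str:
    """Letter each character by order of first appearance (A, B, C, ...)."""
    labels = {}
    pattern = []
    for ch in expression:
        if ch not in labels:
            # First time this character is seen: assign the next letter.
            labels[ch] = chr(ord("A") + len(labels))
        pattern.append(labels[ch])
    return "".join(pattern)


print(repetition_pattern("一心一意"))  # ABAC
print(repetition_pattern("高高兴兴"))  # AABB
```

A derived label that differs from the xingshi column would be worth checking against the source dictionary rather than treated as a correction.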
Variable name : usage_zdic
Source : Zdic 汉典 (Zdic, 2025)
Definition : Usage guidance including grammatical function and pragmatic information about the expression
Coverage : n=901
Variable name : usage_jiao
Source : Jiao et al. (2011)
Definition : Usage guidance including grammatical function and pragmatic information about the expression
Coverage : n=500
Variable name : usage_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Usage guidance including grammatical function and pragmatic information about the expression
Coverage : n=1,004
Variable name : examples_zdic
Source : Zdic 汉典 (Zdic, 2025)
Definition : Example sentences demonstrating usage
Coverage : n=852
Variable name : examples_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Example sentences demonstrating usage
Coverage : n=1,005
Variable name : etymology_zdic
Source : Zdic 汉典 (Zdic, 2025)
Definition : Historical origin or etymological information about the expression, including source texts where applicable
Coverage : n=947
Variable name : etymology_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Historical origin or etymological information about the expression, including source texts where applicable
Coverage : n=1,005
Variable name : dynasty
Source : Coded by researcher based on etymology
Definition : Dynastic period in which the expression is thought to have originated. Where the etymological information in etymology_zdic and etymology_jnqz conflicts, the earliest attested period is used. Expressions from Buddhist texts (e.g., sutras) are coded as 'Buddhist', and those from foreign texts (e.g., Aesop's Fables, Arabian Nights) are coded as 'Foreign'. Left blank if unknown.
Coverage : n=1,005
Variable name : historicity
Source : Coded by researcher based on dynasty and etymology
Definition : Categorisation of expressions into historical eras
Categories : Ancient (–221 BCE; pre-Qin), Early Imperial (221 BCE–589 CE; Qin, Han, Three Kingdoms, Jin, Northern and Southern Dynasties), Middle Imperial (589–1368 CE; Sui, Tang, Five Dynasties and Ten Kingdoms, Song), Late Imperial (1368–1912 CE; Yuan, Ming, Qing), Modern (1912– )
Coverage : n=1,005
Variable name : freq_bcc
Source : BCC (Beijing Language and Culture University Chinese Corpus)
Definition : Raw token frequency in BCC
Corpus details : 9.5 billion characters; news, social media, literature, technology, blog (compiled 2019)
Note : Frequency data extracted from BCC Global wordlist (combined subcorpora)
Reference : Xun et al. (2015, 2016)
Coverage : n=1,015
Variable name : freq_bcc_per_million
Source : Calculated from freq_bcc
Definition : Normalised frequency per million characters in BCC
Calculation : (freq_bcc / 9,500,000,000) × 1,000,000
Coverage : n=1,015
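The normalisation above can be reproduced as follows. This is a minimal sketch: the corpus sizes are the character counts stated in this document, while the function name, dictionary keys, and the raw count of 19,000 are illustrative assumptions rather than values from the dataset.

```python
# Corpus sizes in characters, as stated in this document.
CORPUS_SIZE = {
    "bcc": 9_500_000_000,
    "ccl": 4_746_907_429,
    "zhtenten": 16_593_146_196,
    "gigaword2": 250_124_230,
}


def per_million(raw_frequency: int, corpus: str) -> float:
    """Normalise a raw token frequency to occurrences per million characters."""
    return raw_frequency / CORPUS_SIZE[corpus] * 1_000_000


# Hypothetical raw count of 19,000 occurrences in BCC.
print(f"{per_million(19_000, 'bcc'):.3f}")  # prints 2.000
```

The same function covers the per-million variables for the other three corpora by changing the corpus key.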
Variable name : freq_level_bcc
Source : Calculated from freq_bcc
Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level.
Coverage : n=1,015
Variable name : freq_ccl
Source : CCL (Peking University Center for Chinese Linguistics)
Definition : Raw token frequency in the CCL modern Chinese corpus
Corpus details : 4.75 billion characters (modern Chinese portion of 2025 version); news, internet texts, literature, biographies, TV/movies, translations
Note : Frequency data obtained through batch query of CCL corpus interface
Reference : Zhan et al. (2019)
Coverage : n=1,015
Variable name : freq_ccl_per_million
Source : Calculated from freq_ccl
Definition : Normalised frequency per million characters in CCL modern Chinese
Calculation : (freq_ccl / 4,746,907,429) × 1,000,000
Coverage : n=1,015
Variable name : freq_level_ccl
Source : Calculated from freq_ccl
Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level.
Coverage : n=1,015
Variable name : freq_zhtenten
Source : zhTenTen17 (Sketch Engine, 2025b)
Definition : Raw token frequency in the zhTenTen17 corpus
Corpus details : 16.59 billion characters; simplified Chinese web texts covering diverse genres (compiled August and November-December 2017)
Note : Frequency data obtained through Sketch Engine batch query
Coverage : n=1,015
Variable name : freq_zhtenten_per_million
Source : Calculated from freq_zhtenten
Definition : Normalised frequency per million characters in zhTenTen17
Calculation : (freq_zhtenten / 16,593,146,196) × 1,000,000
Coverage : n=1,015
Variable name : freq_level_zhtenten
Source : Calculated from freq_zhtenten
Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level.
Coverage : n=1,015
Variable name : freq_gigaword2
Source : GigaWord 2 (Sketch Engine, 2025a)
Definition : Raw token frequency in the GigaWord 2 corpus
Corpus details : 250.12 million characters; newswire texts (compiled 2005)
Note : Frequency data obtained through Sketch Engine batch query
Coverage : n=1,015
Variable name : freq_gigaword2_per_million
Source : Calculated from freq_gigaword2
Definition : Normalised frequency per million characters in GigaWord 2
Calculation : (freq_gigaword2 / 250,124,230) × 1,000,000
Coverage : n=1,015
Variable name : freq_level_gigaword2
Source : Calculated from freq_gigaword2
Definition : Classification of high, medium, or low frequency based on tertile thresholds (33rd and 66th percentiles). See freq_level.
Coverage : n=1,015
Variable name : freq_level
Source : Coded by researcher based on composite corpus frequencies
Definition : Tripartite classification of expressions as high, medium, or low frequency based on overall occurrence across four Chinese corpora: BCC, CCL, zhTenTen17, and GigaWord 2
Measurement : For each corpus, tertile thresholds (33rd and 66th percentiles) were calculated from the per-million-character frequencies of all expressions in the dataset. The tertile values were: BCC (33rd = 1.020, 66th = 2.935), CCL (33rd = 0.254, 66th = 0.791), zhTenTen17 (33rd = 0.255, 66th = 1.064), and GigaWord 2 (33rd = 0.088, 66th = 0.793). Each expression was then classified as high, medium, or low frequency within each corpus based on these tertiles. The final freq_level value is the modal category across the four corpora (i.e., whichever frequency level occurred most often among the four classifications). Ties were resolved conservatively: a tie between high and medium, between medium and low, or between high and low (with no medium classifications) defaulted to medium.
Note : This composite measure aims to account for substantial variation in chengyu prevalence and text types across different corpora while providing an interpretable summary measure. The three-level classification reflects the fact that this dataset is biased towards more recognisable and learner-relevant chengyu suitable for research; a binary high/low classification risked mislabelling genuinely common expressions as 'low frequency' simply because they fell below the median of an already relatively high-frequency set. The tertile-based approach ensures that 'low frequency' is reserved for expressions that are genuinely rare within this collection.
Coverage : n=1,015
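The composite classification can be sketched as below, using the tertile values reported under Measurement. The function names and dictionary keys are hypothetical, and the boundary handling is an assumption: a value exactly on a threshold is treated here as belonging to the lower category, which the description does not specify.

```python
from collections import Counter

# (33rd percentile, 66th percentile) per corpus, per million characters,
# as reported in this document.
TERTILES = {
    "bcc": (1.020, 2.935),
    "ccl": (0.254, 0.791),
    "zhtenten": (0.255, 1.064),
    "gigaword2": (0.088, 0.793),
}


def level_in_corpus(per_million: float, corpus: str) -> str:
    """Classify one per-million frequency within one corpus."""
    low_cut, high_cut = TERTILES[corpus]
    if per_million <= low_cut:  # assumption: boundary values count as low
        return "low"
    if per_million > high_cut:  # assumption: boundary values count as medium
        return "high"
    return "medium"


def composite_level(per_million_by_corpus: dict) -> str:
    """Modal level across corpora; any tie defaults to medium."""
    levels = [level_in_corpus(v, c) for c, v in per_million_by_corpus.items()]
    counts = Counter(levels)
    top = max(counts.values())
    winners = [level for level, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else "medium"


# Hypothetical frequencies: high, high, medium, low -> modal level is high.
print(composite_level({"bcc": 3.5, "ccl": 0.9, "zhtenten": 0.5, "gigaword2": 0.05}))
# Hypothetical frequencies: high, high, low, low -> tie, so medium.
print(composite_level({"bcc": 3.5, "ccl": 0.9, "zhtenten": 0.1, "gigaword2": 0.05}))
```

With four corpora, the only possible ties are two-against-two splits, so a single tie rule is enough.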
Variable name : commonality_jnqz
Source : 成语词典 (Jnqz, 2025)
Definition : Commonality or frequency of use indicator
Categories : 常用成语 (common chengyu), 一般成语 (general/ordinary chengyu)
Coverage : n=1,000
Variable name : freq_log10_zheng2022
Source : Zheng et al. (2022)
Definition : Log₁₀-transformed page count from Baidu.com
Measurement : "We adopted Libben and Titone's (2008) method using the most popular Chinese website search engine [Baidu] as the dataset, and we employed the log-transformed page count to represent the whole-form frequency" (Zheng et al., 2022, pp. 5-6)
Coverage : n=24
Variable name : char1, char2, char3, char4
Source : Derived from expression
Definition : The four individual characters comprising the expression
Coverage : n=1,015
Variable name : char1_freq_rank, char2_freq_rank, char3_freq_rank, char4_freq_rank
Source : Da (2004)
Definition : Frequency rank of each constituent character based on Jun Da's character frequency list
Coverage : n=1,015 (char3_freq_rank: n=1,014 with no frequency information for 俱)
Variable name : stroke_total_zheng2022
Source : Zheng et al. (2022)
Definition : Total number of strokes across all four characters
Coverage : n=24
All RC 2 measures used 5-point Likert-type scales administered via online questionnaire. Respondents were L1 Chinese adults (18+ years) who had at least middle school education and were familiar with the term chengyu. Each respondent rated 25 randomly selected expressions from the 500-item pool.
Variable name : recognition_mean_rc2
Source : RC 2 (original data)
Definition : How well respondents know and understand the expression
Measurement : 5-point scale: 1=不知道,不理解 (don't know, don't understand) to 5=知道,理解 (know, understand)
Coverage : n=500
Variable name : chengyu_acceptability_mean_rc2
Source : RC 2 (original data)
Definition : Whether respondents consider the expression to be a chengyu
Measurement : 5-point Likert scale: 1=肯定不是 (definitely not a chengyu) to 5=肯定是 (definitely a chengyu)
Coverage : n=500
Variable name : compositionality_literal_mean_rc2
Source : RC 2 (original data)
Definition : The likelihood that someone unfamiliar with the expression but knowing the individual characters could derive its literal meaning (zimian yisi 字面意思)
Measurement : 5-point scale: 1=极不可能 (extremely unlikely) to 5=极有可能 (extremely likely)
Coverage : n=500
Variable name : compositionality_semantic_mean_rc2
Source : RC 2 (original data)
Definition : The likelihood that someone unfamiliar with the expression but knowing the individual characters could derive its overall meaning (zhengti hanyi 整体含义)
Measurement : 5-point scale: 1=极不可能 (extremely unlikely) to 5=极有可能 (extremely likely)
Coverage : n=500
Variable name : character_engagement_mean_rc2
Source : RC 2 (original data)
Definition : Whether respondents attend to the constituent characters when reading the expression and/or consider the constituent characters useful for meaning recall
Measurement : 5-point scale: 1=不注意,想不起含义 (do not attend to characters, cannot recall meaning) to 5=注意,很有帮助 (attend to characters, very useful for meaning recall)
Coverage : n=500
RC 3 employed a timed binary judgement task where participants made rapid decisions about whether expressions were chengyu or not. Participants were L1 Chinese adults (18+ years) who had at least middle school education and were familiar with the term chengyu. Each participant judged 96 expressions (48 dictionary-listed chengyu and 48 non-chengyu) using keyboard responses.
Variable name : chengyu_acceptability_prop_rc3
Source : RC 3 (original data)
Definition : Proportion of participants who judged the expression as a chengyu
Measurement : Keyboard response (A or L key, counterbalanced across participants). Participants with even subject numbers: A=is not chengyu, L=is chengyu; odd subject numbers: reversed mapping
Note : Task instructions emphasised responding as quickly and accurately as possible to capture intuitive judgements
Coverage : n=48 (subset of dictionary-listed chengyu)
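The counterbalanced key mapping described under Measurement can be decoded as in the sketch below. The function names and the sample responses are hypothetical; only the mapping itself (even subject numbers: L = is chengyu; odd subject numbers: A = is chengyu) comes from this document.

```python
def judged_as_chengyu(subject_number: int, key: str) -> bool:
    """Decode a key press into a 'this is a chengyu' judgement."""
    key = key.upper()
    if key not in ("A", "L"):
        raise ValueError(f"unexpected key: {key!r}")
    # Even subject numbers: L means 'is chengyu'; odd numbers: reversed.
    yes_key = "L" if subject_number % 2 == 0 else "A"
    return key == yes_key


def proportion_chengyu(responses: list) -> float:
    """Share of (subject_number, key) responses that judged the item a chengyu."""
    judgements = [judged_as_chengyu(s, k) for s, k in responses]
    return sum(judgements) / len(judgements)


# Hypothetical responses to one expression from four participants.
responses = [(1, "A"), (2, "L"), (3, "L"), (4, "L")]
print(proportion_chengyu(responses))  # 0.75
```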
Variable name : dictionary_agreement_prop_rc3
Source : RC 3 (derived variable)
Definition : Proportion of participants whose chengyu judgement matched the authoritative dictionary classification
Measurement : Binary (agrees/disagrees) for each participant, calculated post-hoc by comparing participant judgement to dictionary status, then aggregated as proportion
Coverage : n=48
Variable name : reaction_time_ms_rc3
Source : RC 3 (original data)
Definition : Mean time from expression presentation to keyboard response
Measurement : Milliseconds. Overall mean RT=1,133ms, median RT=914ms. Outlier trials (>4,000ms or >3 SD from participant mean) excluded
Note : Fast RTs suggest predominantly intuitive rather than deliberative processing
Coverage : n=48 expressions (averaged across 59 participants = 5,385 valid trials after outlier removal)
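One way the outlier rule under Measurement might be applied to a single participant's trials is sketched below. Several details are assumptions not stated in this document: the 4,000 ms ceiling is applied first, the participant mean and SD are then computed on the remaining trials, the 3 SD criterion is applied in both directions, and the sample standard deviation is used. The function name and sample values are hypothetical.

```python
from statistics import mean, stdev


def keep_valid_trials(rts_ms: list, ceiling_ms: float = 4000, sd_limit: float = 3.0) -> list:
    """Drop trials above the ceiling, then trials beyond sd_limit SDs of the mean."""
    under_ceiling = [rt for rt in rts_ms if rt <= ceiling_ms]
    if len(under_ceiling) < 2:
        # An SD cannot be computed from fewer than two trials.
        return under_ceiling
    participant_mean = mean(under_ceiling)
    participant_sd = stdev(under_ceiling)  # assumption: sample SD
    return [
        rt for rt in under_ceiling
        if abs(rt - participant_mean) <= sd_limit * participant_sd
    ]


# Hypothetical participant: twenty 900 ms trials plus two slow responses.
trials = [900] * 20 + [3000, 5000]
print(len(keep_valid_trials(trials)))  # 20
```

Here the 5,000 ms trial is removed by the ceiling and the 3,000 ms trial by the SD criterion, leaving the twenty 900 ms trials.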
Variable name : reaction_time_log_rc3
Source : RC 3 (derived)
Definition : Log-transformed reaction times
Coverage : n=48
Variable name : compositionality_rc3
Source : Coded by researcher
Definition : Whether the expression's meaning can be derived from its constituent characters
Categories : Binary (compositional vs noncompositional)
Note : Independent variable (stimulus characteristic) coded by researcher, not rated by participants
Coverage : n=48 (RC 3 stimuli)
Variable name : familiarity_zheng2019
Source : Zheng (2019)
Definition : How often speakers encounter an idiom based on personal experience
Measurement : 5-point Likert scale: 1=never heard, read, or produced; 5=heard, read, or produced very often
Coverage : n=237
Variable name : meaningfulness_zheng2019
Source : Zheng (2019)
Definition : How well speakers believe they know the figurative meaning
Measurement : 5-point scale: 1=absolutely no idea what the idiom means; 5=100% certain of meaning and could explain explicitly
Coverage : n=237
Variable name : compositionality_zheng2019
Source : Zheng (2019)
Definition : Extent to which literal meanings of constituent characters relate to overall figurative meaning
Measurement : 5-point scale: 1=absolutely not decomposable; 5=completely decomposable. Participants received figurative definition alongside each item and rated whether constituent parts contribute to expression meaning
Note : Adopted from Bonin, Méot, & Bugaiska (2013) approach
Coverage : n=237
Variable name : compositionality_mean_zhangji2016
Source : Zhang & Ji (2016)
Definition : Mean rating of degree to which idiomatic meaning can be inferred from literal interpretation
Measurement : 5-point scale: 1=not related; 5=very related. Forty informants assessed semantic relevance between refined literal interpretations and commonly-used idiomatic meanings
Coverage : n=146
Variable name : literality_zheng2019
Source : Zheng (2019)
Definition : Possibility that an idiom could be used literally in the real world
Measurement : 5-point scale: 1=absolutely not plausible; 5=completely plausible
Coverage : n=237
Variable name : predictability_zheng2019
Source : Zheng (2019)
Definition : Likelihood that speakers complete an idiom fragment idiomatically
Measurement : Proportion of participants who completed three-character fragment (with last character missing) with the expected final character to form the idiom
Note : Participants typed first word coming to mind to make phrase grammatical and meaningful
Coverage : n=237
Variable name : register_zheng2019
Source : Zheng (2019)
Definition : Which language register (written/formal or spoken/informal) the idiom belongs to
Measurement : Binary: 1=written (formal) language; 0=spoken (informal) language
Coverage : n=237
Variable name : l2_knowledge_rating_zheng2022
Source : Zheng et al. (2022)
Definition : Likelihood that advanced L2 Chinese learners would know the idiom
Measurement : 5-point scale rated by 33 experienced Chinese language teachers
Coverage : n=24
Raw frequency data comes from four major Chinese corpora. Each corpus has different strengths and coverage patterns, which is why normalised frequency measures (per million characters) are provided to enable direct comparison. The composite frequency classification (for RC 3) used multiple sources to ensure robust categorisation.
The four corpora differ substantially in:
For these reasons, the normalised frequencies (per million characters) enable meaningful comparison across corpora, and the composite frequency classification used for RC 3 provides a more robust measure than any single corpus.
"Chengyu Da Cidian" editorial board 《成语大词典》编委会 (Ed.). (2020). Chengyu da cidian (caise ben) 成语大词典 (彩色本) [Great dictionary of chengyu (colour version)] (Xin xiuding ban 新修订版 [Rev. ed.]). Shangwu Yinshuguan 商务印书馆 [Commercial Press].
Da, J. (2004). A corpus-based study of character and bigram frequencies in Chinese e-texts and its implications for Chinese language instruction. In P. Zhang, T. Xie, & J. Xu (Eds.), The studies on the theory and methodology of the digitalized Chinese teaching to foreigners: Proceedings of the fourth international conference on new technologies in teaching and learning Chinese (pp. 501–511). Tsinghua University Press. https://lingua.mtsu.edu/academic/dajun-4thtech.pdf
Davey, J. (2026a). Chengyu dataset: Native speaker ratings and linguistic information [Dataset]. OSF. https://doi.org/10.17605/OSF.IO/3TDA7
Davey, J. (2026b). What makes a chengyu? Native speaker intuitions for categorising Chinese idiomatic expressions [Doctoral dissertation submitted for examination, Australian National University].
Dictionary Editing Office, Institute of Linguistics, Chinese Academy of Social Sciences 中国社会科学院语言研究所词典编辑室. (2014). Xiandai Hanyu cidian 现代汉语词典 [Contemporary Chinese dictionary] (6th ed.). Shangwu Yinshuguan 商务印书馆 [Commercial Press].
Graff, D., & Chen, K. (2003). Chinese Gigaword LDC2003T09. Web Download. Linguistic Data Consortium. https://doi.org/10.35111/n069-0642
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125–127). https://www.sketchengine.eu/wp-content/uploads/The_TenTen_Corpus_2013.pdf
Jiao, L., Kubler, C. C., & Zhang, W. (2011). 500 common Chinese idioms: An annotated frequency dictionary. Routledge.
Jnqz. (2025). Chengyu cidian 成语词典 [Chengyu dictionary]. Jinan Di Qi Zhongxue Chengyu Wang 济南第七中学成语网 [Jinan No. 7 Middle School chengyu website]. https://www.jnqz.cn/
Li, D., Zhang, Y., & Wang, X. (2016). Descriptive norms for 350 Chinese idioms with seven syntactic structures. Behavior Research Methods, 48(4), 1678–1693. https://doi.org/10.3758/s13428-015-0692-y
Liu, L., & Cheung, H. T. (2014). Acquisition of Chinese quadra-syllabic idiomatic expressions: Effects of semantic opacity and structural symmetry. First Language, 34(4), 336–353. https://doi.org/10.1177/0142723714544409
Liu Z. 刘振前, & Xing M. 邢梅萍. (2000). Hanyu sizige chengyu yuyi jiegou de duichenxing he renzhi 汉语四字格成语语义结构的对称性和认知 [The semantic symmetrical features of four-character chengyu in Chinese and their effects on cognition]. Shijie Hanyu Jiaoxue 世界汉语教学 [Chinese Teaching in the World], 1, 77–81.
Pleco Inc. (2025). Pleco Chinese dictionary (Version 3.2.76 Mobile app) [Computer software]. Apple App Store. https://apps.apple.com/us/app/pleco-chinese-dictionary/
Sketch Engine. (2025a). Chinese Gigaword corpus. https://www.sketchengine.eu/chinese-gigaword/
Sketch Engine. (2025b). ZhTenTen – Chinese corpus from the web. https://www.sketchengine.eu/zhtenten-chinese-corpus
Xun, E. 荀恩东, Rao, G. 饶高琦, Xiao, X. 肖晓悦, & Zang, J. 臧娇娇. (2016). Dashuju Beijingxia BCC yuliaoku de yanzhi 大数据背景下BCC语料库的研制 [The construction of the BCC corpus in the age of big data]. Yuliaoku Yuyanxue 语料库语言学 [Corpus Linguistics], 3(1), 93–118.
Xun, E. 荀恩东, Rao, G. 饶高琦, Xie, J. 谢佳丽, & Huang, Z. 黄志斌. (2015). Xiandai Hanyu cihui lishi jiansuo xitong de jianshe yu yingyong 现代汉语词汇历时检索系统的建设与应用 [Diachronic retrieval for modern Chinese word: System construction and its application]. Zhongwen Xinxi Xuebao 中文信息学报 [Chinese Journal of Information Processing], 29(3), 169–176.
Zdic 汉典. (2025). https://www.zdic.net/
Zhan, W. 詹卫东, Guo, R. 郭锐, Chang, B. 常宝宝, Chen, Y. 陈怡然, & Chen, L. 陈龙. (2019). Beijing daxue CCL yuliaoku de yanzhi 北京大学CCL语料库的研制 [The building of the CCL corpus: Its design and implementation]. Yuliaoku Yuyanxue 语料库语言学 [Corpus Linguistics], 6(1), 71–86+116.
Zhang, H., & Ji, F. (2016). Compositionality as a prototypical category: Classifying Chinese four-character idioms. Language and Cognitive Science, 2(1), 69–97. https://doi.org/10.35534/LCS201602004
Zheng, H. (2019). The processing of two types of Chinese idioms by L1 and L2 speakers [Doctoral dissertation, University of Illinois at Urbana-Champaign]. https://hdl.handle.net/2142/105181
Zheng, H., Hu, B., & Xu, J. (2022). The development of formulaic knowledge in super-advanced Chinese language learners: Evidence from processing accuracy, speed, and strategies. Frontiers in Psychology, 13, 796784. https://doi.org/10.3389/fpsyg.2022.796784