COMPUTATIONAL CORPUS LINGUISTICS: A SYSTEMATIC LITERATURE REVIEW OF METHODOLOGICAL INTEGRATION, TECHNOLOGICAL INNOVATIONS, AND ARTIFICIAL INTELLIGENCE APPLICATIONS (2015-2025)

Authors

  • Candiawan Telambanua, Bali Business School

Keywords:

computational corpus linguistics, natural language processing, machine learning, artificial intelligence, deep learning, large language models.

Abstract

This systematic literature review explores the convergence of computational methodologies and corpus linguistics between 2015 and 2025, synthesizing insights from 52 empirical studies retrieved from authoritative academic databases, including the ACL Anthology, Scopus, and Web of Science. The review examines how advances in natural language processing, machine learning, and artificial intelligence have reshaped the theoretical and methodological foundations of corpus linguistics, enabling analysis of massive textual datasets that exceed the capacity of traditional corpus tools. Findings reveal five dominant computational innovations driving this transformation: deep learning for automated annotation and classification, text similarity modeling for large-scale corpus comparison, topic modeling and distributional semantics for linguistic pattern discovery, neural machine translation for multilingual corpus processing, and large language model integration for corpus construction and analytical enhancement. These developments mark substantial progress in computational efficiency, scalability to billion-word corpora, and automation of formerly manual linguistic tasks. Nevertheless, the review identifies persistent challenges related to algorithmic bias, interpretability, ethical responsibility in automated language analysis, and the widening digital divide affecting under-resourced languages. Theoretically, the study maps emerging synergies between computational and linguistic paradigms, highlights hybrid research frameworks uniting symbolic and statistical approaches, and proposes ethical principles for responsible computational corpus inquiry. Practically, it underscores the urgency of interdisciplinary training that bridges linguistics and computer science, the development of interpretable and transparent machine learning models for linguistic research, and the equitable allocation of computational resources to support global linguistic diversity. The review concludes by outlining future research directions in explainable AI, multimodal corpus analysis, and decolonial perspectives on language technology development.

Published

2025-11-26