ISSUE: Philology. Theory & Practice. 2024. Volume 17. Issue 4
COLLECTION: Applied Linguistics

Using machine learning for the topic annotation of oral speech corpus texts

Elena Nikolaevna Pogodaeva
Tomsk State University


Submitted: April 25, 2024
Abstract. The research aims to determine how effective the thesaurus method is for forming a list of topic classes when machine learning is used for the topic classification of sociolinguistic interview texts. The paper considers the potential of machine learning for the topic annotation of linguistic corpus materials. The analyzed material is polytopical because it belongs to the genre of dialogical speech. The hierarchical structure of the topics, identified through a preliminary introspective analysis of the texts, can be described using a thesaurus. The paper discusses the results of applying an unsupervised machine learning method with two sets of topic class names: the list of topics used in manual text annotation and an extended list of micro-topics whose names were selected from a Russian-language thesaurus. The paper is novel in that it is the first to propose the thesaurus method for selecting topic labels for the zero-shot classification of weakly structured Russian texts. The findings show that a more detailed lexical description of topic classes improves the classification result.
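As a rough illustration of the idea summarized above, the sketch below shows one way scores for micro-topic labels could be rolled up to the broader topics of a two-level thesaurus. Everything in it is an assumption made for illustration: the topic and micro-topic names, the score values (a stand-in for the output of some zero-shot classifier), the max-based aggregation, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical two-level thesaurus: parent topic -> micro-topic labels.
# These names are illustrative only and do not come from the paper.
THESAURUS = {
    "family": ["parents", "children", "marriage"],
    "work": ["profession", "salary", "colleagues"],
}


def aggregate_to_topics(micro_scores, thesaurus):
    """Roll micro-topic scores up to parent topics by taking the maximum."""
    # Invert the thesaurus so each micro-topic points to its parent topic.
    parent_of = {
        micro: topic
        for topic, micros in thesaurus.items()
        for micro in micros
    }
    topic_scores = defaultdict(float)
    for label, score in micro_scores.items():
        topic = parent_of.get(label)
        if topic is None:
            continue  # label is not in the thesaurus; skip it
        topic_scores[topic] = max(topic_scores[topic], score)
    return dict(topic_scores)


def top_topics(topic_scores, threshold=0.5):
    """Return every topic at or above the threshold, best first.

    Several topics may be returned, since one dialogue fragment
    can touch on more than one topic.
    """
    selected = [t for t, s in topic_scores.items() if s >= threshold]
    return sorted(selected, key=lambda t: -topic_scores[t])


# Assumed stand-in for the scores a zero-shot classifier might assign
# to micro-topic labels for a single interview fragment.
example_scores = {
    "children": 0.81,
    "salary": 0.64,
    "marriage": 0.30,
    "colleagues": 0.12,
}

topic_scores = aggregate_to_topics(example_scores, THESAURUS)
selected_topics = top_topics(topic_scores)
```

Taking the maximum over micro-topics is only one possible design choice; summing or averaging the scores would weight broad topics with many micro-topics differently.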
Key words and phrases:
linguistic corpus
machine learning
topic classification
data annotation
dialogical speech


© 2006-2024 GRAMOTA Publishing