Natural Language Processing (NLP) refers to the ability of a computer to understand, analyze, manipulate, and potentially generate human language. The ultimate aim of NLP is for machines to understand humans effectively.
If machines could understand human intent contextually and correctly, a broad spectrum of applications would open up. For example, it could make Alexa and Siri smarter, Google searches more relevant, and tools like Google Translate far more effective. These are just the surface applications of how useful NLP could be for humans going forward.
Image credit: Pxhere, CC0 Public Domain
While NLP offers broad applications, its effectiveness depends on the amount of training data fed into it. High-resource NLP refers to settings where abundant training data is available for deep learning systems; low-resource NLP refers to settings where available training data is limited. A low-resource setting may therefore indicate either a less popular language or a less common domain. The scarcity of training data makes low-resource NLP less effective.
NLP research in low-resource settings has gained a lot of interest in recent times, and various techniques have been proposed. Low-resource NLP is considered one of the four biggest open problems in NLP. Overcoming this challenge would improve the participation of speakers of low-resource languages in the digital world; greater participation would increase the amount of available data, which would in turn improve the effectiveness of NLP and further increase the community's digital participation.
Promising low-resource NLP techniques are discussed in the research survey by Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen & Dietrich Klakow that forms the basis of this text. The goal of this research, in the researchers' words, is to:
Explain how these methods differ in their requirements as understanding them is essential for choosing a technique suited for a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.
Low-resource NLP is addressed using the following techniques:
- Generating additional labelled data
  - Data augmentation: existing data is used to create more data, for example by replacing words with synonyms, simplifying sentences, or translating between active and passive voice. Each generated sentence keeps the label of the original.
  - Distant & weak supervision: unlabelled data is labelled automatically. For example, the system takes a list of locations (e.g. New York, London, Germany) from a dictionary, matches the names in the list against the unlabelled text, and automatically labels each match as a location.
  - Cross-lingual annotation projection: data in the low-resource language is projected into a high-resource language, and labels are obtained from a classifier in the high-resource language.
  - Learning with noisy labels: distant supervision can sometimes fetch inaccurate labels. Errors in this technique can be reduced by
    - Noise filtering: removing instances of data that have a high probability of being incorrectly labelled.
    - Noise modelling: using an additional classifier to assign a low weight to instances with a high probability of inaccurate labelling.
  - Non-expert support: non-native speakers provide annotations to the text manually.
- Transfer learning: learned representations from a high-resource language are transferred to a low-resource language. This technique reduces the dependency on labelled target data and the need for target supervision.
- Ideas from low-resource machine learning in non-NLP communities: machine learning and computer vision research offers useful insights and new ideas that can transfer to low-resource NLP.
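The data-augmentation idea above can be sketched in a few lines. This is a minimal, illustrative sketch using a toy hand-made synonym table; a real system might draw synonyms from WordNet or word embeddings, which is an assumption on my part rather than a method prescribed by the survey.

```python
import random

# Toy synonym table -- purely illustrative; a real pipeline would use
# a lexical resource such as WordNet or an embedding-based lookup.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
    "large": ["big", "huge"],
}

def augment(sentence, n_variants=2, seed=0):
    """Generate labelled variants of a sentence by swapping words
    for synonyms; each variant keeps the original sentence's label."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        new_words = [
            rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
            for w in words
        ]
        variants.append(" ".join(new_words))
    return variants

# One labelled sentence yields several extra labelled training examples.
print(augment("the quick dog is happy"))
```

Because the label is assumed to survive the substitution, every generated variant is a free additional training example.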
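Distant supervision with a location dictionary, as described above, can be sketched as a simple gazetteer match. The gazetteer and the BIO tagging scheme used here are my own illustrative assumptions; in practice the list would come from a larger resource such as a knowledge base.

```python
# Toy gazetteer of locations -- illustrative only; a real system would
# draw on a much larger dictionary or knowledge base.
LOCATIONS = {"new york", "london", "germany"}

def distant_label(tokens):
    """Automatically tag tokens that match the gazetteer as locations,
    using BIO labels (B-LOC = begin, I-LOC = inside, O = outside)."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2]).lower()
        if i + 1 < len(tokens) and bigram in LOCATIONS:
            labels[i], labels[i + 1] = "B-LOC", "I-LOC"
            i += 2
        elif tokens[i].lower() in LOCATIONS:
            labels[i] = "B-LOC"
            i += 1
        else:
            i += 1
    return labels

print(distant_label("She moved from London to New York".split()))
```

Every match in the unlabelled text becomes a (noisy) labelled example, which is exactly why the noise-handling techniques above are needed downstream.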
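Noise filtering, one of the two noise-handling options listed above, can be sketched as follows: a reference classifier scores each distantly labelled instance, and instances where the classifier confidently disagrees with the distant label are dropped. The stub classifier and its scores below are hypothetical, for illustration only.

```python
def filter_noisy(examples, predict_proba, threshold=0.9):
    """Drop examples where a reference classifier confidently
    disagrees with the distant label (a common filtering heuristic)."""
    kept = []
    for text, noisy_label in examples:
        probs = predict_proba(text)          # dict: label -> probability
        best = max(probs, key=probs.get)
        if best != noisy_label and probs[best] >= threshold:
            continue                         # likely mislabelled: filter out
        kept.append((text, noisy_label))
    return kept

# Stub standing in for a classifier trained on a small clean set;
# the scores are hypothetical, chosen to illustrate the heuristic.
def stub_proba(text):
    scores = {
        "Paris is lovely": {"LOC": 0.95, "O": 0.05},
        "Paris Hilton spoke": {"LOC": 0.02, "O": 0.98},
    }
    return scores[text]

noisy = [("Paris is lovely", "LOC"), ("Paris Hilton spoke", "LOC")]
print(filter_noisy(noisy, stub_proba))
```

Noise modelling takes the softer route: instead of discarding the second example outright, it would keep it with a low weight proportional to the classifier's confidence in the distant label.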
The survey by Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen & Dietrich Klakow discussed various NLP methods for low-resource settings. It aimed to give a broad and structured overview of the most effective existing low-resource NLP techniques, and it also presented the resources and data assumptions each method requires.
A detailed comparative study of the surveyed techniques is proposed as a promising future direction. Such a study would enable practitioners to understand how the different approaches complement each other and how they can be combined effectively for a better result.
Source: Michael A. Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen & Dietrich Klakow, "A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios"