Thesis project topics.
Contact me if you are interested in any of these (firstname.lastname@example.org).
Unsupervised code-switch detection
Code-switching is the phenomenon of switching to another language within a single utterance. Previous approaches have been evaluated for a variety of language pairs; however, they are all trained on annotated code-switched data.
To make such a code-switch detector more widely applicable, the idea is to train a system on two monolingual datasets only, and predict language labels on the word level. Some relevant papers:
- Code-Mixing in Social Media Text
- Overview for the First Shared Task on Language Identification in Code-Switched Data
- Overview for the Second Shared Task on Language Identification in Code-Switched Data
- Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval
- Overview of the Mixed Script Information Retrieval (MSIR) at FIRE-2016
- A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
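A minimal sketch of the idea, under a strong simplifying assumption: each word is labeled with the language in which it is (relatively) more frequent, estimated from the two monolingual corpora with add-one smoothing. The corpora and language names below are toy examples, not part of any existing system.

```python
from collections import Counter

def train_language_id(corpus_a, corpus_b):
    """Estimate per-word language labels from two monolingual corpora."""
    counts_a = Counter(w.lower() for sent in corpus_a for w in sent.split())
    counts_b = Counter(w.lower() for sent in corpus_b for w in sent.split())
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())

    def label(word):
        # Add-one smoothing so unseen words do not get zero probability.
        pa = (counts_a[word.lower()] + 1) / (total_a + len(counts_a) + 1)
        pb = (counts_b[word.lower()] + 1) / (total_b + len(counts_b) + 1)
        return "lang1" if pa >= pb else "lang2"

    return label

# Toy monolingual data (lang1 = English, lang2 = Spanish).
english = ["i want a sandwich", "do you want one"]
spanish = ["yo quiero un bocadillo", "quieres uno"]
label = train_language_id(english, spanish)

# Word-level labels for a code-switched utterance:
print([(w, label(w)) for w in "yo quiero a sandwich".split()])
```

A real system would of course need context (neighboring labels), subword features for unseen words, and a way to handle shared vocabulary, but this shows the desired input/output behavior.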
Machine translation of social media data
While scores on standard machine translation benchmarks keep improving, current models do not perform well on non-standard (e.g. social media) text. One solution to this problem would be to transform this data into a 'normal' form (e.g. by using MoNoise) before translating it.
- MTNT: a testbed for MT of noisy text
- A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation
- Microblogs as Parallel Corpora
- Findings of the First Shared Task on Machine Translation Robustness
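The normalize-then-translate pipeline can be sketched as below. Both components here are stand-ins: `NORM_LEXICON` is a toy normalization dictionary (a real normalizer such as MoNoise is context-sensitive), and `translate` is a three-word dictionary standing in for an actual MT system.

```python
# Toy normalization dictionary: non-standard token -> canonical form.
NORM_LEXICON = {"u": "you", "r": "are", "gr8": "great", "2morrow": "tomorrow"}

def normalize(sentence):
    # Replace known non-standard tokens by their canonical form.
    return " ".join(NORM_LEXICON.get(tok.lower(), tok) for tok in sentence.split())

def translate(sentence):
    # Placeholder word-by-word "MT system" (English -> German).
    lexicon = {"you": "du", "are": "bist", "great": "toll"}
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())

noisy = "u r gr8"
print(translate(noisy))             # the MT lexicon matches nothing
print(translate(normalize(noisy)))  # after normalization it applies
```

The point of the sketch: without normalization the translation model never sees vocabulary it was trained on, so even a perfect MT system fails on the raw input.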
Dialogue act classification for social media data
Dialogue acts classify the intended goal of a textual utterance. They have mostly been studied in the context of telephone conversations. However, when sharing utterances on social media, people generally also have an intended goal (though perhaps not always, so a MISC class should be considered).
Some resources on Dialogue Acts:
- Definition of the task
- John Langshaw Austin. How to Do Things with Words
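To make the task concrete, here is a toy rule-based classifier. The label set (QUESTION, REQUEST, STATEMENT, MISC) is purely illustrative; real inventories such as DAMSL are far more fine-grained, and any learned model would replace these hand-written rules.

```python
def dialogue_act(utterance):
    """Toy rule-based dialogue act classifier (illustrative label set)."""
    text = utterance.strip().lower()
    if not text:
        # Fallback class for utterances without a clear intended goal.
        return "MISC"
    first = text.split()[0]
    if text.startswith(("please", "could you", "can you")):
        return "REQUEST"
    if text.endswith("?") or first in {"who", "what", "when", "where", "why", "how"}:
        return "QUESTION"
    return "STATEMENT"

print(dialogue_act("What time is it?"))
print(dialogue_act("Could you retweet this"))
print(dialogue_act("I like this."))
```

Even this crude sketch shows why social media is hard: punctuation is unreliable, and many posts fit none of the classes, motivating the MISC fallback mentioned above.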
Effect of sociodemographic factors on language use
Recent work has shown that including information about the origin of a text instance can improve performance on NLP tasks. However, it is unclear which specific sociodemographic attributes correlate with language use. Recent efforts on annotating social media data could give us more insights.
- Women’s Syntactic Resilience and Men’s Grammatical Luck: Gender-Bias in Part-of-Speech Tagging and Dependency Parsing
- Gender Differences in English Syntax
- Cross-lingual syntactic variation over age and gender
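One simple way to probe such correlations is to pair an author attribute with a surface linguistic feature and compute a correlation coefficient. The data below is invented for illustration (hypothetical age-annotated posts, sentence length as the feature); a real study would use an annotated corpus and richer features.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical annotated posts: (author age, text).
posts = [(16, "lol gr8"),
         (24, "see you at the station tomorrow"),
         (35, "the committee will reconvene after the summer break"),
         (52, "I would be delighted to attend the ceremony next week")]
ages = [age for age, _ in posts]
lengths = [len(text.split()) for _, text in posts]
print(round(pearson(ages, lengths), 2))
```

On this toy sample the correlation is strongly positive, but with four data points that means nothing; the sketch only shows the shape of the analysis.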
Low-resource dependency parsing
Dependency parsing is the task of finding the syntactic relations between words in a sentence. For more information I refer to the UD page, which contains data annotated for this task for over 70 languages.
Very high scores (> 95%) have been obtained for this task; however, these are all achieved by supervised parsers, which are trained on large amounts of annotated data. I am generally interested in doing natural language processing for less-resourced situations (languages/domains). Admittedly, this topic is less concrete than the previous ones; contact me for more concrete ideas. Some interesting recent work:
- A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages
- Cross-lingual Parsing with Polyglot Training and Multi-treebank Learning: A Faroese Case Study
- How to Parse Low-Resource Languages: Cross-Lingual Parsing, Target Language Annotation, or Both?
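The standard evaluation metrics behind the "> 95%" scores above are UAS and LAS (unlabeled/labeled attachment score): the fraction of words whose head, or head plus relation label, is predicted correctly. A minimal sketch, using 0-based head indices with -1 for the root:

```python
def attachment_scores(gold, pred):
    """UAS and LAS for one sentence.

    gold/pred: one (head, deprel) pair per word; head is the 0-based
    index of the head word, or -1 for the root.
    """
    n = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    return correct_heads / n, correct_both / n

# "the dog barks": "the" attaches to "dog", "dog" to "barks" (the root).
gold = [(1, "det"), (2, "nsubj"), (-1, "root")]
pred = [(1, "det"), (2, "obj"),   (-1, "root")]  # wrong label on "dog"
uas, las = attachment_scores(gold, pred)
print(uas, round(las, 2))  # all heads correct, one label wrong
```

In the low-resource setting the interesting question is how far these numbers drop when the training treebank is tiny or from another language, which is exactly what the papers above investigate.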
Contextualized representations for word graphs
Be aware that this is a complex topic.
The idea is to use contextualized embeddings (e.g. BERT) on word graphs. This can be done in multiple settings:
- Simplified (the number of nodes is known): http://robvandergoot.com/doc/acl17.pdf
- More advanced: parsing full word graphs, where insertion/deletion/splitting of words must also be handled.
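To illustrate the simplified setting (number of nodes known): a word graph holds candidate words per position, and a scorer ranks the paths through it. The bigram table below is a toy stand-in for a contextualized model such as BERT, which would score each candidate in context instead.

```python
from itertools import product

# Word graph over a noisy sentence: one candidate list per position.
graph = [["c", "see"], ["u", "you"], ["tmrw", "tomorrow"]]

# Toy bigram scores standing in for a contextualized scorer.
BIGRAMS = {("see", "you"): 0.9, ("you", "tomorrow"): 0.8}

def score(path):
    # Product of bigram scores; unseen bigrams get a small back-off value.
    s = 1.0
    for a, b in zip(path, path[1:]):
        s *= BIGRAMS.get((a, b), 0.1)
    return s

best = max(product(*graph), key=score)
print(" ".join(best))  # → see you tomorrow
```

The advanced setting would additionally let the graph itself grow or shrink (insertions, deletions, word splits), so paths of different lengths compete, which is where it gets genuinely complex.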