Thesis project topics

Contact me if you are interested in any of these (robv@itu.dk).

Unsupervised code-switch detection

Code-switching is the phenomenon of switching to another language within a single utterance. Many previous approaches to detecting code-switching have been evaluated on a variety of language pairs; however, they are all trained on annotated code-switched data.

To increase the usefulness of such a code-switch detector, the idea is to train a system on two monolingual datasets to predict language labels at the word level. An example of the desired output is shown below, followed by a sketch of a possible baseline:

@friend u perform besop apa tudey ?
un en en id id en ?
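As a starting point, one could train a word-level classifier on words drawn from the two monolingual datasets. Below is a minimal sketch of this idea, assuming hypothetical word lists en.txt and id.txt (one word per line); note that it does not yet handle usernames or punctuation (the un and ? labels above).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def load_words(path, label):
    # One word per line; every word gets the language label of its corpus.
    with open(path, encoding="utf-8") as corpus:
        return [(line.strip(), label) for line in corpus if line.strip()]

data = load_words("en.txt", "en") + load_words("id.txt", "id")
words, labels = zip(*data)

# Character n-grams capture sub-word cues (affixes, orthography) and
# let the model generalize to unseen or misspelled words like 'tudey'.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(words, labels)

for token in "@friend u perform besop apa tudey ?".split():
    print(token, model.predict([token])[0])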

Machine translation of social media data

While machine translation scores on standard benchmarks keep increasing, current models do not perform well on non-standard (e.g. social media) text. One solution to this problem would be to transform such data into a 'normal' form (e.g. by using MoNoise) before translating it, as sketched below.
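Below is a minimal sketch of this normalize-then-translate pipeline. The normalize() function is a hypothetical stand-in with a toy replacement lexicon; in practice it would wrap a normalization model such as MoNoise, and the normalized sentence would then be passed to an MT system.

def normalize(tokens):
    # Toy lexicon-based normalization; a real model (e.g. MoNoise) would
    # generate and rank candidate replacements in context.
    lexicon = {"u": "you", "c": "see", "2morrow": "tomorrow", "gr8": "great"}
    return [lexicon.get(token.lower(), token) for token in tokens]

tweet = "u will c it 2morrow , it will be gr8"
normalized = " ".join(normalize(tweet.split()))
print(normalized)  # 'you will see it tomorrow , it will be great'
# ... this normalized sentence is what would be sent to the MT system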

Dialogue act classification for social media data

Dialogue acts describe the intended goal of a textual utterance. They have mostly been studied in the context of telephone conversations. However, when sharing utterances on social media, people generally also have an intended goal (perhaps not always, so a MISC label should be considered).
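Below is a minimal sketch of utterance-level dialogue act classification; the label set (QUESTION, REQUEST, STATEMENT, MISC) and the training utterances are purely hypothetical placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled utterances; a real dataset would be far larger.
train = [
    ("what time does the show start ?", "QUESTION"),
    ("when is the next bus", "QUESTION"),
    ("please send me the link", "REQUEST"),
    ("could you retweet this", "REQUEST"),
    ("the weather is great today", "STATEMENT"),
    ("i just got home", "STATEMENT"),
    ("lol ok", "MISC"),
    ("hmmm", "MISC"),
]
texts, labels = zip(*train)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["when does the lecture start ?"]))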

Effect of sociodemographic factors on language use

Recent work has shown that including information about the origin of a text instance (for example, properties of its author) can improve performance on NLP tasks. However, it is unclear which specific sociodemographic attributes correlate with language use. Recent efforts on annotating social media data could give us more insights.
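As a minimal sketch of the modeling side, the snippet below combines textual features with one sociodemographic attribute in a single classifier. The attribute (an age group), the task, and the data are all hypothetical placeholders.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: text plus one sociodemographic attribute per instance.
data = pd.DataFrame({
    "text": ["gonna be late lol", "I will arrive slightly later",
             "omg this is sooo good", "This is a remarkable result"],
    "age_group": ["<30", "30+", "<30", "30+"],
    "label": ["informal", "formal", "informal", "formal"],
})

# Encode the text and the metadata separately, then concatenate the
# feature vectors before classification.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),
    ("meta", OneHotEncoder(), ["age_group"]),
])
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(data[["text", "age_group"]], data["label"])
print(model.predict(pd.DataFrame({"text": ["c u soon"], "age_group": ["<30"]})))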

Low-resource dependency parsing

Dependency parsing is the task of finding the syntactic relations between the words in a sentence. For more information I refer to the Universal Dependencies (UD) page, which contains data annotated for this task in over 70 languages.

Very high scores have been obtained for this task (> 95%); however, these results all come from supervised parsers, which are trained on large amounts of annotated data. I am generally interested in doing natural language processing for less-resourced situations (languages/domains). Yes, this topic is less concrete than the previous ones, so contact me for more concrete ideas. A sketch of the standard evaluation is shown below.
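The scores above refer to attachment scores; the sketch below shows how the standard metrics (unlabeled and labeled attachment score, UAS/LAS) are computed from two CoNLL-U files. The paths gold.conllu and pred.conllu are hypothetical.

def read_conllu(path):
    # Returns, per sentence, a list of (head, deprel) pairs.
    sents, sent = [], []
    with open(path, encoding="utf-8") as conllu:
        for line in conllu:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    sents.append(sent)
                sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():  # skip multi-word tokens and empty nodes
                    sent.append((cols[6], cols[7]))
    if sent:
        sents.append(sent)
    return sents

def attachment_scores(gold, pred):
    total = uas = las = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (g_head, g_rel), (p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            uas += g_head == p_head
            las += g_head == p_head and g_rel == p_rel
    return uas / total, las / total

uas, las = attachment_scores(read_conllu("gold.conllu"), read_conllu("pred.conllu"))
print(f"UAS: {uas:.2%}, LAS: {las:.2%}")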

Contextualized representations for word graphs

Be aware that this is a complex topic.

The goal is to use contextualized embeddings (e.g. BERT) for word graphs; this can be useful in multiple settings.
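As a starting point, the sketch below extracts per-subword contextualized vectors with the Hugging Face transformers library (bert-base-uncased is just an example model). How to aggregate such vectors over the nodes of a word graph, rather than a single linear sentence, is the open part of this topic.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("new pix coming 2morrow", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per subword token (including [CLS] and [SEP]); these are the
# vectors one could attach to nodes of a word graph.
print(inputs.tokens())                   # the subword segmentation
print(outputs.last_hidden_state.shape)   # (1, num_subwords, 768)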