About the Project

In this project, I combined software development, data science, and natural language processing (NLP). The project starts with data collection from several sources: a dictionary, a collection of short tales, and text digitized from documents through OCR. The collected text is then aligned at the sentence level so that each sentence stays matched with its counterpart.

Next, dictionary-based machine translation (DBMT) tables are generated to standardize the highly divergent spellings of Hunsrickisch found across the internet. The same tables are used in the Spell Checker interface. The text is then tokenized and analyzed for n-grams to break it down and identify frequent patterns, a step that is essential for improving translation quality and accuracy. The result is then calculated from the number of occurrences, relative position, and proportional relationship of the n-grams within a sentence.
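
The normalization and n-gram counting steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the table entries, the function names (tokenize, normalize, ngrams, ngram_counts), and the sample sentences are all assumptions made for the example, and the scoring by relative position and proportion is left out because its exact formula is not described here.

```python
import re
from collections import Counter

# Hypothetical variant -> standard spelling table.
# The entries are illustrative placeholders, not the project's real data.
NORMALIZATION_TABLE = {
    "hunsrik": "hunsrickisch",
    "hunsrick": "hunsrickisch",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def normalize(tokens, table=NORMALIZATION_TABLE):
    """Replace known spelling variants; leave unknown tokens unchanged."""
    return [table.get(token, token) for token in tokens]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n=2):
    """Count n-grams over a list of sentences after normalization."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(normalize(tokenize(sentence)), n))
    return counts


# Made-up sample sentences using two spelling variants.
sentences = ["Hunsrik is spoken here", "Hunsrick is spoken there"]
counts = ngram_counts(sentences, n=2)
```

Because both variants are mapped to one standard form before counting, the two sample sentences contribute to the same bigrams instead of splitting their counts across spellings.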

From a full-stack development perspective, the project integrates a user interface with backend systems that handle the NLP tasks. The application is deployed in the cloud for scalability and accessibility.

About the Language

Hunsrickisch, with an estimated 1.2 million speakers, is the second most spoken language in Brazil. It is a type of High German, brought by immigrants from the Rhineland-Palatinate region. Today it is spoken in the states of Rio Grande do Sul, Santa Catarina, Paraná, Mato Grosso do Sul, Mato Grosso, and Espírito Santo, as well as in Paraguay and the province of Misiones in Argentina. As it is falling into disuse among young people, it is an endangered language.

This project aims to provide greater access to a standardized written form of Hunsrickisch. This spelling follows the tradition of texts published in books, newspapers, and magazines over the two centuries of the language's existence by adopting the standard developed by Piter Kehoma Boll in his Dicionário Hunsriqueano Riograndense — Português (2014), adapted from the proposal of Cléo Vilson Altenhofen (2007).

About me