Machine translation from English to Odia language
Analysis so far:
Machine Translation (MT) research started as early as the 1950s. Based on the progress in this field, MT can be broadly categorized into the following types:
- Rule based MT (RBMT)
- Statistical MT (SMT)
- Neural MT (NMT)
- Hybrid MT (HMT)
NMT currently gives the best BLEU scores, followed by SMT and RBMT. As explained in this paper, for Indic languages SMT performs better than RBMT (at least 10% higher). Based on reading and analysis of some existing papers, and because the available corpus is small, we will go ahead with SMT (Statistical Machine Translation) first for the time being. As the corpus grows, we will start experimenting with NMT (Neural Machine Translation).
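Since the comparison above is stated in terms of BLEU, it may help to see roughly how the score is computed. The sketch below is a minimal, unsmoothed corpus-level BLEU that assumes a single reference per sentence and plain whitespace tokenization. The function names are illustrative and are not taken from any toolkit. For real evaluation an established implementation would be preferable, since tokenization and smoothing choices change the numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

For example, `corpus_bleu(["a b c d e"], ["a b c d f"])` gives roughly 0.67: the n-gram precisions are 4/5, 3/4, 2/3 and 1/2, and their geometric mean is the fourth root of 0.2.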
High level Roadmap
This roadmap is based on my spare time and availability to work. With more help, we can deliver earlier.
|Month|Year|Task|Status|
|---|---|---|---|
|December|2018|Analyze and study the existing resources available on Internet|Completed|
|January|2019|Study the reference papers and experts in NMT and analyze their opinions|Completed|
|February|2019|Do same as January, concentrate more on the state-of-the-art practices|Completed|
|Mar-Dec|2019|Parallel corpora generation| |
The parallel corpora generation work has been moved to Odia Wikimedia. There will be no further work here until we have at least 10k parallel pairs (3.8k of 10k achieved so far).
Detailed work completed in December 2018
- Found a corpus of around 27,000 English-Odia tab-separated translation pairs
This corpus needs work on the following:
- Some preprocessing needs to be done.
- Not all translations are accurate, based on a small manual review
- There are spelling mistakes
- After reading an article on preprocessing an English-Hindi corpus for SMT, I got insight into the preprocessing tasks that need to be done first:
- Punctuation should NOT be removed
- English text should be lowercased, as Odia has no upper or lower case
- Spelling normalization (which is similar to lemmatization) will have a greater impact
- Mapping numbers to unique class labels is not very effective and can be left out.
- We need NER (Named Entity Recognition) to find the words for which transliteration is enough
- We also need POS-tagged data to improve accuracy.
- Moses is an open source MT toolkit and is well suited to developing SMT for our purpose.
- It needs a parallel corpus to train and build the model; after that it can produce translations.
- Data curation, as explained in the point above, will be crucial for training this kind of system.
- If no better option is available, we will most likely go with this approach.
- Apart from this, I got a good sense that Hybrid Machine Translation, a mixture of RBMT and SMT, is worth keeping an eye on.
- There is not enough corpus available to go for NMT/SMT currently. We need to prepare a platform like the Google Translate Community where we can manually create English-Odia phrase or word pairs. For this, the following are needed:
- A UI hosted in the cloud that can handle at least 100 requests per minute
- A DB, also hosted in the cloud, to store the data provided by users
- Possible suggestions for new users to start translating
- A review mechanism for the translated terms; even well-intentioned contributions can contain grammatical or syntactical errors that need to be validated
- This infrastructure needs to be hosted in the cloud and should be absolutely free to use.
- Unknown word handling: there will definitely be terms for which no Odia word exists. These are some suggestions for handling unknown words:
- Transliterate those words to Odia
- Using some other existing translation system, translate those words to Hindi/Sanskrit and then transliterate them to Odia, because most words are the same between Hindi/Sanskrit and Odia and people can understand them.
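The second suggestion ends with a Devanagari-to-Odia transliteration step. A minimal sketch of that step is shown below, relying on the fact that Unicode lays out the Devanagari and Odia blocks in parallel, so most characters can be mapped by a fixed code point offset. The function name and the set of skipped slots are assumptions made for illustration. This is a script conversion only: it does not adapt spelling or pronunciation differences between the languages, and it does not cover the first suggestion (transliterating English words directly).

```python
import unicodedata

# Unicode lays out the Devanagari (U+0900) and Odia (U+0B00) blocks in parallel.
ODIA_OFFSET = 0x0B00 - 0x0900
# Slots where the two blocks hold unrelated characters; left untouched here.
MISMATCHED = {0x0955, 0x0956, 0x0957}


def devanagari_to_odia(text):
    """Replace Devanagari letters, signs and digits with their Odia counterparts."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0901 <= cp <= 0x096F and cp not in MISMATCHED:
            candidate = chr(cp + ODIA_OFFSET)
            # Unicode character names for Odia use the older spelling "ORIYA".
            # Unassigned slots have no name, so the original character is kept.
            if unicodedata.name(candidate, "").startswith("ORIYA"):
                ch = candidate
        out.append(ch)
    return "".join(out)
```

For example, `devanagari_to_odia("नमस्ते")` returns `ନମସ୍ତେ`. Characters with no Odia slot, such as the danda (।), pass through unchanged, which works out because Odia text uses the same danda character.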
Detailed work completed in January 2019
- Preparing this plan over the last year has been pretty exciting. Hopefully this year we will be able to deliver something useful for the community.
- High-quality data is essential. I have therefore started preparing a phrase-level parallel corpus while contributing to both Google and Facebook at the same time. How? Google Translate Community and Facebook Translate recommend phrases to translate, and I keep a copy of the same phrases for myself.
- Microsoft has made an exciting resource available to the public for training your own parallel translation model, and it is FREE. Details:
- It needs more than 10,000 parallel sentence pairs
- It seems to be based on SMT
- The whole pipeline for training, testing and deployment is provided
- As a first attempt, I tried this tool with the GNOME translation pairs I had. However, it failed, possibly because there were only 60 parallel sentences.
- In the meantime, a lot of work seems to be going on in unsupervised NMT for low-resource languages, with frequent papers, but there has not been any significant usable work yet. I am keeping an eye on that too.
- The Translator Hub was retired in May 2019 and has been migrated to Microsoft Custom Translator. However, the Odia language is currently unavailable there. I have requested that they add it.
Detailed work completed/ongoing in February 2019
- Data review and preparation started.
- Around 550 pairs reviewed so far. We need at least 10,000 pairs by the end of this month. Looking for ways to automate this.
- Get an open parallel corpus of at least 10,000 pairs for the Odia language to begin with.
- Verification of the existing corpus is badly needed.
- Moses does not run easily on Windows. An Ubuntu machine is needed to test it.
- Need a cloud system to host the manual translation API server and, in future, online translation. Could it be Microsoft?
- Apertium Wiki for Odia language
- Indic Languages Multilingual Parallel Corpus
- The RGNLP Machine Translation Systems for WAT 2018
- Anuvadaksh- An online existing English-Odia translator
- Wordnet for Odia
- RBMT vs SMT
- Detail MT system analysis of Indic languages
- English-Punjabi parallel corpus creation
- Creating more corpus by breaking long sentences
Useful Open source libraries
- fast_align: aligns words between the two sides of a parallel corpus
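fast_align reads one sentence pair per line, with the tokenized source and target separated by ` ||| `, and writes alignments as `i-j` index pairs. As a rough sketch, tab-separated English-Odia pairs could be converted to that input format as shown below. The helper names are hypothetical, and splitting on whitespace is only a stand-in for proper tokenization.

```python
def to_fast_align_line(english, odia):
    """Format one sentence pair as 'source ||| target', or None if a side is empty."""
    source = " ".join(english.split())  # whitespace tokenization only
    target = " ".join(odia.split())
    if not source or not target:
        return None
    return f"{source} ||| {target}"


def tsv_to_fast_align(tsv_lines):
    """Convert tab-separated 'english<TAB>odia' lines to fast_align input lines."""
    converted = []
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows
        formatted = to_fast_align_line(fields[0], fields[1])
        if formatted is not None:
            converted.append(formatted)
    return converted


def parse_alignment(line):
    """Parse fast_align output such as '0-0 1-2' into (source, target) index pairs."""
    return [tuple(int(index) for index in pair.split("-")) for pair in line.split()]
```

The resulting lines can be written to a text file and passed to fast_align with its `-i` option. `parse_alignment` then turns each output line back into index pairs for inspection.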
Data collected from:
- Wikipedia Data dump
- Open Parallel Corpus
- OdiEnCorp 1.0
- TDIL: technical strings, 52,000 pairs (data needs to be cleaned)
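Because the collected data still needs cleaning, a first pass could look like the sketch below. It follows the preprocessing notes from December 2018 (lowercase the English side, keep punctuation) and adds Unicode normalization, duplicate removal and a length-ratio filter. The function name, the ratio threshold of 3.0 and the filter itself are assumptions chosen for illustration, not settings taken from any of the sources.

```python
import unicodedata


def clean_pairs(pairs, max_length_ratio=3.0):
    """Yield cleaned (english, odia) pairs, dropping empty, mismatched and duplicate ones."""
    seen = set()
    for english, odia in pairs:
        # NFC normalization so identical text always uses the same code points;
        # split/join collapses repeated whitespace. Punctuation is left in place.
        english = " ".join(unicodedata.normalize("NFC", english).split()).lower()
        odia = " ".join(unicodedata.normalize("NFC", odia).split())
        if not english or not odia:
            continue  # one side is empty
        en_len, or_len = len(english.split()), len(odia.split())
        if max(en_len, or_len) / min(en_len, or_len) > max_length_ratio:
            continue  # lengths differ too much; likely a misaligned pair
        if (english, odia) in seen:
            continue  # exact duplicate after normalization
        seen.add((english, odia))
        yield english, odia
```

The length-ratio threshold is deliberately loose and would need tuning on real data, since short technical strings can legitimately differ a lot in word count between English and Odia.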
Prospective data corpus
These are a few places where relevant data may be present; however, getting the data is not straightforward.
- EMILLE Project : The Oriya written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Institute of Applied Language Sciences, Bhubaneshwar (approximately 2,730,000 words).
- Gyan Nidhi-TDIL : Million pages’ multilingual parallel text corpus in English and 11 Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Marathi, Malayalam, Oriya, Punjabi, Tamil & Telugu) based on Unicode encoding. The Gyan Nidhi corpus contains the text in the form of books. These books contained a number of diagrams, figures, charts and other special symbols, which were removed from the text using automated and manual tools. The text in Gyan Nidhi is in the form of paragraphs, which are converted into short sentences.
- Soumendra Kumar Sahoo
This Website’s documentation work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.