Home page | MT_Flow_Pipeline |
This project has been moved under OdiaNLP GitHub organization. For more details please visit: https://odianlp.github.io/
Machine translation from English to Odia language
Analysis so far:
Machine Translation (MT) dates back to the 1950s. Based on progress in this field, MT can be broadly categorized into the following types:
- Rule based MT (RBMT)
- Statistical MT (SMT)
- Neural MT (NMT)
- Hybrid MT (HMT)
NMT gives the best BLEU scores, followed by SMT and RBMT. As explained in this paper, for Indic languages SMT performs better (at least 10% higher) than RBMT.
Based on reading and analysis of some existing papers, and because our corpus is small, we will go ahead with SMT (Statistical Machine Translation) for the time being. As the corpus grows, we will start trying our luck with NMT (Neural Machine Translation).
High level Roadmap
This roadmap is based on my spare time and availability to work. With more help, we can deliver earlier.
- Analyze and study the existing resources available on the Internet
- Study the reference papers and experts in NMT and analyze their opinions
- Same as January, but concentrating more on state-of-the-art practices
- Parallel corpora generation
- Data ingestion pipeline
- Initial draft prepared
- Read existing papers on MT and write a summary
- Parallel corpora generation and data cleaning
Further progress can be seen at: https://github.com/OdiaNLP
The parallel corpus generation data has been moved to Odia Wikimedia. There will be no further work until we have at least 10k parallel sentence pairs (12k/10k achieved).
Detailed works completed in December 2018
- Found a corpus of around 27,000 English-Odia tab-separated translation pairs
This needs work on the following:
- Some preprocessing needs to be done.
- Not all translations are accurate, based on a few manual reviews.
- There are spelling mistakes.
- After reading an article on preprocessing an English-Hindi corpus for SMT, I gained insight into the following preprocessing tasks that need to be done first:
- Punctuation should NOT be removed
- English text should be lowercased, as Odia has no upper/lower case distinction
- Spell normalization (similar to lemmatization) will have a significant impact
- Mapping numbers to unique class labels is not very effective and can be left out
- We need NER (Named Entity Recognition) to find words for which transliteration is enough
- We also need POS tag data to improve accuracy
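The basic cleaning rules above can be sketched as a small Python function. This is a minimal illustration, not a fixed part of the pipeline: the function name and the tab-separated input format follow the corpus described above, while spell normalization, NER, and POS tagging are left out.

```python
def preprocess_pairs(lines):
    """Apply the basic cleaning rules to tab-separated English-Odia pairs:
    keep punctuation, lowercase only the English side (Odia has no case),
    and drop malformed or empty rows."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip rows that are not exactly "english<TAB>odia"
        en, od = parts[0].strip(), parts[1].strip()
        if not en or not od:
            continue  # skip pairs with an empty side
        pairs.append((en.lower(), od))  # punctuation is intentionally kept
    return pairs
```

Spell normalization and NER-based transliteration would run as later stages on the cleaned pairs.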
- The Moses MT toolkit is open source and ideal for developing SMT for our purpose.
- It needs a parallel corpus to train and build the model; after that, it can make predictions.
- Data curation, as explained in the point above, will be crucial for training this kind of system.
- If no better option is available, we will most likely go with this approach.
- Apart from this, it makes sense to keep an eye on Hybrid Machine Translation, which is a mixture of RBMT and SMT.
- There is not enough corpus available to go for NMT/SMT currently. We need to prepare a platform like the Google Translate Community where we can manually create English-Odia phrase or word pairs. For this, the following are needed:
- A UI hosted somewhere in the cloud that can handle at least 100 requests per minute
- A DB, also hosted in the cloud, to store the data provided by users
- Possible suggestions for new users to start translating
- A review mechanism for the translated terms; even with good intentions, grammatical or syntactical errors need to be validated
- This infrastructure needs to be hosted in the cloud and should be absolutely free to use.
- Unknown word handling: there will definitely be terms for which no Odia word exists. Some suggestions for handling unknown words:
- Transliterate those words to Odia
- Using another existing translation system, convert those words to Hindi/Sanskrit, then transliterate them to Odia. Most words are shared between Hindi/Sanskrit and Odia, so people can understand them.
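The fallback strategy above can be sketched as follows. Here `transliterate` is a placeholder for whatever English-to-Odia (or English-to-Hindi-to-Odia) transliteration routine is eventually chosen; the function names are assumptions for illustration, not an existing API.

```python
def translate_word(word, vocab, transliterate):
    """Return the Odia translation of `word` if it is known; otherwise
    fall back to transliteration so the word is not dropped.
    `vocab` maps English words to Odia; `transliterate` is any
    English-to-Odia transliteration function (placeholder here)."""
    if word in vocab:
        return vocab[word]
    return transliterate(word)
```

The same shape works for the two-step route: pass a `transliterate` that first translates the word to Hindi/Sanskrit and then transliterates that result to Odia script.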
Detailed works completed in January 2019
- Preparing this plan made last year pretty exciting. I hope this year we will be able to deliver something useful for the community.
- High-quality data is essential. I have therefore started preparing a phrase-level parallel corpus while contributing simultaneously to both Google and Facebook. How? The Google Translate Community and Facebook Translate recommend phrases to translate, and I keep a copy of the same phrases for myself.
- Microsoft has released an exciting resource that lets the public train their own parallel translation models, free of charge. Details:
- It needs more than 10,000 parallel sentence pairs
- It seems to be based on SMT
- Training, testing, and deployment pipelines are all included
- On my first attempt, I tried this tool with the GNOME translation pairs I had. However, it failed, possibly because there were only 60 parallel sentences.
- In the meantime there seems to be a lot of activity, with frequent papers on unsupervised NMT for low-resource languages, though nothing significantly usable has appeared yet. However, I am keeping an eye on that too.
- The Translator Hub was retired in May 2019 and migrated to Microsoft Custom Translator. However, Odia is currently unavailable there; I have requested that they add it.
Detailed works completed/ongoing in February 2019
- Data review and preparation started.
- Around 550 pairs have been reviewed so far. We need at least 10,000 pairs by the end of this month; I am looking for ways to automate this.
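One way to automate part of the review is to filter out obviously bad pairs before a human looks at them. A minimal sketch follows; the duplicate and length-ratio checks are my assumptions about what is worth auto-rejecting, not rules from the project.

```python
def prefilter_pairs(pairs, max_ratio=3.0):
    """Drop exact duplicate pairs and pairs whose character-length
    ratio is suspicious (likely truncated or mismatched translations),
    so reviewers only see plausible candidates."""
    seen = set()
    kept = []
    for en, od in pairs:
        if (en, od) in seen:
            continue  # exact duplicate
        seen.add((en, od))
        ratio = max(len(en), len(od)) / max(1, min(len(en), len(od)))
        if ratio > max_ratio:
            continue  # one side is far longer than the other
        kept.append((en, od))
    return kept
```

Pairs that pass these checks still need manual review; the filter only reduces the reviewers' workload.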
Data Ingestion pipeline
Six Challenges for Neural Machine Translation
- There are six challenges in NMT:
- Domain mismatch
- Amount of Training Data
- Rare Words
- Long sentences
- Word Alignment
- Beam Search
New pairs addition
Bible pairs from OdiEnCorp1.0 have been added to the consolidated corpus.
- Microsoft Azure ML pricing (for the basic version)
- It does not have GPU support
- Creating and publishing custom modules in the Azure ML designer is not available
Useful Open source libraries
- fast_align : aligns words between the two sides of a parallel corpus
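fast_align expects its input as one sentence pair per line, with the two sides separated by ` ||| `. A small helper to produce that format from (English, Odia) tuples, as a sketch:

```python
def to_fast_align(pairs):
    """Format (source, target) sentence pairs as fast_align input:
    one 'source ||| target' line per pair."""
    return "\n".join(f"{src} ||| {tgt}" for src, tgt in pairs)
```

The resulting string can be written to a file and passed to the `fast_align` binary.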
Data collected from:
Prospective data corpus
These are a few places where relevant data may be present; however, getting the data is not straightforward.
- EMILLE Project :
The Oriya written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Institute of Applied Language Sciences, Bhubaneshwar (approximately 2,730,000 words).
- Gyan Nidhi-TDIL : A million-page multilingual parallel text corpus in English and 11 Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Marathi, Malayalam, Oriya, Punjabi, Tamil & Telugu), based on Unicode encoding. The Gyan Nidhi corpus contains text in the form of books. These books contained a number of diagrams, figures, charts, and other special symbols, which were removed from the text using automated and manual tools. The text in Gyan Nidhi is in the form of paragraphs, which are converted into short sentences.
- A Gold Standard Odia Raw Text Corpus : Around 15,88,287 (about 1.59 million) words are in this corpus, which is available for purchase.
“In my dream of the 21st century for the State, I would have young men and women who put the interest of the State before them. They will have pride in themselves, confidence in themselves. They will not be at anybody’s mercy, except their own selves. By their brains, intelligence and capacity, they will recapture the history of Kalinga.” - Biju Pattnaik
This Website’s documentation work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.