Machine Translation from English to Odia.
| Date | Version | Author | Change details |
| --- | --- | --- | --- |
| 22nd Sept’19 | 0.01 | Soumendra Kumar Sahoo | Initial draft created |
This document explains the various pipeline flows for training and querying the English-to-Odia translation system. I have tried to explain the concepts at as low a level as possible, so that a non-expert technical person can follow them.
The entire pipeline can be divided into two parts:
Inserting parallel pairs to create the Machine Learning model. You can check out the following diagram for the entire flow of the Ingestion pipeline.
This part deals with collecting data from multiple sources. The data can arrive as already-processed parallel pairs or in raw format. At a high level, let us divide the process into three parts:
Agents are bots that run continuously (24x7) or for a specific period of time to find prospective parallel pairs. There can be many types of Agents:
It may consist of the following steps:
These are cases where we get the data from a specific location that needs to be accessed only for a short period, until we collect the entire data set. This may be a one-time activity, or the source data may have a long refresh period. For example:
Detecting such data sets is a completely manual process; there is no need to automate it. Perhaps for the Wikipedia dump we can run an Agent on a weekly schedule to fetch only the incremental data.
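As a minimal sketch of the weekly incremental fetch idea, the Agent could compare each article's last-modified timestamp against the previous run and keep only what changed. The record shape and field names here are assumptions; the real dump metadata may differ.

```python
import datetime

def incremental_filter(articles, last_run):
    """Keep only articles modified since the previous weekly run.

    `articles` is a list of (title, modified_datetime) records; in the real
    pipeline these would come from the Wikipedia dump metadata (assumption).
    """
    return [a for a in articles if a[1] > last_run]

last_run = datetime.datetime(2019, 9, 15)
articles = [
    ("Odisha", datetime.datetime(2019, 9, 20)),   # changed after last run
    ("Cuttack", datetime.datetime(2019, 9, 10)),  # unchanged since last run
]
fresh = incremental_filter(articles, last_run)
```

Only the records changed after the last run are passed downstream, so each weekly cycle processes a small delta rather than the full dump.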
The flow will be:
That’s it; the Data Collector’s work is done.
Other than the above two methods of collecting data, we will need help from the community to prepare data manually. We may need data in the following categories:
These are not distinct categories: parallel pairs are mandatory, while the POS, NER, and Domain annotations are all optional. If the community can provide pairs with all of POS, NER, and Domain, that data will be treated as pure gold.
The format in which to collect the POS, NER, and Domain information is yet to be decided.
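Since the collection format is still open, here is one hypothetical shape such a record could take — the field names (`pos_tags`, `ner_tags`, `domain`) are purely illustrative, not a decided schema:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ParallelPair:
    """One contributed pair; only the two sentences are mandatory.

    All field names are assumptions -- the actual schema is not yet fixed.
    """
    english: str
    odia: str
    pos_tags: Optional[List[Tuple[str, str]]] = None   # (token, POS) pairs
    ner_tags: Optional[List[Tuple[str, str]]] = None   # (token, entity) pairs
    domain: Optional[str] = None

    def is_gold(self) -> bool:
        """'Pure gold': the pair plus all three optional annotations."""
        return all(v is not None for v in (self.pos_tags, self.ner_tags, self.domain))

plain = ParallelPair("Hello", "ନମସ୍କାର")
gold = ParallelPair("Hello", "ନମସ୍କାର",
                    pos_tags=[("Hello", "UH")],
                    ner_tags=[("Hello", "O")],
                    domain="greetings")
```

Keeping the annotations optional lets contributors submit bare pairs immediately, while fully annotated submissions can be flagged as gold data.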
The data we receive in the Collection phase is most likely not in the desired format and needs to be processed, with additional metadata, to bring the collections into a standard format. The methods to convert raw data into that standard format run at this stage.
There are many crucial steps:
Let us go through these steps individually:
Licensing is crucial in open-source projects and should not be taken lightly; if it is not handled carefully, we may face legal obligations.
The data we receive needs to go through a set of cleaning steps before going further.
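A minimal sketch of what such a cleaning pass could look like — the specific steps (entity unescaping, tag stripping, Unicode normalization, whitespace collapsing) are my assumptions about typical cleaning, not a fixed list from this document:

```python
import html
import re
import unicodedata

def clean_sentence(text: str) -> str:
    """Basic cleaning applied to each side of a pair (illustrative steps)."""
    text = html.unescape(text)                 # decode HTML entities like &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # strip leftover markup tags
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form (matters for Odia script)
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

cleaned = clean_sentence("  <b>Hello&amp;  world </b> ")
```

NFC normalization is worth calling out for Odia: the same visible syllable can be encoded as different code-point sequences, and normalizing early prevents spurious duplicates later in the pipeline.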
Alignment of the pairs is the crucial part of this process; an entire codebase can be written just to align pairs. We have thousands of corpora lying around unusable due to this alignment issue. The alignment can be done based on the morphology of the original text.
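The document does not fix an alignment algorithm, so as a sketch of the simplest starting point: walk the two sentence lists in step and keep pairs whose character-length ratio is plausible. Real aligners (for example the Gale-Church length-based algorithm) also handle 1-2 and 2-1 sentence merges; this only shows the core intuition.

```python
def align_by_length(en_sents, od_sents, max_ratio=2.5):
    """Naive 1-to-1 aligner: pair sentences positionally and keep only
    those whose character-length ratio looks plausible.
    `max_ratio` is an assumed threshold, not a tuned value.
    """
    aligned = []
    for en, od in zip(en_sents, od_sents):
        longer, shorter = max(len(en), len(od)), min(len(en), len(od))
        if shorter and longer / shorter <= max_ratio:
            aligned.append((en, od))
    return aligned

pairs = align_by_length(
    ["I am going home.", "Ok"],
    ["ମୁଁ ଘରକୁ ଯାଉଛି।", "ଠିକ୍ ଅଛି, ଆମେ କାଲି ସକାଳେ ଆସିବୁ।"],
)
```

The second pair is rejected because a two-character English side cannot plausibly translate to a long Odia sentence; that is exactly the kind of drift that makes unaligned corpora unusable.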
Odia does not have any free, open-source automatic POS tagger or NER tagger. A word2vec model exists, though I still need to test it. Combining these features will significantly improve translation accuracy.
For this reason, I have asked the community for this information in the Data Collection phase.
After all these steps, a threshold will be set to assess the validity and uniqueness of each pair. The threshold can be based on:
Any pair that fails any of the above conditions should be filtered out.
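As a sketch of this filtering step — the concrete thresholds (token-count bounds, length-ratio limit, case-insensitive duplicate check) are placeholders I chose for illustration, since the document leaves them open:

```python
def filter_pairs(pairs, min_len=1, max_len=80, max_ratio=3.0):
    """Drop duplicates and pairs failing simple validity thresholds."""
    seen = set()
    kept = []
    for en, od in pairs:
        en_tok, od_tok = en.split(), od.split()
        # validity: both sides within an acceptable token-count range
        if not (min_len <= len(en_tok) <= max_len and min_len <= len(od_tok) <= max_len):
            continue
        # validity: token-count ratio must be plausible
        if max(len(en_tok), len(od_tok)) / min(len(en_tok), len(od_tok)) > max_ratio:
            continue
        # uniqueness: drop exact duplicates (English side case-insensitive)
        key = (en.lower(), od)
        if key in seen:
            continue
        seen.add(key)
        kept.append((en, od))
    return kept

result = filter_pairs([
    ("Hello", "ନମସ୍କାର"),
    ("Hello", "ନମସ୍କାର"),   # duplicate -> dropped
    ("", "ଧନ୍ୟବାଦ"),        # empty English side -> dropped
])
```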
We cannot build a Swiss Army knife where any kind of translation is handled by one generic model. We have to find our niche domain; for the initial stage, it may be the domain in which we get the maximum number of pairs.
This concept is critical and needs to be understood at an early stage. You cannot train with agriculture data and then test on medical terms; the results will be poor.
That is why MT models that grow quickly initially pick a specific sector and specialize in it first. If we want to specialize in generic “Hi, Bye” phrases, we need to weed out the other domain-specific data pairs.
For this reason, domain-based classification during the processing phase is critical, both for the MT model and for the business.
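A toy sketch of how domain tagging could start before any trained classifier exists: match each pair’s English side against seed keyword lists. The domains and keywords below are hypothetical examples, and a real system would replace this with a proper text classifier.

```python
# Hypothetical seed keywords per domain; purely illustrative.
DOMAIN_KEYWORDS = {
    "agriculture": {"crop", "harvest", "soil", "farmer"},
    "medical": {"doctor", "patient", "dose", "symptom"},
    "generic": {"hello", "bye", "thanks", "please"},
}

def guess_domain(english_sentence: str) -> str:
    """Tag a pair with the domain whose seed keywords it matches most."""
    words = set(english_sentence.lower().split())
    best, best_hits = "unknown", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = domain, hits
    return best

d1 = guess_domain("The farmer checked the soil before the harvest")
d2 = guess_domain("The patient asked the doctor about the dose")
```

Even a crude tag like this lets the training stage keep agriculture pairs out of a medical model, which is the whole point of the section above.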
We need to classify the pairs into words, phrases and sentences.
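This classification can be sketched with simple heuristics — the cut-offs used here (one token means a word; terminal punctuation, including the Odia danda, means a sentence) are assumptions for illustration:

```python
def classify_granularity(english: str) -> str:
    """Bucket a pair by its English side: word, phrase, or sentence."""
    tokens = english.split()
    if len(tokens) == 1:
        return "word"
    # Terminal punctuation suggests a full sentence; "।" is the danda
    # used in Odia text, included in case the heuristic runs on that side.
    if english.rstrip().endswith((".", "?", "!", "।")):
        return "sentence"
    return "phrase"

kinds = [classify_granularity(s) for s in ["water", "drinking water", "I drink water."]]
```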
Convert the data to the standard format, making any changes needed before it is used further in the training process.
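Since the standard format itself is not specified in this document, here is one placeholder possibility — one tab-separated pair per line, a common shape for parallel corpora:

```python
import csv
import io

def to_standard_tsv(pairs) -> str:
    """Serialize pairs as one TSV line each -- a placeholder 'standard
    format'; the project has not fixed the real one yet.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for en, od in pairs:
        writer.writerow([en, od])
    return buf.getvalue()

tsv = to_standard_tsv([("Hello", "ନମସ୍କାର")])
```

Whatever format is finally chosen, funneling every source through one serializer like this keeps the training stage ignorant of where each pair came from.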
Checking:
This Website’s documentation work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.