Today, let's discuss Machine Translation, or MT, a little further.
There are two main approaches to MT technology.
One is Rule-Based Machine Translation, or RBMT.
The other is Statistics-Based Machine Translation, or SBMT.
RBMT is the theoretical approach.
It knows the grammar of both the source language and the target language.
First it analyzes the source sentence (say, in English): which part is the subject, verb, object, complement, and so on.
It can recognize a noun clause or an infinitive, so it builds a parse tree as the result of the analysis.
Then it translates each word, rearranges the words into the structure of the target language (say, Japanese), and outputs a plain Japanese sentence.
It's a logical approach, and it's what students do when they translate a foreign language.
The strong point of this approach is that you can translate any English sentence, as long as the sentence is clean and logical and the MT has a large enough dictionary.
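To make those steps concrete, here is a tiny Python sketch of the analyze, translate, and rebuild pipeline. It is only an illustration under heavy assumptions: the lexicon, the function names, and the single subject-verb-object rule are all made up for this post, and a real RBMT engine has thousands of rules.

```python
# Hypothetical toy lexicon: English word -> (part of speech, Japanese word).
# Articles map to an empty string because Japanese has no articles.
LEXICON = {
    "you": ("PRON", "あなた"),
    "i": ("PRON", "私"),
    "see": ("VERB", "見ます"),
    "eat": ("VERB", "食べます"),
    "rainbow": ("NOUN", "虹"),
    "apple": ("NOUN", "りんご"),
    "the": ("DET", ""),
    "an": ("DET", ""),
}

def analyze(sentence):
    """Step 1: tag each word and split the sentence into subject / verb / object."""
    words = sentence.lower().rstrip(".").split()
    tagged = [(w, *LEXICON[w]) for w in words]  # KeyError if a word is not in the dictionary
    verb_at = next(i for i, (_, pos, _) in enumerate(tagged) if pos == "VERB")
    return {
        "subject": tagged[:verb_at],
        "verb": tagged[verb_at],
        "object": tagged[verb_at + 1:],
    }

def transfer(tree):
    """Step 2: translate word by word, keeping the tree structure."""
    def to_ja(items):
        return "".join(ja for _, _, ja in items)
    return {
        "subject": to_ja(tree["subject"]),
        "verb": tree["verb"][2],
        "object": to_ja(tree["object"]),
    }

def generate(parts):
    """Step 3: rebuild in Japanese word order (subject, object, verb) with particles."""
    return f'{parts["subject"]}は{parts["object"]}を{parts["verb"]}。'

def translate(sentence):
    return generate(transfer(analyze(sentence)))

print(translate("You see the rainbow."))  # あなたは虹を見ます。
print(translate("I eat an apple."))       # 私はりんごを食べます。
```

The output is grammatical, but the first sentence keeps a literal あなた for "you", and any word missing from the lexicon stops the translation with an error.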
The weak points are as follows:
1) It translates literally.
For example, English books use "you" a lot.
- Sometimes you see the rainbow after the rain.
This kind of "you" is omitted in Japanese.
- 雨の後には時には虹が出ることがあります。
(This means something like "After the rain, a rainbow might be seen sometimes.")
"you" (あなた) sounds weird in a Japanese sentence like this, so the reader would notice that the sentence was translated by a machine.
2) It doesn't analyse the context.
For example, the word "right" has many meanings:
- Turn the corner to the right.
- You have a right to live.
- Do the right thing.
- Her political stance is right.
To understand which meaning is intended, the dictionary is not enough.
You have to guess the meaning from the context, and that is hard for RBMT.
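Here is a small Python sketch of why the dictionary alone is not enough. The sense list and the `pick_sense` function are hypothetical, made up for this post. The point is only that a lookup with no context rule returns the same Japanese word for all four sentences.

```python
# Hypothetical dictionary entry: every sense of "right" the lexicon knows about.
SENSES = {
    "right": [
        ("direction", "右"),
        ("entitlement", "権利"),
        ("correct", "正しい"),
        ("political", "右派"),
    ]
}

def pick_sense(word, sentence):
    """Context-free lookup: the sentence is ignored and the first sense always wins."""
    return SENSES[word][0]

SENTENCES = [
    "Turn the corner to the right.",
    "You have a right to live.",
    "Do the right thing.",
    "Her political stance is right.",
]

for s in SENTENCES:
    label, ja = pick_sense("right", s)
    print(f"{s} -> {ja} ({label})")
```

Every line comes out as 右 (direction), which is correct only for the first sentence. To fix the other three, someone has to write a rule for each context by hand.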
SBMT is the statistical approach.
You give the system a corpus (a huge collection of sentences paired between the source and target languages), and it learns what kind of source sentence matches what kind of target sentence.
For example, suppose you fed the following corpus to the SBMT machine.
e) Turn right to find the convenience store.
j) 右に曲がるとコンビニが見えます。
e) Turn left to find the gas station.
j) 左に曲がるとガソリンスタンドが見えます。
Then you ask the machine to translate the following sentence.
e) Turn right to find the gas station.
Then SBMT would probably give you the correct translation.
j) 右に曲がるとガソリンスタンドが見えます。
SBMT doesn't need to know which word is the subject or which is the verb.
It just has to analyze how the English and Japanese sentences match.
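The example above can be imitated with a few lines of Python. Please read this as a toy illustration of the intuition, not as a real statistical model: it compares the two sentence pairs, keeps the parts they share as a template, and pairs up the parts that differ. Real systems estimate alignment probabilities over huge corpora instead. The hand-segmented Japanese, the function names, and the assumption that the differing parts appear in the same order in both languages are all mine.

```python
from difflib import SequenceMatcher

# The two sentence pairs from the post. The Japanese side is segmented by hand
# (an assumption; a real system would run a tokenizer first).
CORPUS = [
    ("Turn right to find the convenience store .".split(),
     "右 に 曲がる と コンビニ が 見えます 。".split()),
    ("Turn left to find the gas station .".split(),
     "左 に 曲がる と ガソリンスタンド が 見えます 。".split()),
]

def split_template(a, b):
    """Split two token lists into a shared template plus the spans that differ."""
    template, slots_a, slots_b = [], [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])
        else:
            template.append(None)  # None marks a slot
            slots_a.append(tuple(a[i1:i2]))
            slots_b.append(tuple(b[j1:j2]))
    return template, slots_a, slots_b

def learn(corpus):
    """Build an English template, a Japanese template, and a phrase table."""
    (en1, ja1), (en2, ja2) = corpus
    en_tpl, en_s1, en_s2 = split_template(en1, en2)
    ja_tpl, ja_s1, ja_s2 = split_template(ja1, ja2)
    # Assumption: the n-th English slot matches the n-th Japanese slot.
    phrases = dict(zip(en_s1 + en_s2, ja_s1 + ja_s2))
    return en_tpl, ja_tpl, phrases

def translate(sentence, en_tpl, ja_tpl, phrases):
    tokens = sentence.split()
    fillers, pos = [], 0
    for k, item in enumerate(en_tpl):
        if item is not None:  # fixed word: must match exactly
            if pos >= len(tokens) or tokens[pos] != item:
                return None
            pos += 1
        else:  # slot: read words up to the next fixed word
            stop = en_tpl[k + 1] if k + 1 < len(en_tpl) else None
            start = pos
            while pos < len(tokens) and tokens[pos] != stop:
                pos += 1
            fillers.append(phrases.get(tuple(tokens[start:pos])))
    if pos != len(tokens) or None in fillers:
        return None  # sentence shape or phrase never seen in the corpus
    slots = iter(fillers)
    return "".join("".join(next(slots)) if t is None else t for t in ja_tpl)

model = learn(CORPUS)
print(translate("Turn right to find the gas station .", *model))
# 右に曲がるとガソリンスタンドが見えます。
print(translate("Turn right to find the library .", *model))
# None ("library" never appeared in the corpus)
```

No grammar is involved anywhere: the program never learns that "turn" is a verb. It also has nothing to say about a word it has never seen.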
The strong point is that (sometimes) it can produce quite fluent translations.
SBMT can learn from the corpus that Japanese omits "you".
If you want to use SBMT for a particular client, then you feed it ONLY that client's documents.
If your client is a map company, then the word "right" most likely means the direction.
So the more corpus you feed it, the wiser SBMT gets, and the better it can guess the meaning from context.
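The client-specific idea can be sketched as simple counting. The word pairs below are invented data for illustration, and I assume the English and Japanese words have already been aligned. The system just picks the translation it has seen most often in that client's documents.

```python
from collections import Counter, defaultdict

def train(aligned_pairs):
    """Count how often each English word was translated as each Japanese word."""
    table = defaultdict(Counter)
    for en, ja in aligned_pairs:
        table[en][ja] += 1
    return table

def best(table, en):
    """Return the most frequent translation and its relative frequency."""
    ja, count = table[en].most_common(1)[0]
    return ja, count / sum(table[en].values())

# Invented word-aligned data for two hypothetical clients.
map_client = [("right", "右")] * 3 + [("right", "権利")]
legal_client = [("right", "権利")] * 4 + [("right", "右")]

print(best(train(map_client), "right"))    # ('右', 0.75)
print(best(train(legal_client), "right"))  # ('権利', 0.8)
```

The same word gets a different translation depending on whose documents were fed in, which is why mixing clients' corpora can hurt the result.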
The weak point is that the quality gets worse if you don't have much corpus.
SBMT has no theory behind it, so it cannot guess the meaning of anything the corpus doesn't cover.
So both systems have pros and cons.
Now, you might be curious which system I would recommend.
I would recommend a third one: Hybrid Machine Translation.
That is what my company is using :)