How Google Figured Out Khmer Translation

A screenshot of Google Translate page displaying the translation of "VOA" in both Khmer and English.

Editor’s note: Around Cambodian New Year last month, Google launched its online translation service for the Khmer language, making it the 66th language to be translatable on its service. Google says the launch is primarily aimed at making a vast amount of non-Khmer content on the Internet more accessible to Khmer speakers. Divon Lan, product manager of Google’s Next Wave Emerging Markets program, recently spoke to VOA Khmer's Sophat Soeung by phone to explain what it means for the average Cambodian. For the interview in Khmer, click here.

Full Interview: Divon Lan talks about Google Translate Khmer

You didn’t actually use translators to build this system. How did you actually build Khmer translation on Google?

Google Translate is actually machine translation. Basically, the way this works is we look at all the Khmer data that is out there, on the Web and so on. And we figure out automatically the language model. And that allows us to translate not only from English to Khmer but actually from any language to Khmer. Today there are 66 languages on Google Translate, so, for example, you can go to a Chinese website or a French website and get it translated to Khmer and understand what it says and vice versa. Foreigners from many countries can read Khmer text translated into their languages. Now bear in mind that the machine translation is still not at the level of human translation. If you use Google Translate, what we’re aiming for is that you will be able to get the general idea of what a piece of text says. It won’t be a word-to-word translation. That’s the downside.

http://www.youtube.com/embed/Rq1dow1vTHY

At VOA Khmer we have two language websites, one in English and one in Khmer. Would that contribute to Google Translation because they basically have the same content? Is that the idea?

That’s the idea. We look at all the content out there of that nature, the same material in two languages, across millions and millions of pages, and we try to figure out what the translation should be automatically, using algorithms rather than humans.
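To make the idea of learning from paired pages concrete, here is a minimal sketch in Python. It is not Google's method, which the interview does not describe in detail. It assumes a tiny, hand-aligned set of English and French sentence pairs (the `PAIRS` data and the `learn_lexicon` function are hypothetical, invented for illustration) and guesses word translations purely from how often words appear together, scored with a Dice coefficient.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus: each entry is one sentence in two languages.
PAIRS = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

def learn_lexicon(pairs):
    """Guess one target word per source word from co-occurrence counts."""
    src_count, tgt_count, cooc = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Count every (source word, target word) pair seen in the same sentence pair.
        cooc.update(product(s_words, t_words))

    best = {}
    for (s, t), c in cooc.items():
        # Dice score: high when the two words nearly always appear together.
        score = 2 * c / (src_count[s] + tgt_count[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}

lexicon = learn_lexicon(PAIRS)
print(lexicon["house"])  # maison
print(lexicon["car"])    # voiture
```

With only three sentence pairs the counts are enough to separate "house" from "car". Real systems work from millions of pages and use far richer alignment and language models, which is why the amount of available text matters so much.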

A Cambodian student uses Google's new Khmer online translation service between Khmer and French. Google Translate released Khmer as its 66th language on its online translation service around Cambodian New Year 2013. (Courtesy of Divon Lan)

Khmer is the 66th language on Google Translate. It’s actually the last language of the (Lower) Mekong region, even after Lao. Is that because of the complexity of the Khmer language that it was released later on?

Yes, it’s partly because of the complexity. It’s also partly because, to create this machine, the “translation language model,” we need a fairly large amount of text available out there on the web. And for Khmer, the amount of text is still small compared with other regional languages like Thai or Vietnamese, where the amount of content is much bigger. We wanted to make sure that the quality meets our launch level, which is basically that you’d be able to understand more or less what an article is about, although the translation is not perfect. The translation quality will improve over time: the more people use it and the more people suggest corrections, the better the quality will get.

Actually the one language we launched just prior to Khmer was Lao, as you’ve mentioned. That’s also one of our most recent launches. And these languages are very similar in many of the difficulties that we face in translating. One of them is the fact that in Khmer and in Lao and in Thai, you don’t use white spaces to have a gap between words. So one of the challenges is just looking at Khmer text and figuring out where the word boundaries are, and the same challenge exists in Lao as well.
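One simple way to picture the word-boundary problem is greedy longest-match segmentation against a word list. The Python sketch below is an illustration only and makes no claim about how Google Translate segments Khmer. The four-word `SAMPLE_WORDS` list and the `segment` function are assumptions made for the example, and the sample sentence means roughly "I study the Khmer language."

```python
# Hypothetical toy word list; a real lexicon would hold many thousands of entries.
SAMPLE_WORDS = ["ខ្ញុំ", "រៀន", "ភាសា", "ខ្មែរ"]
LEXICON = set(SAMPLE_WORDS)

def segment(text, lexicon=LEXICON):
    """Split unspaced text by repeatedly taking the longest known word."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until a known word fits.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone and move on.
            words.append(text[i])
            i += 1
    return words

sentence = "".join(SAMPLE_WORDS)  # written without spaces, as in normal Khmer text
print(segment(sentence))
```

On this sample the function recovers the four original words. The approach breaks down as soon as a word is missing from the list or two different splits are both possible, which is part of why the problem is hard for a computer.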

When you write in a language like English, you use a space between every word, whereas in Khmer the words are just stuck to each other. While a human reading Khmer can very easily tell the words apart, it’s actually quite difficult for a computer to understand where one word ends and a new word starts. One thing that makes it easier, by the way, is that Khmer has a unique script. So when we see Khmer letters in a document, we know for sure that’s Khmer, compared to Latin alphabets. Like you see a text, you’re not always sure if this is French or Italian or Spanish. It can be anything; they all use the same letters. For Khmer, if it is in Khmer, it’s Khmer. There’s no question. So that part makes it easy. And that’s the same for Lao and Thai.
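The point about unique scripts can be shown with a short Python sketch. Khmer, Lao and Thai each occupy their own block in the Unicode standard, so counting which block a text's characters fall into is enough to tell them apart. The `detect_script` function and the `SCRIPT_RANGES` table are names invented for this example, and it covers only the main block for each script.

```python
from collections import Counter

# Main Unicode blocks for the three scripts mentioned in the interview.
SCRIPT_RANGES = {
    "Khmer": (0x1780, 0x17FF),
    "Lao": (0x0E80, 0x0EFF),
    "Thai": (0x0E00, 0x0E7F),
}

def detect_script(text):
    """Return the script with the most characters in text, or None if none match."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    return counts.most_common(1)[0][0] if counts else None

print(detect_script("ភាសាខ្មែរ"))  # Khmer
print(detect_script("ภาษาไทย"))    # Thai
print(detect_script("bonjour"))    # None
```

The last call returns None because Latin letters alone do not say whether a text is French, Italian or Spanish. That case needs a statistical look at the words themselves.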

Divon Lan, product manager of Google’s Next Wave Emerging Markets program, in Phnom Penh, Cambodia. (Courtesy of Divon Lan)

How long did it actually take you to build the Khmer translation and what was the most challenging aspect of doing that?

It took us a few good months, maybe even a year. The challenge was really getting to a good enough level of quality, given that the amount of Khmer text out there on the web is still relatively small. An additional challenge is that we found people actually write Khmer words in many different ways. A lot of people don’t use the standard dictionary spelling of a word. They just write it phonetically, and so we see many variants of different words, which of course adds another interesting technical challenge for us.

Who did you envision as your audience?

Our audience is primarily Cambodian. What we’ve seen in Cambodia is that the young and educated people in Phnom Penh are all using the Internet. But if you think about that, that’s only 5 percent or so of the population of the country. You have 90 to 95 percent of the people that are not using the Internet. Now there are many reasons why people are not using the Internet, like the cost of devices and things like that. But one of the top reasons in Cambodia is language. Most Cambodians speak only Khmer, and I think it’s our duty as a technology industry, Google and other companies alike, to provide the world’s information to Cambodians in their language. The vast majority of content on the web is not in Khmer. It’s in English or other languages, and we think it’s critically important to give Khmer-speakers access to all of the world’s information, in their language. That’s the motivation here.

Some in our audience have asked if the Google Khmer translation is available on mobile.

Yes, it is.

A screenshot of Google Maps of Cambodia, displaying in the Khmer language on Friday, May 24, 2013.

Google Khmer was launched more than two weeks ago. What feedback have you gotten so far?

I think the feedback is extremely enthusiastic. I’m following the English-language media in Cambodia, and thanks to Google Translate I now can also follow the Khmer-language media, since I’m not a Khmer-speaker. I’m very excited to see the level of excitement out there. Obviously, there are also comments about the quality, which is to be expected, but I think everybody recognizes that this is really a big step for the Internet in Cambodia.

So this is the early beginning for Khmer Translation. What is the plan? What’s coming up related to this?

This is what we call an “Alpha version,” which means it is the very first, early version of the translation. We hope the quality will improve a lot. We’re investing more and more in Khmer. For example, Google Maps already shows place names in the Khmer language. This happened a few months ago. I can’t provide specific details on future plans, but I can say that we’re definitely investing in the Khmer language, because our objective is really to get the world’s information to Cambodians. So every Cambodian, whatever their background, wherever they are, whether or not they speak other languages, should be able to participate in this information revolution that we are in.

On a personal note, I’ll say that my wife is actually Khmer. And when I think about our mission in Cambodia—as Google and even as the technology industry—I always think about my mother-in-law, who is an intelligent, capable woman, but who, like most Cambodians, only speaks Khmer. And because of that, she is not able to access the Internet, access the information that’s out there. And it’s my personal mission to solve that problem. My mother-in-law should be able to use the Internet just like anybody else, in her own language.