Friday, November 11, 2005

Vietnamese Revisited

[Warning! This post suffers from inaccurate terminology. When I wrote different 'encoding', I really meant different 'character sequences.' ]

Last June I posted about how not to help someone do a google search in Vietnamese. It was one of my more frustrating experiences and I just walked away and forgot about it. However, Mike has made me think about it again. I went back and visited my post and saw what was wrong with it. So I am giving it another kick at the can.

I was asked to help a Vietnamese-speaking social worker do an internet search in Vietnamese. He said "You don't need the accents - just use the English keyboard." We tried that and got some hits. I didn't have a Vietnamese keyboard at that moment, so we went to VietDic and, using that, got an encoding for our search term in Vietnamese - many hits.

At home I tried the Microsoft Vietnamese keyboard. Not so many hits.

So once again here is my experiment from last June - updated.

First, I am using google:images results. The term bãi biển means beach. If I get pictures of beaches preferably in Vietnam I consider that a good hit.

Here is the test with terms displayed this time in Arial font.

1. VietDic site - bãi biển 654 hits

2. Microsoft Vietnamese keyboard - bãi biển 5 hits

3. Combining accents only - bai biên 473 hits (Not all beaches)

4. No accents - bai bien 207 hits (Not all beaches)

Okay, so I don't speak a word of Vietnamese, but it does seem that something is not right with the MS Vietnamese keyboard. There must be two different encodings that look identical and no normalisation in the search engine. If anyone can explain this I would be interested in hearing the story.

Update:

Mike says in the comment section that "individual standards that cannot represent other languages are an evolutionary blind alley -- as is deciding the best encoding for a language by measuring google hits! :-)"

I accept your point, Mike, I won't defend using google hits to prove anything. They are basically for fun. With an image I know I have beaches. With a different encoding I still have beaches but not the same beaches. I concede this point.

However, what is meant by "individual standards that cannot represent other languages are an evolutionary blind alley"? Maybe someone thinks that I am not using Unicode. I wouldn't know how not to use Unicode.

Here are my Unicode codepoints (upgraded from my comment section) for the 'ể' in biển.

#1 'ể' is one character U+1EC3 : LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE

#2 'ể' is two characters U+00EA : LATIN SMALL LETTER E WITH CIRCUMFLEX *and* U+0309 : COMBINING HOOK ABOVE

They are both Unicode, aren't they? But shouldn't one of them be the standard? Who sets the standard?

Update #2: Sorry, I have not been using the right terminology. So I hope nobody thinks that this is a techblog. Instead of saying 'encodings', which indicates Unicode and some other encoding standard, I should say two different 'character sequences'. (There is a paper about all this terminology, &c.)

And it turns out that the two sequences for 'ể' are canonically equivalent. Thanks Andrew C. However, the normalization that should occur for 'canonically equivalent character sequences' doesn't appear to work in either google or yahoo.
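For anyone who wants to check the equivalence themselves, here is a minimal sketch using Python's standard unicodedata module. The variable names are mine, and this only shows what normalization does to the two sequences; it says nothing about what google or yahoo do internally.

```python
import unicodedata

# Sequence #1: one precomposed code point.
precomposed = "\u1ec3"
# Sequence #2: e-with-circumflex followed by a combining hook above.
mixed = "\u00ea\u0309"

# List the code points and their official names for each sequence.
for label, text in (("#1", precomposed), ("#2", mixed)):
    names = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]
    print(label, names)

# As raw strings the two sequences are different.
print("raw strings equal:", precomposed == mixed)

# Normalized to the same form (NFC here), they become identical.
nfc_1 = unicodedata.normalize("NFC", precomposed)
nfc_2 = unicodedata.normalize("NFC", mixed)
print("equal after NFC:", nfc_1 == nfc_2)
```

A search engine that normalized both the indexed pages and the query to one form would treat the two spellings as the same word.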

Update #3: Two people have recommended alternate Vietnamese keyboards. Unikey and VPSkeys. Great! Thanks Michael for your description of Unikey.

Comments continue on Mike's post.

10 Comments:

Anonymous Tim May said...

Well, the clear difference I can see here is that in the VietDic output, the letters ã and ể are precomposed Unicode characters, whereas in the MS keyboard output they're multi-character sequences - letters followed by combining diacritical characters. These are probably supposed to be equivalent sequences for searches etc. according to Unicode - I forget the term used - but I guess Google isn't fully supporting that.

6:35 PM  
Blogger Simon said...

To verify the difference between the versions of the text, have a look at the encoded search URL.
For example,
http://images.google.com/images?q=b%C3%A3i+bi%E1%BB%83n
shows that this search uses precomposed characters (UTF-8 encoding). The "ể" is U+1EC3, which is in the Latin Extended Additional block.
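The same check can be scripted. This is a small sketch, assuming only the query string from the URL above; it decodes the percent-escapes as UTF-8 and reports whether any combining marks are present.

```python
import unicodedata
from urllib.parse import unquote_plus

# The q= parameter copied from the search URL above.
query = "b%C3%A3i+bi%E1%BB%83n"

# unquote_plus turns '+' into a space and decodes %XX escapes as UTF-8.
decoded = unquote_plus(query)

# Show every non-ASCII character with its code point and name.
for ch in decoded:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A nonzero combining class would mean a combining mark is present.
has_combining = any(unicodedata.combining(ch) for ch in decoded)
print("contains combining marks:", has_combining)
```

If the query had come from a keyboard that emits combining marks, the last line would report True instead.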

7:47 PM  
Anonymous Suz said...

Hi,

#1 is all precomposed using U+1EC3 : LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE

#2 is a mix using U+00EA : LATIN SMALL LETTER E WITH CIRCUMFLEX and
U+0309 : COMBINING HOOK ABOVE

But my question really is about whether there is a standard, and is one encoding going against the standard while the other one is standard?

9:26 PM  
Anonymous Suz said...

Hi Tim,

I just went back and reread your comment. I got half way through and quit to look at the codepoints myself.

So now that I understand the codepoints, you are saying that Google might be able to treat these as equivalent sequences for searching? Can that be done?

10:49 PM  
Anonymous Andrew said...

Microsoft's keyboard layout uses precomposed characters for the vowels and combining diacritics for the tone markers. Partly, I guess this is to preserve backwards compatibility with older non-Unicode Windows-1258 applications.

Either way, they are canonically equivalent.

The key issue is that Google, like many web services, does not bother to normalize Unicode strings. Google seems to take it byte by byte. The result is that the Microsoft layout, compared to a precomposed (NFC) string or even an NFD string, produces different results.
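To make the byte-by-byte point concrete, here is a rough sketch. The "keyboard" string is my assumption of what the Microsoft layout emits, built from the description above (precomposed vowel letters plus combining tone marks), and the two key functions are hypothetical stand-ins for whatever a search service really does.

```python
import unicodedata

# Fully precomposed form of the search term (NFC).
nfc = "b\u00e3i bi\u1ec3n"
# Assumed Microsoft-layout output: base vowels plus combining tone marks.
keyboard = "ba\u0303i bi\u00ea\u0309n"
# Fully decomposed form (NFD), derived from the NFC string.
nfd = unicodedata.normalize("NFD", nfc)

variants = {"NFC": nfc, "keyboard": keyboard, "NFD": nfd}


def naive_key(text: str) -> bytes:
    """Lookup key for a service that compares raw UTF-8 bytes."""
    return text.encode("utf-8")


def normalized_key(text: str) -> bytes:
    """Lookup key for a service that normalizes to NFC first."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


for label, text in variants.items():
    print(label, naive_key(text).hex(" "))

# Count how many distinct keys each strategy produces.
naive_keys = {naive_key(text) for text in variants.values()}
normalized_keys = {normalized_key(text) for text in variants.values()}
print("distinct raw keys:", len(naive_keys))
print("distinct normalized keys:", len(normalized_keys))
```

With raw bytes the three spellings land in three separate buckets; with normalization they share one.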

The W3C have released a draft version of part of their character model that tackles normalization. http://www.w3.org/TR/2004/WD-charmod-20040225/

To add to the confusion, there are non-Unicode Vietnamese sites still out there as well. So not only do you require 2 Unicode searches, there are also searches in VISCII, WIN-VNI, VPSWin and TCVN.

Have you tried the same search in Yahoo? I'd suspect different results there.

Andrew

12:43 AM  
Anonymous Suz said...

Hi Andrew,

Thanks for your very specific help, as always. I did just try yahoo now (image search) and even worse - I got no hits for #2 the Microsoft keyed term; but I did get 373 hits for #1 from the VietDic. And no I haven't tried any of the non-Unicode encodings.

So I guess neither yahoo nor google are handling equivalencies. However, if you say they are canonically equivalent that explains the keyboard. I will try to look up the paper on normalization some time since I am interested in other languages that might use it.

1:06 AM  
Blogger Michael Farris said...

Not exactly your comment, but for Vietnamese, I use a non-Microsoft keyboard called Unikey. It has several options; I use Unicode precomposed characters and Telex input, a Vietnamese system that takes a little getting used to. Here's a list of some words, with the input on the left, output in the middle and English gloss on the right.

vieejt Việt Vietnamese
ngwowfi người person
tooi tôi I
owr ở at
sawsp sắp imm. future marker
ddax đã past marker

The tone keys (f, s, r, x, j) can be typed either after the vowel or after all the segmental letters of the word have been typed. The latter method is probably better as it assigns the tone marker better in ambiguous cases (but I'm used to writing tone as I go along). It's much faster than when I inputted 100 or so pages of dictionary entries using keyboard shortcuts of my own devising in a floating accent system that I hate with a passion now (can you say awkward and time consuming and frustrating?)
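As a toy illustration of the idea, the word list above can be reproduced with a few lookup tables. This is only a sketch under simplifying assumptions: the function and table names are hypothetical, it is not how Unikey itself works, and it handles a tone key only when typed right after the vowel (not at the end of the word). It builds each toned vowel from a combining mark and then normalizes to NFC, so the output is precomposed.

```python
import unicodedata

# Telex tone keys mapped to Unicode combining marks.
TONES = {
    "f": "\u0300",  # grave
    "s": "\u0301",  # acute
    "r": "\u0309",  # hook above
    "x": "\u0303",  # tilde
    "j": "\u0323",  # dot below
}
# Doubled keys that produce a modified letter: aa, ee, oo, dd.
DOUBLES = {"aa": "\u00e2", "ee": "\u00ea", "oo": "\u00f4", "dd": "\u0111"}
# Letters that a following "w" turns into a breve or horn letter.
W_FORMS = {"a": "\u0103", "o": "\u01a1", "u": "\u01b0"}
# Untoned vowels that may receive a tone mark.
VOWELS = set("aeiouy\u00e2\u00ea\u00f4\u0103\u01a1\u01b0")


def telex_to_unicode(keys: str) -> str:
    """Convert a lowercase Telex key sequence to precomposed Vietnamese."""
    out = []
    for key in keys:
        prev = out[-1] if out else ""
        if prev + key in DOUBLES:
            out[-1] = DOUBLES[prev + key]
        elif key == "w":
            if prev in W_FORMS:
                out[-1] = W_FORMS[prev]
            else:
                # A "w" with nothing to modify stands for u-with-horn.
                out.append("\u01b0")
        elif key in TONES and prev in VOWELS:
            # Attach the combining tone mark, then compose to one code point.
            out[-1] = unicodedata.normalize("NFC", prev + TONES[key])
        else:
            out.append(key)
    return "".join(out)


for typed in ["vieejt", "ngwowfi", "tooi", "owr", "sawsp", "ddax"]:
    print(typed, telex_to_unicode(typed))
```

Because the sketch treats every f, s, r, x, j after a vowel as a tone key, it would mangle ordinary English words; a real input method needs a way to escape that.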

11:06 AM  
Anonymous Suz said...

Hi Michael,

That is a neat work around. I would think kinesthetically it would be so much easier and faster to use letters to type accents! Great. Unikey - I'll remember that.

11:32 AM  
Blogger Simon said...

Actually Google makes the effort to normalise the search strings.

For example, for Greek, Google knows about cases (does case mappings):
http://www.google.com/search?q=ιστολόγιο
http://www.google.com/search?q=ΙΣΤΟΛΌΓΙΟ
http://www.google.com/search?q=ΙΣΤΟΛΟΓΙΟ
http://www.google.com/search?q=ιστολογιο
and also can work irrespective of accents! This might come from the case mapping rules for Greek; when you capitalise words, the accents are often removed.

For more, see
http://www.unicode.org/reports/tr21/tr21-5.html
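One way to get that behaviour, sketched in Python: fold case, decompose, and drop the combining accents before comparing. The fold function is a hypothetical helper of mine and an assumption about the general technique, not a description of what Google actually does.

```python
import unicodedata


def fold(text: str) -> str:
    """Return a case-insensitive, accent-insensitive comparison key."""
    # casefold() handles case; NFD splits accented letters into base + mark.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Category "Mn" covers nonspacing marks such as the Greek tonos.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


# The four query spellings from the URLs above.
queries = ["ιστολόγιο", "ΙΣΤΟΛΌΓΙΟ", "ΙΣΤΟΛΟΓΙΟ", "ιστολογιο"]
keys = {fold(q) for q in queries}
print("distinct keys:", len(keys))
```

Note that the same folding applied to Vietnamese would erase the tone marks, which change the meaning of a word, so it is not a substitute for proper normalization there.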

2:17 PM  
Anonymous Andrew said...

As Simon has indicated, Google has put a lot of work into some languages to optimise searching in those languages.

But if you use a language they haven't optimised for, you tend to have problems.

As far as I can tell, Google seems to operate on byte sequences rather than character sequences.

One trap people fall into is the assumption that because Google has an interface translated into a language, then Google is a suitable search tool for that language.

Recently, I've been researching Khmer search engines. The Google interface has been translated into Khmer, but it doesn't seem to be possible to actually search successfully in Khmer Unicode, even though there are Khmer Unicode sites that have been indexed by Google.

3:36 AM  
