Sunday, November 13, 2005


Here are some responses to the Vietnamese search problem that focus on the search engine and not the keyboard. I think this is an issue that anyone who is searching the internet needs to be aware of.

First, Andrew C. commented,

The key issue is that Google, like many web services, does not bother to normalize Unicode strings. Google seems to take it byte by byte. The result is that the Microsoft layout, compared to a precomposed (NFC) string or even an NFD string, produces different results. The W3C have released a draft version of part of their character model that tackles normalization.
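To make Andrew's point concrete, here is a minimal Python sketch using the standard `unicodedata` module. The choice of the letter "ệ" and the three ways of typing it are my own illustration, and the "mixed" form is only an assumption about what a keyboard layout might emit. It says nothing about how Google actually processes queries.

```python
import unicodedata

# One Vietnamese letter, "ệ", encoded three ways (illustrative choices):
precomposed = "\u1ec7"        # single code point, already in NFC
decomposed = "e\u0323\u0302"  # e + dot below + circumflex, already in NFD
mixed = "\u00ea\u0323"        # ê + combining dot below, neither NFC nor NFD

forms = [precomposed, decomposed, mixed]

# Compared byte by byte, all three strings are different.
raw_bytes = {s.encode("utf-8") for s in forms}

# After normalization, they collapse to a single string.
nfc = {unicodedata.normalize("NFC", s) for s in forms}
nfd = {unicodedata.normalize("NFD", s) for s in forms}

print(len(raw_bytes), len(nfc), len(nfd))
```

A search tool that compares raw bytes sees three unrelated strings here; one that normalizes first, to either form, sees one.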

Then Simon responded,

Actually Google makes the effort to normalise the search strings. For example, for Greek, Google knows about cases (does case mappings): ιστολόγιο, ΙΣΤΟΛΌΓΙΟ, ΙΣΤΟΛΟΓΙΟ, ιστολογιο and also can work irrespective of accents! This might come from the case mapping rules for Greek; when you capitalise words, the accents are often removed. For more, see

Then Andrew C. continued,

As Simon has indicated, Google has put a lot of work into some languages to optimise searching in those languages. But if you use a language they haven't optimised for, you tend to have problems. As far as I can tell, Google seems to operate on byte sequences rather than character sequences. One trap people fall into is the assumption that because Google has an interface translated into a language, then Google is a suitable search tool for that language.

Recently, I've been researching Khmer search engines. The Google interface has been translated into Khmer, but it doesn't seem to be possible to actually search successfully in Khmer Unicode, even though there are Khmer Unicode sites that have been indexed by Google.

I also know that I don't need accents to google in French. And this week I have been busily working away on my own little project on Andreas Müller (1630-1694). 'Muller', 'Müller' and 'Mueller' all give me the same search results. After a little testing it seems that the precomposed accents (acute, grave, circumflex and umlaut) are normalized. However, maybe not the combining diacritics or even precomposed letters with two diacritics. Hmm. I can't really say.
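For the curious, here is a rough sketch of how accent-insensitive and case-insensitive matching can be done in Python. The `fold` function is a hypothetical helper of my own, not anything Google has documented: it decomposes the text, drops the combining marks, and case-folds what is left.

```python
import unicodedata

def fold(text):
    """Hypothetical search key: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()

# Precomposed ü, u + combining diaeresis, and plain u all give the same key.
precomposed = fold("M\u00fcller")
combining = fold("Mu\u0308ller")
plain = fold("Muller")

# A precomposed letter carrying two diacritics (Vietnamese ệ) folds too.
two_marks = fold("\u1ec7")

# Simon's four Greek spellings collapse to a single key.
greek = {fold(w) for w in ["ιστολόγιο", "ΙΣΤΟΛΌΓΙΟ", "ΙΣΤΟΛΟΓΙΟ", "ιστολογιο"]}

# 'Mueller' is a transliteration, not an accent, so this folding misses it.
transliterated = fold("Mueller")

print(precomposed, combining, plain, two_marks, len(greek), transliterated)
```

Because the folding starts from the decomposed form, it treats precomposed letters and combining diacritics alike. Matching 'Mueller' to 'Müller' would need an extra, language-specific rule on top of this.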

However, here is another little problem - when I get to the page I want and use the edit:find feature, I have to be exact and use every little accent. I have to search the page using Muller, Müller and Mueller as separate searches. No normalization there! I wondered why all those pages gave me no results.
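A find feature that ignored accents could, in principle, compare folded words instead of exact ones. The sketch below is only an illustration under that assumption; `fold` and `find_words` are hypothetical names, the sample sentence is made up, and no browser is claimed to work this way.

```python
import string
import unicodedata

def fold(text):
    """Hypothetical key: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()

def find_words(page, query):
    """Return words from the page, as written, whose folded form matches the query."""
    target = fold(query)
    hits = []
    for word in page.split():
        core = word.strip(string.punctuation)  # trim surrounding ASCII punctuation
        if core and fold(core) == target:
            hits.append(core)
    return hits

# Made-up sample text with three spellings of the same name.
page = "Andreas M\u00fcller (1630-1694), also written Muller or Mu\u0308ller."
hits = find_words(page, "muller")
print(hits)
```

One search for "muller" then finds the precomposed, unaccented, and combining spellings in a single pass, while 'Mueller' would still need its own search.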

Well, Müller is not going anywhere so I can catch up with him now.

Additional Comments:

On another topic altogether, I don't have time to quote and comment on the many great posts that I read. I assume that if they are in my sidebar people will find them eventually.

However, here are a few things worth mentioning. First, Andrew West has made his first post, Tibetan Extensions 1 : Astrological Pebble Symbols, on his new BabelStone blog. Then there is Lameen Souag's post on A comparative linguist of the 10th century, and finally the ongoing discussion of the Tel Zayit Alphabet on Language Log.

Update #1: See Mike's post for a more refined search engine experiment.

Update #2: See further comment here.


Anonymous said...

These search engines have *got* to start normalizing....


3:48 AM  
