Saturday, November 05, 2005

Dialect Character Sets

I recently read this post on Mike Kaplan's blog. Martin Kochanski says,

Before Unicode was as widely used as it is now, users of languages with diacritics had to manage with ASCII (or, if they were lucky, with Latin-1) and whole dialects of character usage grew up as a result. This was especially the case with informal communications such as chats and bulletin boards.

To give the example I know best: Polish needs acute accents on c, s, and z, a dot on the z, tails under a and e, and a line through the lowercase "l", to mention just a few.

Sometimes the accents were left out when they could be inferred, and some adjustments were trivial (e.g. represent acute accent with a following apostrophe) but what was really inspiring was that people worked out that some letters that weren't used in Polish, such as q, v and x, could be co-opted and given consistent meanings in Polish completely unrelated to what they normally mean in Latin scripts: thus if x equalled z-dot (I can't remember whether this was one of the specific equivalences) then a Polish speaker would quickly learn to read x as z-dot without hesitation and to press the x key when he wanted to type z-dot.

The spontaneous evolution of such dialect character sets (the convergent evolution resulting from a strong selection towards mutual comprehensibility) has always struck me as a rather inspiring episode, because "bottom-up", driven by need, and not created by committees. The trouble is that once the need disappears, so do the dialects.

I'm hoping that someone somewhere is interested enough in the electronic equivalent of "oral history" to be able to capture and codify these ephemeral character sets before they are forgotten even by the people who used them; and it struck me that some of the people who read this blog might have an interest in this bit of history too.

Mike asks, "Now this is a fascinating topic, but one that I have to admit I know just about nothing about. Does anyone know of a place where knowledge of all these kinds of de facto standards might be kept?"


Blogger michael farris said...

I live in Poland (for over ten years now) and have never seen any alternate character sets for Polish.
Around 90-91 in the US my Polish teacher got an email newsletter in ASCII and sometimes gave me the homework job of adding in diacritics. Mostly not very hard even when I didn't know the words (very often).
IME, when Polish special characters aren't available, Polish people just use plain ASCII, period, so that the following forms ("wife" and "zone" respectively, in the nominative and instrumental cases)

żona, żoną, zona, zoną are all written zona.

It's really not that confusing (in moderation in context)
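As a rough illustration of that plain-ASCII habit, here is a minimal Python sketch that folds Polish text down to unaccented letters. The function name `strip_diacritics` and the `EXTRA` table are made up for this example; the only library used is the standard `unicodedata` module. Note that the barred l (ł) has no Unicode decomposition, so it needs its own mapping.

```python
import unicodedata

# Hypothetical helper table: letters that do not decompose into
# "base letter + combining mark" must be mapped by hand.
EXTRA = str.maketrans({"ł": "l", "Ł": "L"})

def strip_diacritics(text: str) -> str:
    """Drop diacritics, leaving the bare base letters."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    # Combining marks (ogonek, dot above, acute, ...) have a non-zero
    # combining class; everything else is kept.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

forms = ["żona", "żoną", "zona", "zoną"]
print({strip_diacritics(word) for word in forms})  # {'zona'}
```

All four forms collapse to the same string, which is exactly why the reader has to lean on context.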

That's been my experience with other diacritic heavy languages too, esp. Hungarian and even Vietnamese. A Vietnamese co-worker wanted to use my mailbox to write her husband in Hanoi (family business she wanted to keep off the home computer) and she just wrote ASCII. Later, she used it to write to me some too and I found it generally not that hard to read (though I imagine she exercised some specific word choices and added in some redundancy).

The only language I know of that's come up with alternate sets that are used on a comparatively wide scale is Esperanto, where the circumflex over g, for example, is often written gh, gx, g^, ^g or g' (there may be other systems too, but I don't know them).
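To make the gx convention concrete, here is a small Python sketch of the "x" variant applied to all six accented Esperanto letters. The names `X_SYSTEM`, `to_x_system` and `from_x_system` are invented for this illustration, and the sketch only handles lowercase and initial-capital forms. The reverse direction relies on x not being a letter of the Esperanto alphabet, so it would mangle foreign words that happen to contain these letter pairs.

```python
# Hypothetical mapping for the "x" convention: accented letter -> letter + x.
X_SYSTEM = {"ĉ": "cx", "ĝ": "gx", "ĥ": "hx", "ĵ": "jx", "ŝ": "sx", "ŭ": "ux"}

def to_x_system(text: str) -> str:
    """Replace each accented letter with its ASCII digraph."""
    out = []
    for ch in text:
        digraph = X_SYSTEM.get(ch.lower())
        if digraph is None:
            out.append(ch)
        else:
            # Keep an initial capital: Ĝ becomes Gx.
            out.append(digraph.capitalize() if ch.isupper() else digraph)
    return "".join(out)

def from_x_system(text: str) -> str:
    """Turn the ASCII digraphs back into accented letters."""
    for letter, digraph in X_SYSTEM.items():
        text = text.replace(digraph, letter)
        text = text.replace(digraph.capitalize(), letter.upper())
    return text

print(to_x_system("eĥoŝanĝo ĉiuĵaŭde"))  # ehxosxangxo cxiujxauxde
```

The other spellings mentioned above (gh, g^, ^g, g') could be handled by swapping in a different table, though the h variant is ambiguous in reverse because h is an ordinary Esperanto letter.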

Actually I think a system was worked out for Vietnamese too, but it seemed to be restricted to speakers born outside of Vietnam and tied into conversion programs (so you either saw the converted text but could still make out the original).

1:08 AM  
Blogger Gary Feng said...

This is probably a digression from what Suzanne meant by "dialect char sets": Cantonese has created hundreds of "non-standard" characters that are not used in the standard Chinese writing system. See

-- gary

8:59 PM  
Anonymous Anonymous said...

Hi Gary,

It usually doesn't matter what I meant. This blog is so non-linear. I had never really thought much about these 'non-standard' characters for Cantonese but I am extremely interested in non-standardized orthography of any kind. So thanks.

BTW, don't you think that some dyslexics might have difficulty with this word verification system?

9:08 PM  
Anonymous Anonymous said...

Hi Suzanne,

This is a topic I find interesting as well. Aside from the Esperanto example Mike mentioned above, I can think of three examples:

Brazilian Portuguese

Brazilians have a fairly consistent de facto orthography for avoiding the very common accent marks in Portuguese:

Here's a random sentence I just plucked off some web page and how it would probably be rendered in an instant message conversation, for instance:

Os inventores não pensam em comerciar a bolacha apenas para fazer pedidos --eles admitem que essa não é uma tarefa difícil para os freqüentadores de bares.

The accented words (não, é, difícil, freqüentadores) would be rendered as: naum, eh, dificil, frequentadores. Interestingly, it seems to be the case that the substitutions are only made word-finally.
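Those four renderings can be reproduced with a short Python sketch. It assumes only two word-final substitutions, ão to aum and é to eh, and strips every other accent; real chat usage is surely looser than this, and the function names `strip_accents` and `chat_spelling` are made up for the example.

```python
import re
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining marks, leaving the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def chat_spelling(text: str) -> str:
    """Apply the assumed word-final substitutions, then drop other accents."""
    # Work on precomposed characters so the patterns below can match.
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"ão\b", "aum", text)  # não -> naum
    text = re.sub(r"é\b", "eh", text)    # é -> eh
    return strip_accents(text)

print(chat_spelling("não é difícil"))  # naum eh dificil
```

Because both patterns are anchored at the end of a word, the é inside a longer word would simply lose its accent rather than become eh.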


Arabic

I know very little of Arabic myself, but I seem to remember ASCII-ized Arabic having a name, something like "khawarja" or "khawaja"... I really can't recall. But this system is fairly common on message boards and the like. Numerals are used to indicate the distinctly Arabic letters such as 'ain. (I believe "7" and "3" are particularly common in this regard.)

Scandinavian languages

For this one I could have sworn there was a FAQ out there somewhere about orthographic conventions such as A{ standing for Å and so on... I remember because I found it incredibly painful to look at, heh. I will try to dig that up.

This is a really interesting topic, I expect I will blog about it at some point.

10:17 PM  
Anonymous Anonymous said...

So now we have a pre-1990 date for the ASCII Polish alphabet - and we also have me kicking myself for not making a proper note of it when I saw it! Too many linguistic phenomena of this kind disappear because they are too obvious at the time and no-one can imagine that they'll ever be forgotten.
I liked the Portuguese example. Spanish is luckier than Portuguese because all the accented letters except ñ appear on the default keyboard nowadays (and I think it is rather mean that AltGr+N doesn't do what it obviously ought to do). Everyone gets a laugh out of answering ¿cuantos anos tienes? with ¡uno! once or twice and then gets bored with the joke.
I've seen someone in an outdoor web café in Kraków typing at high speed into an IRC channel in Vietnamese but I think he just wasn't bothering with accents at all.

2:19 PM  
