UNICODE vs. UTF-8 vs. RootsMagic

6 replies to this topic

#1 Jerry Bryan (Advanced Member, 3929 posts)

Posted 13 May 2020 - 10:02 AM

Not too long ago, I mentioned on these forums that RM's UTF-8 character encoding would not support the UNICODE encoding for many languages of the world. That was completely incorrect, and indeed RM's UTF-8 can support anything that UNICODE can support, which is pretty much every language in the world. It was an honest but grievous error on my part. Here is some background on the issue.

 

Back in the earliest days of computing, computers had very limited character sets and could not even encode lower case letters. I believe that before computers there were character sets for teletype machines that could not even encode numeric characters, so numbers had to be spelled out in English. Also, all the initial character sets for computing covered English letters only. Gradually, these limitations were resolved: first lower case letters could be encoded, and later letters from languages other than English could be encoded as well.

An early character set that could encode lower case English letters was ASCII, the American Standard Code for Information Interchange. ASCII was truly an American code. An organization called the International Organization for Standardization (ISO) also developed character sets for other languages, especially for languages based on the Latin alphabet. These character sets were called ISOnnn, where nnn was a number. The code that corresponded to ASCII and American English was called ISO646; I don't remember what the ISO codes were called for French or German and so on.

 

In any case, ISO recognized that a code needed to be developed that would support all the languages in the world, and the code that was developed was called ISO10646. ISO10646 was much criticized, even by the committee that developed it. It was considered to be messy and unwieldy. One characterization of ISO10646 was that a committee had been tasked to develop a horse and instead they came up with a camel.

 

I was a part of a group that reviewed ISO10646. I had nothing to do with the design, just the review. And the review committee was humongous: at least hundreds of people, if not more. So the review committee was scarcely in mutual agreement either, except that most everybody on the committee thought that ISO10646 had serious problems. The upshot was that ISO10646 was voted down, mostly by the same people who developed it (I didn't have a vote). I don't remember the exact date, but this was in the late 1980s.

 

One of the problems with ISO10646 was that it had variable length characters. Basic English characters would remain encoded as 8 bits, but many letters in some languages would be encoded as 16 bits, and some would even be encoded as 32 bits. This kind of variable encoding would have made any kind of text very difficult for computer software to process.

 

In the aftermath of the failure of ISO10646 came UNICODE, which is what we have today. The legend is that the basic design of UNICODE was accomplished by two guys at a restaurant in Silicon Valley who in one night wrote their ideas down on a cocktail napkin. I think the legend is mostly true, and it is repeated to emphasize that committees of thousands are not good ideas. In any case, an important attribute of UNICODE was that all characters of every language would be represented by the same number of bits. UNICODE has evolved a bit since its beginnings, and in the current model it is a 32-bit code of which only 21 bits are used. As such it is wasteful of space. It is even more wasteful if you have a lot of English text, which could still be encoded in 8 bits per character.
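Just to make the space argument concrete, here is a small sketch in Python (my choice of language purely for illustration; I have no idea what RM itself is written in). It confirms that the largest UNICODE code point needs only 21 bits, while UTF-32 spends 4 bytes on every character:

    # The largest UNICODE code point is U+10FFFF, which fits in 21 bits,
    # yet UTF-32 spends a full 32 bits (4 bytes) on every character.
    print((0x10FFFF).bit_length())        # 21

    text = "RootsMagic"
    print(len(text.encode("utf-32-le")))  # 40 bytes: 4 per character
    print(len(text.encode("utf-8")))      # 10 bytes: 1 per ASCII character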

The early days of UNICODE were the 1990s, and that's where I lost contact with the project. The last I heard, there were going to be 8-bit and 16-bit versions of UNICODE that could be used when all the data would fit in 8 bits or 16 bits per character. But that's not so, and this is the source of my honest mistake in understanding UTF-8.

 

UNICODE is 32 bits per character. Period. What has happened since then is that pure UNICODE is seldom, or maybe never, used directly. Rather, encodings called UTF-8, UTF-16, and UTF-32 are used. To tell you the truth, I'm not sure what the difference is between UTF-32 and pure UNICODE, since they are both 32-bit encodings. But UTF-8 and UTF-16 are variable length encodings of full UNICODE, and the design is very similar to the original "a camel is a horse designed by a committee" design of ISO10646. If UTF-8 is used to encode UNICODE text which contains only basic English (ASCII) characters, then each letter will be encoded as 8 bits and can be processed one byte at a time without dealing with variable length characters. But if UTF-8 is used to encode UNICODE text which contains other characters, some of the characters will likely be encoded as multiple bytes. Software which processes character strings encoded in this way either has to deal with the variable length characters itself or has to call standard functions to convert the variable length strings to full UNICODE. Once the conversion has been accomplished, the software can deal with character strings with fixed length characters where all the characters are 32 bits in length.
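Here is a little Python sketch of the variable lengths (the sample characters are my own choices):

    # The number of UTF-8 bytes per character depends on its code point:
    # ASCII takes 1 byte, accented Latin letters take 2, Georgian takes 3,
    # and anything above U+FFFF (such as emoji) takes 4.
    for ch in ["A", "é", "ა", "😀"]:
        utf8 = ch.encode("utf-8")
        print(ch, f"U+{ord(ch):04X}", len(utf8), utf8.hex(" "))
    # A U+0041 1 41
    # é U+00E9 2 c3 a9
    # ა U+10D0 3 e1 83 90
    # 😀 U+1F600 4 f0 9f 98 80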

 

I really don't know how RM deals with UTF-8 strings internally. It may convert them to pure UNICODE before processing them and then convert them back to UTF-8 before writing them to disk. But I will post a follow-up message to report on some experiments with RM where I have entered characters into RM that cannot be encoded in only 8 bits. RM does seem to support them just fine. So I was wrong. RM can stay with UTF-8 forever and be able to support every language in the world if it so chooses.
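For what it's worth, the general pattern I'm describing looks something like this in Python, where a string is already a sequence of fixed length code points. This is just the pattern, not RM's actual code:

    # Not RM's actual code -- just the decode/process/re-encode pattern:
    # UTF-8 bytes on disk are decoded to fixed length code points in
    # memory, processed there, and re-encoded before being written back.
    raw = b"\xe1\x83\x90\xe1\x83\xa1"      # UTF-8 bytes for Georgian "ას"
    text = raw.decode("utf-8")             # now fixed length code points
    print([hex(ord(c)) for c in text])     # ['0x10d0', '0x10e1']
    processed = text[::-1]                 # whatever processing is needed
    back = processed.encode("utf-8")       # back to UTF-8 for the disk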

 

Jerry



#2 Jerry Bryan (Advanced Member, 3929 posts)

Posted 13 May 2020 - 10:32 AM

One of the issues that troubled the ISO10646 reviewers was that each symbol in the character set had a name, and the reviewers could not agree on some of the names. For example, should the name of Ø be "Norwegian letter Ø" or "Norwegian character Ø"? It actually makes a big difference sometimes, because most special characters are not allowed in things like email user names, while letters always are. So having the symbol called a Norwegian letter vs. a Norwegian character is the difference between it being allowed in an email user name or not. The same letter vs. character issue exists in many similar places throughout computing and data processing. The Norwegians lost this argument, by the way.

So I was surprised a couple of days ago when I saw a Facebook username which was made up of the strangest letters I had ever seen. I really had no clue what language they were from. And I was even more surprised that Facebook was allowing them in a username given that email and things like that will usually not accept them. So I copied and pasted the letters into Google Translate. They were Georgian. Who knew that the tiny little country of Georgia had an alphabet all its own? Certainly not me.

The fact that Facebook was allowing the Georgian letters in usernames got me to thinking about RM and UTF-8 and what characters RM would really accept. So I pasted the letters into RM, both as the names of individuals and into RM's notes. The letters worked just fine in both places. Well, when I say "just fine", I'm not so sure the Georgians would like how their letters sort as names in RM. But at least RM accepts the letters, even though I thought it wouldn't because of my complete misunderstanding of the relationship between UTF-8 and UNICODE. I looked at the Georgian letters in hexadecimal, and they are indeed correctly encoded variable length UTF-8 strings which can be converted to fixed length UNICODE strings. So all is well in RM with storing all the world's alphabets. What remains for RM in supporting multiple languages is the user interface, reports (especially sentence templates and source templates and the like), etc.

Just out of curiosity, here are a few Georgian letters I'm pasting in to see if they show up correctly in this forum: ასომთავრული
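If anybody wants to repeat the hexadecimal check, one line of Python will do it on the sample above:

    # Each of these Georgian letters is a three-byte UTF-8 sequence
    # beginning with 0xE1.
    print("ასომთავრული".encode("utf-8").hex(" "))
    # e1 83 90 e1 83 a1 e1 83 9d e1 83 9b e1 83 97 e1 83 90
    # e1 83 95 e1 83 a0 e1 83 a3 e1 83 9a e1 83 98
    # (output is one line; wrapped here for readability)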

 

Jerry



#3 popmorgangd (Advanced Member, 42 posts)

Posted 15 May 2020 - 04:35 AM

დიდი შრომა გერი. კარგად გაკეთდა. [In Georgian: "Great work, Gerry. Well done."]



#4 robertjacobs0 (Advanced Member, 338 posts)

Posted 16 May 2020 - 09:51 AM

Google Translate turns "Jerry" into "Gary," but the sentiment is right.



#5 TomH (Advanced Member, 6435 posts)

Posted 16 May 2020 - 11:21 AM

What I got is Gerry:
"Great hard work Gerry. Well done"

Tom, user of RM7630, FTM2017, Ancestry.ca, FamilySearch.org, FindMyPast.com
SQLite_Tools_For_Roots_Magic_in_PR_Celti wiki, exploiting the database in special ways >>> RMtrix app, a bundle of RootsMagic utilities.


#6 KFN (Advanced Member, 307 posts)

Posted 16 May 2020 - 01:19 PM

By the way, very often we call it (ø) a “slash o” in English.



#7 Jerry Bryan (Advanced Member, 3929 posts)

Posted 22 May 2020 - 08:36 AM

I should have done a few additional tests from the get-go, but I have now tested various RM searching functions with Georgian letters. They all seem to work, including Find Everywhere.

 

Jerry