Not too long ago, I mentioned on these forums that RM's UTF-8 character encoding would not support the UNICODE encoding for many languages of the world. That was completely incorrect. Indeed, RM's UTF-8 can support anything that UNICODE can support, which is pretty much every language in the world. It was an honest but grievous error on my part. Here is some background on the issue.
Back in the earliest days of computing, computers had very limited character sets and could not even encode lower case letters. I believe that before that there were character sets for teletype machines that could not even encode numeric digits, so numbers had to be spelled out in English. Also, all the initial character sets for computing encoded English letters only. Gradually these limitations were lifted, first so that lower case letters could be encoded and later so that letters from languages other than English could be encoded as well.
An early character set that could encode lower case English letters was ASCII - American Standard Code for Information Interchange. ASCII was truly an American code. The International Organization for Standardization (ISO) also developed character sets for other languages, especially for languages based on the Latin alphabet. These character sets were identified by ISO numbers. The one that corresponded to ASCII and American English was ISO646, and I don't remember exactly what the ISO codes for French, German, etc. were called (I believe they were in the ISO8859 family).
In any case, ISO recognized that a code needed to be developed that would support all the languages in the world, and the code that was developed was called ISO10646. ISO10646 was much criticized, even by the committee that developed it. It was considered to be messy and unwieldy. One characterization of ISO10646 was that a committee had been tasked to develop a horse and instead they came up with a camel.
I was a part of a group that reviewed ISO10646. I had nothing to do with the design, just the review. And the review committee was humongous - hundreds of people at least, if not more. So the review committee was scarcely in mutual agreement either, except that nearly everybody on it thought the design had serious problems. The upshot was that ISO10646 was voted down, mostly by the same people who developed it (I didn't have a vote). I don't remember the exact date, but I believe the failed ballot was around 1990 or 1991.
One of the problems with ISO10646 was that it had variable length characters. Basic English characters would remain encoded as 8 bits, but many letters for some languages would be encoded as 16 bits, and some letters for some languages would even be encoded as 32 bits. This kind of variable length encoding would have made any kind of text very difficult for computer software to process.
In the aftermath of the failure of ISO10646 came UNICODE, which is what we have today. The legend is that the basic design of UNICODE was accomplished by two guys at a restaurant in Silicon Valley who wrote their ideas down on a cocktail napkin in one night. I think the legend is mostly true, and it is repeated to emphasize that committees of thousands are not a good idea. In any case, an important attribute of UNICODE was that all characters of every language would be represented by the same number of bits. UNICODE has evolved a bit since its beginnings, and in the current model it is a 32 bit code of which only 21 bits are actually used. As such it is wasteful of space, and it is even more wasteful if you have a lot of English text that could still be encoded in 8 bits.
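A quick sketch in Python (nothing to do with RM itself) illustrates the "32 bits, of which only 21 are used" point. Python's chr() and ord() work directly in UNICODE code points:

```python
# The UNICODE code space runs from U+0000 to U+10FFFF, so the largest
# code point needs 21 bits even though UTF-32 stores each one in a
# 32 bit unit.
max_code_point = 0x10FFFF
print(max_code_point.bit_length())  # 21 bits

# A fixed 32 bit unit per character is wasteful for plain English text,
# where every character fits comfortably in 8 bits.
print(ord("A"))  # 65 -- fits in a single byte
```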
The early days of UNICODE were the 1990's and that's where I lost contact with the project. The last I heard, there were going to be 8 bit and 16 bit versions of UNICODE that could be used when all the data would fit in either 8 bits or 16 bits. But that's not so. This is the source of my honest mistake in understanding UTF-8.
UNICODE is 32 bits per character. Period. What has happened since then is that pure UNICODE is seldom or maybe never used directly. Rather, encodings called UTF-8, UTF-16, and UTF-32 are used. As I understand it now, UTF-32 simply stores each UNICODE character in a fixed 32 bit unit, so for practical purposes it and pure UNICODE are essentially the same thing. But UTF-8 and UTF-16 are variable length encodings of full UNICODE, and the design is very similar to the original "a camel is a horse designed by a committee" design of ISO10646. If UTF-8 is used to encode text containing only basic English letters and other ASCII characters, then each character is encoded as 8 bits and can be processed one byte at a time without dealing with variable length characters. But if UTF-8 is used to encode anything else - even western European accented letters - some of the characters will be encoded as multiple bytes. Software which processes character strings encoded in this way either has to deal with the variable length characters or else has to call standard functions to convert the variable length strings to full UNICODE. Once the conversion has been accomplished, the software can deal with character strings with fixed length characters where every character is 32 bits in length.
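The variable length behavior described above is easy to see with Python's built-in codecs (again, just an illustration, not how RM necessarily does it). ASCII characters take one byte in UTF-8, while other characters take two to four bytes, and decoding turns those bytes back into fixed-width code points:

```python
# UTF-8 is variable length: ASCII takes 1 byte, accented Latin letters
# take 2 bytes, many other scripts take 3, and characters beyond the
# Basic Multilingual Plane take 4.
for ch in ["A", "é", "€", "中"]:
    utf8_bytes = ch.encode("utf-8")
    print(ch, "->", len(utf8_bytes), "byte(s)")

# Decoding converts the variable length bytes back into fixed-width
# code points, which software can index one character at a time.
s = "Ab中"
print(len(s.encode("utf-8")))  # 5 -- bytes in the UTF-8 encoding
print(len(s))                  # 3 -- characters after decoding
```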
I really don't know how RM deals with UTF-8 strings internally. It may convert them to pure UNICODE before processing them and then convert them back to UTF-8 before writing them to disk. But I will post a follow-up message to report on some experiments with RM where I have entered characters that cannot be encoded using only 8 bits. RM does seem to support them just fine. So I was wrong. RM can stay with UTF-8 forever and still be able to support every language in the world if it so chooses.
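The kind of experiment I have in mind can be sketched in Python: take text in several scripts, none of it encodable in 8 bits alone, and confirm it round-trips through UTF-8 without loss. (The sample strings are just my own illustrations; I don't know anything about RM's internals.)

```python
# Text in several scripts that needs multi-byte UTF-8 sequences.
samples = ["Ångström", "Ελληνικά", "русский", "日本語", "한국어"]

for text in samples:
    utf8_bytes = text.encode("utf-8")      # what would be written to disk
    restored = utf8_bytes.decode("utf-8")  # what would be read back
    assert restored == text                # lossless round trip
print("all samples survived the UTF-8 round trip")
```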