Frequently Asked Questions: Character Encoding

Q. What are “Encoding Errors”?

A. Stat/Transfer stores strings internally in Unicode, which can represent the characters of all languages, plus many other symbols. Most older character sets are much more limited in scope. For instance, the most common encoding, ASCII, can store only a small set of symbols, letters and numbers, since it has only 128 locations for characters and control codes. Other single-byte character sets double the number of locations to 256, which allows accented characters and other useful symbols. There are a number of such single-byte character sets; for instance, one is suitable for the Cyrillic alphabet and another for modern Greek.
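The difference in capacity can be seen with Python's built-in codecs. This is only an illustrative sketch: the codec names (`ascii`, `latin-1`, `iso8859_5`, `iso8859_7`) are Python's names, not Stat/Transfer settings, and the byte 0xE4 is an arbitrary example.

```python
# ASCII defines exactly 128 code positions (0 through 127).
ascii_chars = bytes(range(128)).decode("ascii")

# A byte above 127 has no meaning in ASCII, so decoding it fails.
try:
    bytes([0xE4]).decode("ascii")
    high_byte_is_ascii = True
except UnicodeDecodeError:
    high_byte_is_ascii = False

# Single-byte sets use all 256 byte values, but each set assigns the
# upper half differently: the same byte becomes a different Unicode
# character in a Western European, a Cyrillic and a Greek set.
meanings = {
    name: f"U+{ord(bytes([0xE4]).decode(name)):04X}"
    for name in ("latin-1", "iso8859_5", "iso8859_7")
}

print(len(ascii_chars), high_byte_is_ascii)
print(meanings)
```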

When Stat/Transfer reads data, it converts it to Unicode based either on the character set settings in the encoding options or on information written in the input file. If your file does not have information on the encoding, and is in a character set that is not the default encoding used on your computer, you must tell Stat/Transfer which encoding to use. For instance, if a Greek colleague sends you a Stata dataset, you may need to select a Greek character set in order to read it properly and translate it to a Unicode-based system such as Excel. If the dataset contains non-ASCII strings and you do not set the encoding properly, you will get nonsense on output.
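The "nonsense on output" can be reproduced in a few lines of Python. This sketch assumes the Greek file was written in ISO-8859-7 and that the reading machine defaults to Latin-1; both are illustrative choices, and the code says nothing about how Stat/Transfer itself performs the conversion.

```python
# Greek text as it might be stored in a file written with ISO-8859-7.
greek_text = "\u0391\u03b8\u03ae\u03bd\u03b1"  # the word "Athens" in Greek
raw_bytes = greek_text.encode("iso8859_7")

# Correct setting: decode with the encoding the file was written in.
right = raw_bytes.decode("iso8859_7")

# Wrong setting: decode with a Western European default. No error is
# raised, because every byte is a valid Latin-1 character, but each
# Greek letter silently turns into an unrelated accented Latin letter.
wrong = raw_bytes.decode("latin-1")

print(right == greek_text)   # the round trip preserves the text
print(wrong == greek_text)   # the mis-decoded text differs
```

Note that the wrong setting produces no error at all, which is why a mistake of this kind shows up only as garbled strings.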

Because every character in a single-byte character set can be mapped to Unicode, there are seldom errors on input. However, you might encounter them if you are reading multi-byte characters such as those for Japanese.
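A short Python sketch of why multi-byte input can fail where single-byte input does not. Shift-JIS is used here only as an example of a Japanese multi-byte encoding, and the truncation is an artificial way of producing an invalid byte sequence.

```python
# In Shift-JIS each of these three characters occupies two bytes.
japanese = "\u65e5\u672c\u8a9e".encode("shift_jis")

# Dropping the last byte leaves half a character at the end.
truncated = japanese[:-1]

# Every one of the 256 byte values is a valid Latin-1 character,
# so decoding arbitrary bytes as Latin-1 cannot raise an error.
as_latin1 = bytes(range(256)).decode("latin-1")

# A multi-byte decoder must reject the incomplete sequence.
try:
    truncated.decode("shift_jis")
    multibyte_failed = False
except UnicodeDecodeError:
    multibyte_failed = True

print(len(as_latin1), multibyte_failed)
```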

The most common problems occur on output, when a character that was read on input has no mapping to the output character set. For instance, if you read your Greek dataset and attempted to write it to SAS using your Western European machine default, there would be many encoding errors because Greek characters in Unicode cannot be mapped to a character set such as latin1.

Some problems are more surprising because it looks as if you are dealing with ASCII, but your file has some characters that cannot be represented in the output. For instance, Microsoft applications generally use Unicode, and characters such as the left apostrophe (the curly left single quotation mark) cannot be mapped to some common non-Unicode character sets, such as ISO-8859-1. The same is true for the Euro sign, which is not present in ISO-8859-1, but is present in its more modern replacement, ISO-8859-15. If any of these characters are present, they create a potential for encoding errors.
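The output-side cases above can be checked with a small Python helper. `can_encode` is a hypothetical name introduced for this illustration, and the codec names are Python's; the sketch only tests whether a mapping exists and is not how Stat/Transfer detects encoding errors.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the target set."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

checks = {
    # Greek letters have no position in a Western European set.
    "Greek to ISO-8859-1": can_encode("\u0391\u03b8\u03ae\u03bd\u03b1", "iso8859_1"),
    # The curly left single quotation mark is absent from ISO-8859-1.
    "left quote to ISO-8859-1": can_encode("\u2018", "iso8859_1"),
    # The Euro sign is missing from ISO-8859-1 but present in ISO-8859-15.
    "Euro to ISO-8859-1": can_encode("\u20ac", "iso8859_1"),
    "Euro to ISO-8859-15": can_encode("\u20ac", "iso8859_15"),
    # Plain numbers and ASCII letters fit everywhere.
    "ASCII to ISO-8859-1": can_encode("Price 100", "iso8859_1"),
}

for label, ok in checks.items():
    print(label, ok)
```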

In order to simplify matters, the most current builds of Stat/Transfer substitute characters very freely. For instance, an é will go to e if the accented character is not present, and an € will go to "EUR".

Q. When don’t I have to worry about this?

A. Most statistical data are numbers and ASCII letters. These are properly and unambiguously represented in all character sets. If that is what is in your data, you will not encounter encoding errors.

Also, if you are reading from or writing to a Unicode-based format, the problem disappears because every character can be properly represented and there is no ambiguity. Excel and Access are in this class, as are versions of SPSS higher than 17. SAS versions 9.1 and above store information about the encoding in the file, and Stat/Transfer can use that to set the encoding. Whenever encoding information is stored in the file, it takes precedence over your option settings.

Q. How do I fix or work around "Encoding Errors"?

A. If the problem arises from an inappropriate character set, either for input or output, you should change the character set in the Encoding Options section of the Options dialog box. For instance, if you know your data contain Cyrillic characters, simply pick an appropriate character set.

If you think the problem may be due to sporadic characters that cannot be translated, choose the “Substitute” option rather than “Stop”. In that case, a visually similar character will be substituted if possible, and an underscore if not.

Sometimes a transfer is necessary out of a Unicode application such as SPSS or Excel to a non-Unicode application such as Stata, and the dataset contains strings in many languages that require different encodings on output. In that case, there is simply no way to work around the problem. Set the number of permitted substitutions to a high number and proceed with your transfer. You will then see reasonable substitutions or underscores for characters that cannot be represented in your output character set.
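The substitution behaviour described above can be approximated in Python as follows. This is a rough sketch and not Stat/Transfer's actual algorithm: the function `substitute`, its `max_substitutions` parameter and the `REPLACEMENTS` table are all hypothetical, and stripping accents through Unicode decomposition is just one plausible way to find a "visually similar" character.

```python
import unicodedata

# Hypothetical replacement table; only the Euro entry comes from the FAQ.
REPLACEMENTS = {"\u20ac": "EUR", "\u2018": "'", "\u2019": "'"}

def substitute(text, encoding, max_substitutions=None):
    """Return (converted_text, substitution_count) for the target encoding.

    Raises ValueError once more than max_substitutions characters have
    needed replacing (max_substitutions=0 behaves like a "Stop" setting).
    """
    out = []
    count = 0
    # Compose first so that "e" plus a combining accent is one character.
    for ch in unicodedata.normalize("NFC", text):
        try:
            ch.encode(encoding)
            out.append(ch)          # representable: keep as is
            continue
        except UnicodeEncodeError:
            pass
        count += 1
        if max_substitutions is not None and count > max_substitutions:
            raise ValueError("too many substitutions")
        candidate = REPLACEMENTS.get(ch)
        if candidate is None:
            # Decompose and drop combining marks: e-acute becomes e.
            parts = unicodedata.normalize("NFKD", ch)
            candidate = "".join(c for c in parts if not unicodedata.combining(c))
        try:
            candidate.encode(encoding)
        except UnicodeEncodeError:
            candidate = ""
        out.append(candidate or "_")  # underscore when nothing similar fits
    return "".join(out), count

print(substitute("caf\u00e9 \u20ac5", "ascii"))   # ('cafe EUR5', 2)
print(substitute("\u65e5\u672c", "latin-1"))      # ('__', 2)
```

With a low limit, for example `substitute("\u65e5\u672c", "latin-1", max_substitutions=1)`, the sketch raises an error instead of continuing, which mirrors the idea of raising the permitted number when many characters cannot be represented.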

Q. What about Stata?

A. Unfortunately, Stata does not support Unicode or other multi-byte character sets, such as those necessary for Far Eastern languages. If you are working with a dataset in which all of the strings are in a language that can be represented by single-byte characters (all European languages), just choose the appropriate output encoding. However, if your dataset contains strings in Far Eastern languages or in multiple languages that use different character sets, you will simply not be able to represent all of the strings properly and will need to live with underscores in your data.

You might also want to let the Stata people know that Unicode support is important to you.

Q. Where can I learn more about character encodings?

A. As always, Wikipedia is a good place to start. See this index to their articles on character encoding. A lucid discussion of the problems solved by Unicode can be found here.

Q. This is too complicated and it is more than I want to cope with. Can you help me?

A. Yes, we can, but it soaks up resources that are best used for improving Stat/Transfer. If you want us to choose your encoding for you, we will do so for a flat $250 fee. You will need to send us your input file and describe your problem in detail. If we can't solve the problem, your payment will be refunded.