Character Encoding


Q. What are Encoding Errors?


A.  Stat/Transfer stores strings internally in Unicode, which is capable of storing all of the characters in all languages, plus many, many other symbols.  Most older character sets are of much more limited scope.  For instance, the most common encoding, ASCII, is only capable of storing a handful of symbols, letters and numbers, since it has only 128 locations for characters and control codes.  Other single-byte character sets double the number of locations to 256, which allows accented characters and other useful symbols.  There are a number of such single-byte character sets; for instance, one is suitable for the Cyrillic alphabet and another for modern Greek.
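The difference in scope can be illustrated outside Stat/Transfer with a short Python sketch.  It uses Python's built-in codecs only as a stand-in; the codec names and the helper `can_encode` are illustrative assumptions, not Stat/Transfer settings.

```python
# Illustration only: Python's built-in codecs, not Stat/Transfer itself.
def can_encode(text, encoding):
    """Return True if every character in text exists in the given character set."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = ("income", "café", "αβγ")  # plain ASCII, accented Latin, Greek

ascii_ok = [can_encode(s, "ascii") for s in samples]
latin1_ok = [can_encode(s, "iso-8859-1") for s in samples]   # Western European
greek_ok = [can_encode(s, "iso-8859-7") for s in samples]    # modern Greek
```

Each single-byte set covers ASCII plus its own extra characters, so the accented string fits only the Western European set and the Greek string fits only the Greek set.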


When Stat/Transfer reads data, it converts it to Unicode based either on the character set settings in the encoding options or on information written in the input file.  If your file does not have information on the encoding, and is in a character set that is not the default encoding used on your computer, you must tell Stat/Transfer which encoding to use.  For instance, if a Greek colleague sends you a Stata dataset, you may need to select a Greek character set in order to properly read it and translate it to a Unicode-based system such as Excel.  If the dataset contains non-ASCII strings and you do not set the encoding properly, you will get nonsense on output.
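To see why a wrong input encoding produces nonsense rather than an error, consider this Python sketch.  It assumes the Greek file was written in ISO-8859-7 and that the reader's default is ISO-8859-1; both choices are illustrative.

```python
# Bytes as a Greek single-byte file might store the word "Ελλάδα" (ISO-8859-7).
raw = "Ελλάδα".encode("iso-8859-7")

# Reading with a Western European default raises no error,
# but every byte is mapped to the wrong character.
wrong = raw.decode("iso-8859-1")

# Reading with the character set the file was actually written in.
right = raw.decode("iso-8859-7")
```

Both calls succeed and return six characters; only the second returns the original word.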

 

Because all single-byte characters can be mapped to Unicode, there are seldom errors on input. However, you might encounter them if you are reading multi-byte characters, such as those used for Japanese.
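A minimal Python sketch of the contrast, assuming a Japanese string stored in Shift-JIS whose last byte has been lost (the damaged data is invented for illustration):

```python
# Two Japanese characters, two bytes each in Shift-JIS.
raw = "日本".encode("shift_jis")
damaged = raw[:-1]  # drop the final byte, leaving an incomplete character

# A single-byte character set maps every byte to some character, so decoding succeeds.
single_byte_text = damaged.decode("iso-8859-1")

# A multi-byte character set can reject the input outright.
try:
    damaged.decode("shift_jis")
    multibyte_failed = False
except UnicodeDecodeError:
    multibyte_failed = True
```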


The most common problems occur on output, when sometimes a character that was read on input has no mapping to the output character set.  For instance, if you read your Greek data set and attempted to write it to SAS, using your Western European machine default, there would be many encoding errors because Greek characters in Unicode cannot be mapped to a character set such as latin1.
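One way to anticipate such output errors is to list the characters that the target character set lacks.  The sketch below does this in Python; the helper `unmappable` is hypothetical and is not part of Stat/Transfer.

```python
def unmappable(text, encoding):
    """Return the distinct characters in text that the target character set lacks."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

# Greek text checked against a Western European output character set.
problems = unmappable("Αθήνα 2024", "iso-8859-1")
```

Here the space and the digits pass, while each of the five Greek letters is reported as having no mapping.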


Some problems are more surprising because it looks as if you are dealing with ASCII, but your file has some characters that cannot be represented in the output.  For instance, all Microsoft applications use Unicode, and characters such as the left (curly) apostrophe cannot be mapped to common non-Unicode character sets.  The same is true for the Euro sign, which is not present in ISO-8859-1, but is present in its more modern replacement, ISO-8859-15.  If any of these characters are present, you may well get an encoding error.
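These two cases can be checked directly with Python's codecs, used here only as an illustration of the character sets named above:

```python
LEFT_QUOTE = "\u2018"  # left single quotation mark, the "curly" apostrophe
EURO = "\u20ac"        # Euro sign

def fits(ch, encoding):
    """Return True if the character has a code in the given character set."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

euro_in_latin1 = fits(EURO, "iso-8859-1")
euro_in_latin9 = fits(EURO, "iso-8859-15")
quote_in_latin1 = fits(LEFT_QUOTE, "iso-8859-1")
quote_in_latin9 = fits(LEFT_QUOTE, "iso-8859-15")
```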


Q.  When don't I have to worry about this?


A.  Most statistical data are numbers and ASCII letters.  These are properly and unambiguously represented in all character sets.  If that is what is in your data, you will not encounter encoding errors.
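As a quick illustration in Python, plain ASCII text produces exactly the same bytes in several common ASCII-compatible character sets (the list of encodings below is just a sample):

```python
text = "age,income,1234.5"
encodings = ["ascii", "iso-8859-1", "iso-8859-7", "iso-8859-15", "utf-8"]

# Encode the same ASCII-only string in each character set.
encoded = {enc: text.encode(enc) for enc in encodings}

# If the set of distinct byte strings has one member, all encodings agree.
all_identical = len(set(encoded.values())) == 1
```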


Also, if you are reading from or writing to a Unicode-based format, the problem disappears because every character can be properly represented and there is no ambiguity.  Excel and Access are in this class, as are versions of SPSS higher than 17.  SAS versions 9.1 and above store information about the encoding in the file, and Stat/Transfer can use that to set the encoding.  Whenever encoding information is stored in the file, it takes precedence over your option settings.


Q. How do I fix or work around "Encoding Errors"?


A. If the problem arises from an inappropriate character set, either for input or output, you should change the character set in the Encoding Options section of the Options dialog box.  For instance, if you know your data contains Cyrillic characters, simply pick an appropriate character set.


If you think the problem may be due to sporadic characters that cannot be translated, choose the ‘Substitute’ option rather than ‘Stop’. In that case, an underscore will be substituted for characters which cannot be converted.
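The effect of substituting underscores can be imitated in Python with a custom codec error handler.  This is only a sketch of the idea; the handler name "underscore" is made up here, and nothing below reflects how Stat/Transfer implements the option internally.

```python
import codecs

def underscore_handler(err):
    """Replace each character that cannot be encoded with an underscore."""
    if isinstance(err, UnicodeEncodeError):
        # Return the replacement text and the position at which to resume.
        return ("_" * (err.end - err.start), err.end)
    raise err

# Register the handler under a hypothetical name for use with str.encode().
codecs.register_error("underscore", underscore_handler)

text = "It\u2019s 5\u20ac"  # curly apostrophe and Euro sign
out = text.encode("iso-8859-1", errors="underscore").decode("iso-8859-1")
```

The rest of the string survives unchanged; only the two characters missing from ISO-8859-1 become underscores.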


Sometimes, if you must transfer data out of a Unicode application such as SPSS or Excel to a non-Unicode application, and the dataset contains strings in many languages that require different encodings on output, there is simply no way to work around the problem. In that case, set the number of permitted substitutions to a high number and proceed with your transfer. You will then see underscores for characters that cannot be represented in your output character set.
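The interaction between substitution and a limit on the number of substitutions can be sketched in Python as follows.  The function `encode_with_limit`, its `max_substitutions` parameter, and the choice to raise an error once the limit is exceeded are all assumptions made for illustration.

```python
def encode_with_limit(strings, encoding, max_substitutions):
    """Replace unmappable characters with underscores, up to a permitted count."""
    converted = []
    count = 0
    for s in strings:
        chars = []
        for ch in s:
            try:
                ch.encode(encoding)
                chars.append(ch)
            except UnicodeEncodeError:
                count += 1
                if count > max_substitutions:
                    raise ValueError("too many substitutions")
                chars.append("_")
        converted.append("".join(chars))
    return converted, count

# Strings in three languages, written to a Western European character set.
rows = ["Paris", "Αθήνα", "東京"]
converted, n = encode_with_limit(rows, "iso-8859-1", max_substitutions=100)
```

With a generous limit the conversion completes and the non-Latin strings come out as underscores; with a limit of 3, the same call would stop partway through the second string.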


Q. What about Stata Version 13 and earlier?


A.  Stata before version 14 did not support Unicode, nor did it support other multi-byte character sets, such as those necessary for Far Eastern languages. If you are working with a dataset in which all of the strings are in a language that can be represented by single-byte characters (this includes all European languages), just choose the appropriate output encoding. However, if your dataset contains strings in Far Eastern languages, or in multiple languages that use different character sets, you will simply not be able to properly represent all of the strings and will need to live with underscores in your data.


We highly recommend that Stata users upgrade to get the benefits of Unicode.  With Stata 14 and later, Stat/Transfer can transfer data in any language to Stata.