Encoding Options

Introduction to Character Encoding

In the early days of computing, characters and numbers were represented by seven (and sometimes fewer) bits.  The most popular of these encodings, ASCII, allowed the representation of upper and lower case letters, numbers and various symbols.  Later, different eight-bit relatives of ASCII were developed, which allowed the encoding of an additional 128 characters, such as accented letters and various other symbols. For languages such as Japanese, which required many more characters, other multi-byte character sets were developed.  This resulted in different, incompatible character encoding systems, even for the encoding of a single language's character set.  Fortunately, order is now being restored to this Babel with the proliferation of Unicode, a means of uniformly representing the characters of all languages.


In Stat/Transfer 13, all characters are represented internally in Unicode.  It is thus capable of handling strings in any language.  However, many of the file formats that Stat/Transfer supports were designed in the days when ASCII was the standard way of representing strings.  Thus, when they are read, they give Stat/Transfer no information on the code page that is in use, and when they are written, they have no place for Stat/Transfer to store the code page that was used.



For some file formats, the user will not need to worry about any of this.  Excel '97 and above, for instance, stores characters in Unicode.  If you are writing to Access or SAS Version 9 or above, no conversion will be necessary as both of these are Unicode aware.


Your computer's operating system maintains a setting of the current working code page.  In most cases applications will use that information to encode characters in a consistent manner.  However, if someone in Japan, or Greece, or Russia sends you a file, you will need to tell Stat/Transfer the encoding that was used to write their files.


The options on this page provide a way for the user to specify, when necessary, the encoding that is present in an input file and the one that is desired in an output file.


Input Character Set

In general, unless you are sure of what you are doing, you should leave this option at the default setting, Use current system default code page.  If that option is checked, you will be able to see which code page is in use on your computer.  That code page will then be used for files that do not have known encoding information.  If the encoding of an input file can be determined from the contents of the file (or its format), that will, of course, override this setting.
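
For readers who want to see what a "current system default" encoding looks like outside of Stat/Transfer, the following Python sketch asks the operating system for its preferred encoding.  It only illustrates the concept of a system-wide default; it is not how Stat/Transfer obtains the value, and the variable name is ours.

```python
import locale

# Ask the operating system which encoding it prefers for text.
# Passing False avoids changing the process locale as a side effect.
default_encoding = locale.getpreferredencoding(False)

# On a Western European Windows machine this is typically "cp1252";
# on many Linux and macOS systems it is "UTF-8".
print(default_encoding)
```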


If you choose to override the default behavior, you can choose a different encoding by first selecting the Region and then the Character set.  Because Stat/Transfer represents characters internally in Unicode, any character set can, in principle, be converted on input.  However, if you select an incorrect character set, what you will get is likely to be nonsense.  Therefore we strongly urge you to look carefully at your data in the viewer to make sure that all is well and to make sure that the font you have selected for the viewer is capable of displaying the characters you need.
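
The effect of choosing the wrong input character set can be reproduced in a few lines of Python.  This is a minimal sketch using Python's own codecs, with an example word of our choosing; it is meant only to show why the result is nonsense rather than an error.

```python
# A Greek word, as it might be stored by a program using the
# Windows Greek code page (cp1253).
original = "Αθήνα"
raw_bytes = original.encode("cp1253")

# Reading the same bytes with the Western European code page (cp1252)
# raises no error, but every byte is mapped to the wrong character.
wrong = raw_bytes.decode("cp1252")

# Reading them with the code page that was actually used recovers the text.
right = raw_bytes.decode("cp1253")

print(wrong)
print(right)
```

Note that the incorrect decoding succeeds silently, which is why inspecting the data by eye is the only reliable check.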


Output Character Set

Unless you are sure of what you are doing, you should leave this option at the default setting, Use current system default code page.  This is especially true for the output character set because any input character set can be converted to our internal representation, Unicode.  However, you have to get it just right when going from Unicode to another single-byte or multi-byte character set or you will be guaranteed to get total nonsense.  For instance, if you have read a Japanese file and then want to convert it to a multi-byte Chinese character set, Stat/Transfer will simply stop in its tracks since this is impossible (unless the data consist only of numbers, simple punctuation, and the letters A-Z).
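
The asymmetry between input and output can be illustrated with Python's codecs.  In this sketch the helper name can_encode and the sample strings are our own; it simply tests whether a piece of Unicode text fits in a given target character set.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


japanese = "東京"

# The text fits in a Japanese character set ...
print(can_encode(japanese, "shift_jis"))
# ... but not in a Western European one.
print(can_encode(japanese, "cp1252"))
# Plain digits, punctuation, and the letters A-Z fit almost anywhere.
print(can_encode("ABC 123.", "gb2312"))
```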


For file formats that are Unicode-aware (e.g. Excel and SAS 9+), Stat/Transfer will write Unicode, regardless of how you set this option.


On Encoding Errors

Unicode can represent any character.  Unfortunately the same cannot be said for other character sets.  When Stat/Transfer moves character data from its internal Unicode representation to a character set that can represent fewer characters, some characters may not fit.  For example, Microsoft applications such as Excel store characters in Unicode, and in some cases, although it looks as if the data can be represented in a Western European character set, there are some characters that cannot be encoded.  These include the right and left apostrophes and, often, the Euro sign.
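
If you want to find out in advance which characters in a piece of text will cause trouble, a check along the following lines can help.  This is a sketch in Python, with a hypothetical helper named unencodable and a made-up sample string; it uses ISO-8859-1 as the Western European target.

```python
def unencodable(text: str, encoding: str) -> list:
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Left and right apostrophes and a Euro sign, as a spreadsheet might store them.
sample = "\u2018Quoted\u2019 price: \u20ac5"
problems = unencodable(sample, "iso-8859-1")
print(problems)
```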


The default behavior when this occurs is Substitute.  With this option, Stat/Transfer will substitute characters for those that cannot be converted.  For example, it will substitute a single quote for right and left apostrophes, and, if necessary, non-accented letters for those with accents.  If no substitution is possible, underscores will be substituted.  If you do not want any substitution performed, you can check the option Stop, which directs Stat/Transfer to stop on the first conversion error encountered.


The error limit sets the maximum number of permitted substitutions.  The default is 100.
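
The Substitute and Stop behaviors, together with the error limit, can be approximated by the Python sketch below.  It is not Stat/Transfer's implementation: the function names, the substitution table, and the assumption that conversion fails once the limit is exceeded are all ours, chosen only to make the description above concrete.

```python
import unicodedata

# Hypothetical substitution table for common typographic quotes.
SUBSTITUTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


def fallback(ch: str, encoding: str) -> str:
    """Pick a replacement for ch, or an underscore if none fits."""
    candidate = SUBSTITUTES.get(ch)
    if candidate is None:
        # Decompose accented letters and drop the combining accent marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        candidate = "".join(c for c in decomposed if not unicodedata.combining(c))
    try:
        candidate.encode(encoding)
    except UnicodeEncodeError:
        return "_"
    return candidate or "_"


def convert(text: str, encoding: str, on_error: str = "substitute",
            error_limit: int = 100):
    """Encode text; return the bytes and the number of substitutions made."""
    out = []
    substitutions = 0
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        if on_error == "stop":
            raise ValueError(f"cannot encode {ch!r} in {encoding}")
        substitutions += 1
        if substitutions > error_limit:  # assumed behavior at the limit
            raise ValueError("error limit exceeded")
        out.append(fallback(ch, encoding))
    return "".join(out).encode(encoding), substitutions


encoded, count = convert("It\u2019s caf\u00e9 \u20ac5", "ascii")
print(encoded, count)
```

Calling convert with on_error="stop" raises on the first character that cannot be encoded, which corresponds to the Stop option.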


Note: If your Western European data have Euro and other currency signs, a good choice for your single-byte output character set is ISO-8859-15, which is a more recent revision of ISO-8859-1 that includes the Euro sign.
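
The difference between the two character sets is easy to confirm with Python's codecs.  The sketch below (variable names are ours) shows that ISO-8859-15 has a slot for the Euro sign while ISO-8859-1 does not, and that neither contains the right apostrophe.

```python
euro = "\u20ac"              # the Euro sign
right_apostrophe = "\u2019"

# ISO-8859-15 stores the Euro sign in a single byte.
latin9_euro = euro.encode("iso-8859-15")

# ISO-8859-1 has no Euro sign at all.
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# ISO-8859-15 still lacks the right apostrophe, so substitution may still occur.
try:
    right_apostrophe.encode("iso-8859-15")
    latin9_has_apostrophe = True
except UnicodeEncodeError:
    latin9_has_apostrophe = False

print(latin9_euro, latin1_has_euro, latin9_has_apostrophe)
```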