SCHEMA Files for ASCII Input
Previous Topic  Next Topic 

The Schema file is used to describe an input ASCII file to Stat/Transfer.


If you are using a fixed format ASCII file that was not created by Stat/Transfer, then you must create the Schema file before you read a fixed format ASCII file into Stat/Transfer.


If you are using a delimited ASCII file as input, you may use a Schema file to specify variable names, labels and missing values in a more detailed way than is allowed by the automatic method described in the section Delimited ASCII Files.


The Schema file should be a plain ASCII text file with the extension .sts (Stat/Transfer Schema) or .stsd .(Stat/Transfer Schema - Delimited).  You can create it with any text editor that will create plain text.


It is most straightforward if your ASCII data file has the same name as the Schema file, with the extension .dat, and is located in the same directory as the Schema file.  If this is not the case, you can override this default with the FILE command, as discussed below.


Note that the Schema syntax is similar to SPSS DATA LIST syntax, but is more convenient to use.  There is no MISSING VALUES command, because missing values can be defined along with the variables.  Similarly, variable labels can be defined alongside the variables. Value labels are attached with tags in a manner that is closer to SAS and Stata than SPSS.


Creating a Schema File

The structure of the Schema file is the following:


FILE  file specification         when required

FORMAT [fixed] | delimited  delimiter

FIRST LINE  n       when required

VARIABLES

       /         when required

       Variable name | variable list | variable range   columns  (variable type)    Fixed

       Variable name [variable number]  (variable type)     Delimited                                      

       Optional elements for the VARIABLES command

                [ Missing value specs ]

                { Variable label } }

                \ Value label tag

Optional commands

VARIABLE LABELS

      Variable name   variable label

VALUE LABELS

       \label tag

               value  label

                    

               value  label

DATA            when required




Each of the major commands (FILE, FORMAT, FIRST LINE, VARIABLES, VARIABLE LABELS, VALUE LABELS, and DATA) must begin in the first space on a line.  All other elements must be indented at least one space or tab.


Commands in the Schema file are not case sensitive.


For fixed format data, the most basic Schema file consists of a list of variable names, followed by their column locations and, for non-numeric variables, the variable type in parentheses.


For delimited data, the most basic Schema file consists of a list of variable names and the types, in parentheses, of all variables.


All of the elements of a Schema file are described below.


FILE Command

If your ASCII input file is not in the same directory as your Schema file, if it has an extension other than .dat, or has a name that is different from the Schema file, you must use the FILE command.


This consists of the word FILE beginning in the first column, followed by the complete path and name of your input data file.  If there are embedded spaces in any of these elements, you must enclose the file specification in single or double quotes.


For example:


      File "/my data files/new/big.file".


If the ASCII input file is in the same directory as the Schema file, with the same name and with an extension .dat, then the FILE command does not need to be used.


The FILE command is also omitted if you choose to put your input data in the same file as the Schema commands, using the DATA command (see below).


Note that when Stat/Transfer writes a Schema file for an output fixed format or delimited ASCII file, the output file specification will be written to the FILE command  using the same drive (Windows), directory and name as the input file and appending the file extension .dat. You can tell Stat/Transfer not to use this path, but instead to use only the default directory '.' and the file name.   To do so, uncheck the option Write Complete Paths in the ASCII/Text File Write section of the Options dialog box or use the SET command CODE-OMIT-PATHS Y with the command processor.


FORMAT Command

The format command is optional for fixed format data, but it must be present in order to read delimited data.  It begins in the first column.


When reading fixed format data, the FORMAT command, if present, is followed by 'fixed'.  This command is never required for fixed format files, but you may wish to use it for documentation purposes.


When reading delimited data, the FORMAT command is followed by 'delimited', which is then followed by the delimiter.  The delimiter can be 'commas', 'tabs', 'spaces' or 'semicolons', or you can specify any other delimiter character by preceding it with a '\'.  For instance to use the pipe character for a delimiter, type


       format delimited \|


To use commas, type


       format delimited commas


FIRST LINE Command

If you need to skip lines at the beginning of your data file, use this command, which starts in the first column. After the command, enter the number of the first line to be read.


For example, to start on the third line of your file, type


    first line 3




VARIABLES Command

The VARIABLES command must be used.  It begins in the first column.


Record Specifiers

If you have more than one record for each logical case, you must indicate the start of each new record by a slash.  For example, if each case is on two records, then the VARIABLES command would have the following form:


VARIABLES

       variable specifications for record one

       /

       variable specifications for record two


where '/' signifies the start of the second record.


You must specify as many records as are present, but you need not define fields on each record.


Variable Specification for Fixed Format Data


Variable Names

For fixed format input, the required elements after the VARIABLES command are a list of variable names, followed by their column locations and, for non-numeric variables, the variable type in parentheses.


Variable names must begin with a letter or an underscore.  If they have embedded spaces (not recommended) they must be enclosed by single or double quotes.


The column locations begin with column one.  The variable type, when necessary, is given in parentheses after the column location.


For example, a Schema file might contain the following VARIABLES command:



Variables



   ID

1-5


   Age

6-7


   Name

8-20 (A)


   Sex

21


   Birthdate

(%d-%m-%Y)

  

The type is not necessary when the variable type is numeric.  String variables are designated by the letter 'A', and for dates and times you should use a Stat/Transfer date input format.  For dates in the format 10-MAY-1950, for example, you would use (%d-%m-%Y), the example above.  Note that different formats can be used for different date and time variables.


Numeric variables can be also be specified with a starting column location, and a format giving the variable width and an optional number of implied decimal places.  In this case, the ending column location is not necessary.  The format is given in parentheses after the starting column location.


          variablename  begcol   (fwidth.n)


If a decimal point is present in the input number, it is read and used.  If a decimal place is not  found in the data, the number will be divided by 10 to the nth power, where n is the number of decimal placed given in the format.


For example,


          income  6  (F10.2)


will read the variable 'income' beginning in column 6, with a width of 10.  If the number found in the field is 2.00 the result will be 2.  If  the data contain 200 the result will be two as well.


The input widths, whether given explicitly or with a format giving the variable width, may be used to specify the output width when writing program files with fixed format ASCII data.  In order for these widths to be used in the output file, you should check the option Preserve Input Widths in the ASCII/Text Write section of the Options dialog box or use the SET command PROG-PRESERVE-WIDTHS Y in the command processor.



Variable Lists   If you have adjacent variables in a record with the same widths and of the same type, you can use a list to specify them in the VARIABLES command.


The width specified by the given column range is divided by the number of variables in the list to give the width of each variable.  If the number of variables does not give an integer when divided into the width specified, then an error is returned.


For example,


           incsat housesat 22-23


will describe the numeric variable 'incsat' as having a width of one, located in column 22 and the numeric variable 'housesat' as having a width of one in column 23.



Variable Range   If you have adjacent variables in a record with the same widths and types and with names of the form variablen, you can use a range to specify them in the VARIABLES command.


For example,


          scale1 to scale10  24-43


will expand to 10 variables named 'scale1', 'scale2' ... 'scale10', each of width 2 and located starting in column 24.


As is the case with using variable lists, if the number of variables does not give an integer when divided into the width specified, then an error is returned.



Variable Specifications for Delimited Data


For delimited input data, the required elements after the VARIABLES command are a list of variable names, followed by their variable types in parentheses.


For example, a Schema file might contain the following VARIABLES command:



Variables



   ID

(A5)


   Age

(F)


   Name

(A16)


   Sex

(F2)


   Birthdate

(%Y-%m-%d)

  

Variable names must begin with a letter or an underscore.  If they have embedded spaces (not recommended) they must be enclosed by single or double quotes.


Variable numbers (starting with 1) are written by default in version 13 of Stat/Transfer.  They are optional when reading data described by a schema, but they are useful because they allow variables to be deleted or re-arranged. 


The variable type is always necessary for delimited data.


String variables are indicated by the letter 'A' followed by a width.  The width specification is required.


Numeric data are indicated by the letter 'F'.  A width is not required for numeric data, but may be given in order to specify the output width when writing program files with fixed format ASCII data.  You can give either a width such as 'F2" or a width and implied number of decimals, such as 'F4.2'.  In order for these widths to be used in the output file, you should check the option Preserve Input Widths in the ASCII/Text Write section of the Options dialog box.


For dates and times, you should use a Stat/Transfer date input format, as documented in Date/Time Formats - Reading. For dates in the format 1950-MAY-10, for example, you would use (%Y-%m-%d). The formats given here override those selected in the Date/Time Formats - Reading section of the Options dialog box. Note that different formats can be used for different date and time variables.



Optional Elements for the Variables Command


Missing Value Specifications

Missing values can be coded in your input data file as blank, or as one of the extended missing values '.' and '.a-.z'.  See the ASCII/Text Files - Read Options for specifying extended missing values.


However, it is probably better practice to code missing values numerically in your data and then assign these numbers to 'missing' as the data are read in.  This allows the missing data to become a subject of analysis.


Up to three missing value specifications can be entered for each variable, variable list or variable range.  These are entered on the same line as the variable name, variable list, or variable range, following the column numbers and type (if present).  The missing value specifications are entered in square brackets and are separated by commas.


A number given by itself is an "equal-to" specification.  That is, if an input value is equal to the number, it will be considered missing.  In addition, missing values specifications can be entered as '<=' (less than or equal) plus a number and '>=' (greater than or equal) plus a number.


For a fixed format input file, for example, suppose the values `0', `98', and `99' for the variable 'age' in the input data are all used to indicate a missing value.  Then


       Age 6-7   [0,98,99]

       Age 6-7   [0,>=98]

       Age 6-7   [<=1,98,99]


are all equivalent (assuming positive values for the variable age).


For a delimited input file,  this would be specified as


       Age (F) [0,98,99]


and so on.


Variable Labels

If you only have one variable on a line (not a list or a range), and wish to use variable labels, then it is most convenient to put them on the same line as the variables.


The variable labels are enclosed in curly braces, as in


       Age 6-7  [0,98,99] {Age of Respondent}


If you are using a range or a list to define several similar variables and you wish to label them, you must use the VARIABLE LABELS command, described below.


Value Label Tag

If you wish to attach labels to the values of your variables, you must define the label set using the VALUE LABELS command, discussed below.  You then attach the label to a particular variable or set of variables using a tag. The tag begins with a backslash '\'.


For example, if instead of setting values to missing, you might wish to label them instead.


       Age 6-7 {Age of Respondent} \agelab


where the value label tag '\agelab' is defined using the VALUE LABEL command as shown in the example below.




VARIABLE LABELS Command

If you are using lists or range specifications in the VARIABLES command and need to define labels for those variables, then you must use the VARIABLE LABELS command.  This command is followed by as many variable name / variable label pairs as needed.


For example:


Variable Labels

    Housesat "Satisfaction with Housing"

    Incsat  "Satisfaction with Income"


will label the variable 'Housesat' as "Satisfaction with Housing" and the variable 'Incsat' as "Satisfaction with Income".


If the labels contain embedded blanks, they should be enclosed in single or double quotes.




VALUE LABELS Command

You can define sets of value labels with the VALUE LABEL command.  The elements are a tag, preceded by a backslash and then value/ label pairs. For instance:


Value Labels

    \Agelab

      0     Inapplicable

      98     "Not Ascertained"

      99     Refused


If the labels contain embedded blanks, they should be enclosed in single or double quotes.



DATA Command

The input data can be read from the same file as the Schema commands, rather than read from a separate file.  The DATA command, which must follow all of the rest of the Schema commands, tells Stat/Transfer to treat everything that follows as data.


When the DATA command is used, the FILE command must be omitted.



Schema File Comments

You can put comments in the Schema file with two slashes.  Everything that follows on the line will be treated as a comment.


  // this is a comment




An Example Schema File for a Fixed Format File


File "/my data files/new/big.file"

Variables

  ID     1-5 {Arbitrary ID Number}

  Age    6-7 [0,98,99] {Respondent's Age} \agelab

  Name   8-20 (A)

  Sex    21 \sex

  incsat housesat 22-23 [0,8,9] \sat

  famsat 52 [0,8,9] {Satisfaction with Family} \sat

Variable Labels

  Incsat  "Satisfaction with Income"

  Housesat "Satisfaction with Housing"

Value Labels

  \agelab  //We only need labels for missing values

         0  "Not applicable"

        98 "Not Ascertained'

        99 Refused

  \sex

        1 Male

        2 Female

  \sat

        1 "Not Satisfied"

        2 "Somewhat Satisfied"

        3 "Very Satisfied"

        8 "Not Ascertained"

        9 Refused



Assuming this Schema file had the name survey.sts, you would read it into Stat/Transfer by selecting "ASCII - Fixed format (S/T Schema)" from the user interface, and then selecting survey.sts for the input file name.  The Schema would then point to the data file big.file, in the directory /my data files/new/.


If the data file were named survey.dat and placed in the same directory as the Schema file, then the FILE command would be omitted, since Stat/Transfer would find it without it being named in the FILE command.


Alternatively, you could put the data in the same file as the Schema and use the DATA command at the end of the Schema, but before your data, to tell Stat/Transfer to read your data from the same file, starting in the line following the command.


Using Schemas for Delimited Data to Exert Control over your Data

There are several common problems that can be solved by this technique.  For instance you may wish to:



The procedure is simple.  Make sure the option in "Schema Variable Numbers" option in "ASCII - Write" is checked (that is the default).  Then Select "ASCII Text Delimited (S/T Schema) as the output file type.  For instance if you were to write the Stat/Transfer demo worksheet file demo.wk1, you would get the following schema:


FORMAT delimited commas

VARIABLES

   ID                       1         (A4)

   NAME                     2         (A12)

   BIRTHDATE                3         (%m/%d/%Y)

   INCOME                   4         (F4)

   EDUC                     5         (F2)

   SEX                      6         (F1)

   RATE                     7         (F8)

   COUNT                    8         (F7)




Note the variable numbers (1-8).  Those make it possible to arbitrarily re-arrange or delete variables.  Because these data have been optimized by default, the widths for each variable are the minimum in which they can be represented.  You can change their widths by making them wider, but not narrower.   But make sure you check the "General" Options to preserve numeric and/or string widths.


You can also change the types.  For instance if you wanted to make the variable ID, into a number rather than a string, you would simply change (A4) to (F4).


If you wanted to add a label, you can do it using techniques described above.  For instance "SEX" could become (simultaneously changing its name):

       Gender                6         (F1) {Sex of Respondent} /sex


The order of variables are easy to change, just move the lines around with your editor (but don't change the variable numbers!).  For instance if we wanted a two variable output dataset with a different order, we could have:


VARIABLES

       COUNT                    8         (F7)

       BIRTHDATE                3         (%m/%d/%Y)