Sampling Functions

Three functions are available for sampling.


Random Samples

The first function

   

     samp_rand(prop)

    

performs simple random sampling. Each case is selected with a probability equal to prop, so the size of the resulting sample is only approximately prop times the number of cases.


For example, to draw a random sample of about one tenth of a data set, use:

     

    where samp_rand(.1)
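
The selection rule described above can be sketched in Python. This is only an illustration of the idea, not the program's own code: the name samp_rand_sketch and the use of Python's random module are assumptions made for the example.

```python
import random

def samp_rand_sketch(cases, prop, rng=None):
    """Keep each case with probability prop (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    # rng.random() is uniform on [0, 1), so the test succeeds with probability prop.
    return [case for case in cases if rng.random() < prop]

cases = list(range(1000))
sample = samp_rand_sketch(cases, 0.1, random.Random(12345))
print(len(sample))  # near 100, but the exact count varies with the seed
```

Because every case is decided separately, the sample size is not fixed in advance.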


Random Samples of Fixed Size

The second function

    

    samp_fixed(sample_size,total_observations)

     

draws a random sample of fixed size. The first case is drawn with a probability of sample_size/total_observations. Each succeeding case is drawn with a probability of (sample_size - hits) / (total_observations - i), where hits is the number of cases already selected and i is the number of cases that precede it.


For example, if you had a data set of 1000 cases and wished for a random sample of 25 cases, you would specify:


    where samp_fixed(25,1000)
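
The probability rule for fixed-size samples can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program's own code: the name samp_fixed_sketch is hypothetical, Python's random module stands in for the real generator, and total_observations is assumed to equal the number of cases actually read.

```python
import random

def samp_fixed_sketch(cases, sample_size, total_observations, rng=None):
    """Select exactly sample_size cases in one pass (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    hits = 0      # cases selected so far
    chosen = []
    for i, case in enumerate(cases):  # i = number of cases that precede this one
        # Probability (sample_size - hits) / (total_observations - i):
        # it reaches 1 when every remaining case is needed, and 0 once the sample is full.
        if rng.random() < (sample_size - hits) / (total_observations - i):
            chosen.append(case)
            hits += 1
    return chosen

sample = samp_fixed_sketch(list(range(1000)), 25, 1000, random.Random(12345))
print(len(sample))  # 25
```

Adjusting the probability after each case is what guarantees the exact sample size, provided total_observations matches the true number of cases.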


Systematic Random Samples

Finally, a third function

    

    samp_syst(interval)


draws a systematic sample: after a random start, it selects every nth case, where n is the value of interval. For instance, to take every 6th case, use:


    where samp_syst(6)
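
A systematic sample can be sketched in Python as shown here. The sketch assumes that the random start falls uniformly within the first interval cases; the help text does not say how the start is chosen, and the name samp_syst_sketch is hypothetical.

```python
import random

def samp_syst_sketch(cases, interval, rng=None):
    """Take every interval-th case after a random start (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    # Assumption: the start is uniform over positions 0 .. interval - 1.
    start = rng.randrange(interval)
    return cases[start::interval]

sample = samp_syst_sketch(list(range(60)), 6, random.Random(12345))
print(len(sample))  # 10, since 60 cases divide evenly into intervals of 6
```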


Sampling Subsets of the Input Data

Expressions are evaluated from left to right, so you can sample from a subset of your cases by placing the subsetting condition first and the sampling function after it.


For example, to take a random half of high school graduates, use:


   where schooling >= 12 & samp_rand(.5)
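
The effect of putting the condition first can be sketched in Python. The sketch assumes that the sampling function is consulted only for cases that pass the earlier condition, which is how Python's "and" operator behaves; the record layout and the name sample_subset are invented for the example.

```python
import random

def sample_subset(records, keep, prop, rng=None):
    """Sample a proportion prop of the records that satisfy keep (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    # "and" short-circuits: the random draw happens only when keep(r) is true.
    return [r for r in records if keep(r) and rng.random() < prop]

# Made-up records with schooling values from 8 through 16.
records = [{"id": i, "schooling": 8 + i % 9} for i in range(200)]
half_of_grads = sample_subset(
    records, lambda r: r["schooling"] >= 12, 0.5, random.Random(7)
)
print(all(r["schooling"] >= 12 for r in half_of_grads))  # True
```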



Sampling Seed and Reproducible Samples

The random number generator underlying these sampling routines is 'rand_port()', from Jerry Dwyer, "Quick and Portable Random Number Generators," C Users Journal, June 1995, pp. 33-44. By default, it is seeded using a permutation of the time of day, so it yields a different sample on each run.


If you need a reproducible sample, use the same seed each time. Enter the seed in the General Options section of the Options dialog box. It should be a positive integer in the range of 1 through 2,147,483,646.
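
The effect of a fixed seed can be illustrated in Python. Python's random module uses a different generator than rand_port(), so this sketch shows only the general principle that the same seed reproduces the same sample; the name draw_sample and the seed value are arbitrary choices for the example.

```python
import random

def draw_sample(cases, prop, seed):
    """Draw a simple random sample from a generator seeded with seed (hypothetical helper)."""
    rng = random.Random(seed)  # same seed, same sequence of random numbers
    return [c for c in cases if rng.random() < prop]

first = draw_sample(range(1000), 0.1, seed=20240101)
second = draw_sample(range(1000), 0.1, seed=20240101)
print(first == second)  # True
```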