Data Analysis using the SAS Language/Examples
Sample code highlights features and demonstrates how to accomplish a task. Understanding the syntax of individual statements and procedures doesn't provide the high-level view needed for understanding SAS programs.
SAS is a kit full of tools and parts that are made to work well together. As you review these examples, you will see how the data step and proc steps dovetail to provide a powerful analytical platform.
Subject classification: this is a science resource.
Subject classification: this is a statistics resource.
Subject classification: this is an information technology resource.
Educational level: this is a tertiary (university) resource.
Analyzing data using by groups
This technique can be used with most SAS procedures. First, sort the data by a group identifier, then use the by statement in a procedure to do an independent analysis for each group.
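The sort-then-by pattern described above can be sketched on a small scale. This is only an illustration, assuming the SASHELP.CLASS sample data set that ships with SAS is available; it is not part of the wine example, and class_sorted is a name made up here.

```sas
/* Sort a copy of the sample data by the group identifier (sex). */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

/* The by statement makes the procedure report each group separately. */
proc means data=class_sorted n mean;
  by sex;
  var height weight;
run;
```

Using out= leaves the original data set untouched, which is convenient when the input is read-only, as SASHELP data sets usually are.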
Problem Description
Statistics were collected on different characteristics for a variety of wines from different vineyards and vintages. The goal, for this problem, is to build a regression model of wine sales using these characteristics. A separate model is developed for each vineyard using the same variables.
SAS Code
**********************************************************************************;
***
*** The wines data set has eight variables: ;
*** vineyard, label, vintage, pct_alchohol, aroma, coloration, price, sales.;
*** vineyard, label and vintage can all be used as group ids;
*** First lets examine all the sales for each vineyard independently ;
*** ;
**********************************************************************************;
Proc sort data=wines;
by vineyard;
run;
*** The Sort prepares the data so it can be analyzed for each vineyard;
proc corr data=wines;
title "Correlation matrix for wine data";
by vineyard;
var vintage pct_alchohol aroma coloration price sales;
run;
proc reg data=wines;
title "Regression model for sales for each vineyard";
by vineyard;
model sales=vintage pct_alchohol aroma coloration price;
run;
quit;
*** Now lets use both vineyard and label;
Proc sort data=wines;
by vineyard label;
run;
*** Sort prepares the data so it can be analyzed for each combination of vineyard and label;
proc corr data=wines;
title "Correlation matrix for wine data";
by vineyard label;
var vintage pct_alchohol aroma coloration price sales;
run;
proc reg data=wines;
title "Regression model for sales for each vineyard and label";
by vineyard label;
model sales=vintage pct_alchohol aroma coloration price;
run;
quit;
Explanation
See: proc sort for more information on sorting.
The data set must be sorted by the variable that defines the groups for each analysis, in this case vineyard. Sorting puts a note in the data set so other procedures know that the data set is sorted and by which variables.
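One way to look at the sort information mentioned above is to print the data set description with Proc Contents after sorting. This is a minimal sketch, assuming the wines data set from the example exists in the WORK library; the exact layout of the listing depends on the SAS version.

```sas
/* Sort by the group identifier, as in the example above. */
proc sort data=wines;
  by vineyard;
run;

/* The listing normally reports whether the data set is sorted
   and which variables it is sorted by.                        */
proc contents data=wines;
run;
```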
Conducting a T-test to determine change
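T-tests, or their equivalent, are available in several SAS procedures. Proc TTest, used with a class statement, compares one set of observations with another set, using the class variable to distinguish each group. That form does not fit the case where the same subjects are measured before and after an effect; there the test must be made on the per-subject differences. Proc TTest offers a PAIRED statement for this case; the example below instead computes the differences in a data step and tests them with Proc Means.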
Problem Description
A sample of students was selected and measured using a test. Then an effect, or intervention, was applied and the students were tested again. Now we have two data sets, one for the pretest period and the other for the posttest period.
We need to calculate the difference between pre and post for each student and test whether there is a significant difference. This is a one-tailed test; however, SAS reports a two-tailed probability, so the result needs to be adjusted to get the one-tailed result. When the mean difference lies in the hypothesized direction, the one-tailed probability is half the two-tailed value.
SAS Code
proc sort data=preTest;
by student_ID;
run;
proc sort data=postTest;
by student_ID;
run;
data testGroup;
merge preTest(in=a rename=(score=pre))
postTest(in=b rename=(score=post));
by student_ID;
if a and b;
difference=post-pre;
run;
Title "Student's T test of difference";
*** No by statement here. The test is made across all students and a by group of one student has no variance;
proc means data=testGroup n mean stddev stderr t prt;
var difference;
run;
Explanation
The two data sets, pretest and posttest, have the same variables, student_ID and score. In order to merge them, they must be sorted by student_ID and score must be renamed. Failure to rename score will cause the variable to have only the value from the posttest data set. There will be four variables in the final data set: student_ID, pre, post and difference. Difference is obtained by subtracting pre from post. It is the change in score for each student for the period from before until after the intervention. Proc Means produces the Student's t statistic and the probability of getting a value at least this extreme given that the null hypothesis is true (i.e. mean difference = 0).
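The one-tailed adjustment mentioned in the problem description can be sketched as follows. This is an illustration rather than part of the original example: it assumes the testGroup data set built above and an alternative hypothesis that scores increase (post greater than pre). The data set names tstats and tstats_one_tailed and the variable names in the output statement are made up here.

```sas
/* Capture the t statistic and the two-tailed probability in a data set. */
proc means data=testGroup noprint;
  var difference;
  output out=tstats n=n mean=mean_diff t=t_value probt=p_two_sided;
run;

/* Convert to a one-tailed probability for the assumed alternative
   (mean difference greater than zero).                            */
data tstats_one_tailed;
  set tstats;
  if t_value > 0 then p_one_sided = p_two_sided / 2;
  else p_one_sided = 1 - p_two_sided / 2;
run;

proc print data=tstats_one_tailed noobs;
  var n mean_diff t_value p_two_sided p_one_sided;
run;

/* Cross-check, assuming SAS 9.2 or later: the PAIRED statement tests
   post - pre, and sides=u requests the upper one-sided test.         */
proc ttest data=testGroup sides=u;
  paired post*pre;
run;
```

If the alternative hypothesis is instead that scores decrease, the two branches of the if statement swap.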
Reading from multiple text files
Purpose
Data for different parts of an organisation are often delivered as separate files. These files may be organised the same way, with the same variables and perhaps the same format. Manually concatenating these files is not always convenient and adds overhead. Here is some code for automating the process. It also demonstrates some new statement options.
SAS Code
%let path=c:\reports\;
filename getdir pipe "dir &path.*.txt /b";
data dir_rpts;
length filename $ 260;
infile getdir truncover;
*** Read the whole file name. A $20. informat would cut off longer names;
input filename $260.;
filename="&path."||filename;
run;
data region_data;
set dir_rpts;
infile dummy filevar=filename end=eof;
do while (not eof) ;
input region $15. @20 sales 10.;
output;
end;
run;
Explanation
An assumption has been made that all the reports are in a directory called reports on the C: drive and that these files have the extension txt. The pipe option lets SAS read from the output of the dir command (the list of files with the txt extension). A complete file name is created by prefixing the file name with the path. This list of files is stored in a SAS data set, which is the input for the next step.
The filevar option on the infile statement gives the name of the variable that contains the file name that we stored in the dir_rpts data set. The variable eof will be false until the end-of-file condition is reached for the current file. Once the end of the file is reached, the do loop ends and the next file name is obtained from the dir_rpts data set. That file is then read and the process repeats. The data step ends when all the file names in dir_rpts have been processed.
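A common variation keeps track of which file each record came from. The sketch below is an illustration under stated assumptions: it relies on the documented behaviour that the variable named in filevar= is not written to the output data set, so the name is copied into a separate variable. The variable source_file is made up here, and the truncover option is an addition that guards against short lines; neither is part of the original code.

```sas
data region_data;
  set dir_rpts;
  length source_file $ 260;
  /* truncover (an addition) stops input from running on to the
     next line when a record is shorter than expected.           */
  infile dummy filevar=filename end=eof truncover;
  /* Copy the file name so that it is kept with every record. */
  source_file = filename;
  do while (not eof);
    input region $15. @20 sales 10.;
    output;
  end;
run;
```

With source_file in the data set, a later step can sort by it and use the by statement to analyse each report separately.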