document.write(header);

How to import SAS data set into R


November 7, 2007

1 Introduction

The import of data into R is still a bit complicated if the original data are stored in the SAS-data format. Though solutions have existed for a long time, see e.g. the R-documentation concerning Importing from other statistical systems, it is complicated to use these facilities. They are based on functions which need to call the SAS-program from R. The settings for doing this do not always correspond to the settings on the computers we use.

Instead we will focus on the export problem from the SAS point of view. We will present three different transfer formats, and illustrate some of their benefits and short-comings.

As a start we use a short SAS program to generate a dataset that will illustrate our main points. As shown in Table 1 the dataset has a very long name, some of variable are long and they even share the initial character sequence. In addition SAS formats (date and numeric) are specified for some of the variables. Both short and long labels are specified.


Table 1: SAS-code for generating example data set
  libname sasdata "C:/SASImport";
  
  data sasdata.RImportTest ;
  
  do i=1 to 10 ;
  
    x=i ;
    y=i*i ;
    if i>5 then y=. ;
    Weightincrease=y-x ;
    Weightincreasedate=mdy(11,i,2007) ;
    Dayasnumber =Weightincreasedate;
    output ;
  end;
  
  label x= x has a short description
        y= A very long description that is clearly longer than allowed;
  format y 8.1 Weightincreasedate date7. Dayasnumber 8.0;
  

2 Reading SAS files

To read SAS-files in R it is necessary to store them in an intermediate file format. Unfortunately, this cannot be done without some loss of information. The solutions presented here use three different intermediate formats. Each solution has its own benefits.

  1. Comma separated file (csv). This format can be read into R via the read.csv function
  2. Excel file format (xls). This format can be read into R via the read.xls function in the xlsReadWrite package.
  3. SAS xport file (xpt). This format can be read into R via the read.xport function in the SASxport package.

The most complicated part is to translate the SAS data format into the intermediate format. This should be done within SAS, and thus requires a SAS license. The second major obstacle is to transfer the specialised SAS information from the files, such as formatting (e.g. dates) and labels.

In the following the export to the three different formats is shown. It seems that three different methods have to be used within SAS. We will present each of these in the following section.

Actually, we have defined some R-functions that specify the SAS-program and executes SAS as a batch command automatically directly from R. The functions are presented in appendix A.

2.1 Writing to csv - files

This uses the PROC EXPORT in SAS. Actually, this is called by the Export data wizard of SAS. To use the wizard, choose File|Export data from the main menu in SAS, and it guides you through the necessary specification. You can choose to save the SAS-program during the process. Then you have a template that can be used later on. The resulting lines are shown in Table 2.


Table 2: SAS-code for export to csv file
  libname sasdata "C:/SASImport";
  proc export data=sasdata.rimporttest
        outfile= "rimporttest.csv"
        DBMS=CSV REPLACE;

The resulting rimporttest.csv file, may be read from the read.csv function in R.

2.2 Writing to xls - files

In principle and according to the documentation it should be possible to us the method described in section 2.1, if we specify DBMS=EXCEL in the PROC EXPORT. However, it does not seem to work, at least on my machine.

Therefore another method is employed. This method uses the new facility in SAS 9. Here you may specify the database management system directly in the libname statement. The necessary program statements are shown in Table 3.


Table 3: SAS-code for export to xls file
  libname sasdata "C:/SASImport";
  libname out excel  "C:/SASImport/file798b12e1.xls";
  data out.rimporttest; set sasdata.rimporttest; run;

There are two irritating infelicities in this approach. Firstly, the output file may not exist before. If this is the case the program ends with an error message. Therefore use a new name. Actually the filename used in Table 3 is generated in the R-function readsas.xls (see appendix A) using a random number generator. Secondly, the xls-file is locked by SAS until you release it and cannot be used by other programs. The easiest way of doing this, is to exit SAS.

Finally, the xls dataset is read into R using the read.xls function in the package xlsReadWrite.

2.3 Writing to xport - files

This is the most comprehensive transfer option. Originally it was used for copying SAS-files between SAS installed on different operating systems. Unfortunately, the facility has not been upgraded since SAS version 6. Therefore you need to downgrade your data set in two steps. First step is to reduce the name of the dataset to only 8 characters. You can do this using Windows explorer. E.g if your data is in the file longdatasetname.sas7bdat, copy the file and rename the new file to shortdat.sas7bdat. Then you can address the dataset as e.g. data a; set sasdata.shortdat; in SAS. You may also use this facility to copy the dataset, using the following program lines:

data sasdata.shortdat; set sasdata.longdatasetname;.

Second step is to use proc copy to reduce the length of variable names and labels, as shown in Table 4. You need initially to specify that you should restrict the valid names to those that were allowed in version 6 (options VALIDVARNAME=V6;). If the data statement is used for copying to a short dataset name, it is to specify this option after the copying. If it is specified before the dataset with the short name is generated, the copying is not allowed, because the long dataset name is not a valid name in version 6.


Table 4: SAS-code for export to xport file
  options VALIDVARNAME=V6;
  libname source  "C:/SASImport";
  libname xportout xport "C:/SASImport/le26ca58.xpt" ;
  proc copy in=source out=xportout memtype=data;
  select le26ca58; run;

Again the strange naming of the dataset is because it is generated by the random generator in R-

Finally, the xpt dataset is read into R using the read.xport function in the package SASxport.

3 Reading the dataset

To illustrate the result of the data import, the result is presented of importing the data set via the three different intermediate formats. The function read.sas in appendix A is used. However, the same result would have been obtained by running the three different SAS-programs to generate to files in the three formats, and subsequently, using either read.csv, read.xls or read.xport.

  > library(xtable)
  > library(SASxport)
  > library(xlsReadWrite)
  > datcsv <- readsas("rimporttest", via = "csv")
  > datxls <- readsas("rimporttest", via = "xls")
  > datxport <- readsas("rimporttest", via = "xport")

3.1 Output from csv-file

  > (datcsv)

      i  x  y Weightincrease Weightincreasedate Dayasnumber
  1   1  1  1              0            01NOV07       17471
  2   2  2  4              2            02NOV07       17472
  3   3  3  9              6            03NOV07       17473
  4   4  4 16             12            04NOV07       17474
  5   5  5 25             20            05NOV07       17475
  6   6  6 NA             NA            06NOV07       17476
  7   7  7 NA             NA            07NOV07       17477
  8   8  8 NA             NA            08NOV07       17478
  9   9  9 NA             NA            09NOV07       17479
  10 10 10 NA             NA            10NOV07       17480

  > str(datcsv)

  data.frame: 10 obs. of  6 variables:
   $ i                 : int  1 2 3 4 5 6 7 8 9 10
   $ x                 : int  1 2 3 4 5 6 7 8 9 10
   $ y                 : num  1 4 9 16 25 NA NA NA NA NA
   $ Weightincrease    : int  0 2 6 12 20 NA NA NA NA NA
   $ Weightincreasedate: Factor w/ 10 levels "01NOV07","02NOV07",..: 1 2 3 4 5 6 7 8 9 10
   $ Dayasnumber       : int  17471 17472 17473 17474 17475 17476 17477 17478 17479 17480

The variable names are correct. The date variable is transferred as a character string and need to be translated into the R date format, in a subsequent step. The labels of the variables are not transferred.

3.2 Output from xls-file

  > (datxls)

      i  x  y Weightincrease Weightincreasedate Dayasnumber
  1   1  1  1              0              39387       17471
  2   2  2  4              2              39388       17472
  3   3  3  9              6              39389       17473
  4   4  4 16             12              39390       17474
  5   5  5 25             20              39391       17475
  6   6  6 NA             NA              39392       17476
  7   7  7 NA             NA              39393       17477
  8   8  8 NA             NA              39394       17478
  9   9  9 NA             NA              39395       17479
  10 10 10 NA             NA              39396       17480

  > str(datxls)

  data.frame: 10 obs. of  6 variables:
   $ i                 : num  1 2 3 4 5 6 7 8 9 10
   $ x                 : num  1 2 3 4 5 6 7 8 9 10
   $ y                 : num  1 4 9 16 25 NA NA NA NA NA
   $ Weightincrease    : num  0 2 6 12 20 NA NA NA NA NA
   $ Weightincreasedate: num  39387 39388 39389 39390 39391 ...
   $ Dayasnumber       : num  17471 17472 17473 17474 17475 ...

In the xls format the variable names are also correct. The date variable is transferred as a integer value. The date value corresponds to the representation in R’s Date representation, but you need to specify that in a subsequent step. Note that the numeric representation of date values in SAS (variable Dayasnumber) differs from those in R and Excel (variable Weightincreasedate). The labels of the variables are not transferred.

A note of caution: There are some limitations to the size of an excel dataset, both concerning number of variables and number of observations. We have not yet tested whether this problem is circumvented because we do not call excel explicitly.

3.3 Output from xport-file

In the xport format the variable names are no longer correct, but translated into upper case names with a length of 8-characters. If the starting sequence is identical SAS uses digits to distinguish between the names. The date variable is transferred as a integer value but with the ClassDate. Thus they are printed in a readable format, and you can se from data structure that it is a date value. Note that the numeric representation of date values in SAS differs from those in R and Excel. Finally, the labels of the variables are transferred, even though the long label is shortened to a length of 40 characters.

  > (datxport)

      I  X  Y WEIGHTIN   WEIGHTI2 DAYASNUM
  1   1  1  1        0 2007-11-01    17471
  2   2  2  4        2 2007-11-02    17472
  3   3  3  9        6 2007-11-03    17473
  4   4  4 16       12 2007-11-04    17474
  5   5  5 25       20 2007-11-05    17475
  6   6  6 NA       NA 2007-11-06    17476
  7   7  7 NA       NA 2007-11-07    17477
  8   8  8 NA       NA 2007-11-08    17478
  9   9  9 NA       NA 2007-11-09    17479
  10 10 10 NA       NA 2007-11-10    17480

  > str(datxport)

  data.frame: 10 obs. of  6 variables:
   $ I       : int  1 2 3 4 5 6 7 8 9 10
   $ X       : atomic  1 2 3 4 5 6 7 8 9 10
    ..- attr(*, "label")= chr "x has a short description"
   $ Y       : atomic  1 4 9 16 25 NA NA NA NA NA
    ..- attr(*, "label")= chr "A very long description that is clearly"
   $ WEIGHTIN: atomic  0 2 6 12 20 NA NA NA NA NA
    ..- attr(*, "label")= chr "Weightincrease"
   $ WEIGHTI2:Class Date  atomic [1:10] 13818 13819 13820 13821 13822 ...
    .. ..- attr(*, "label")= chr "Weightincreasedate"
    .. ..- attr(*, "SASformat")= chr "DATE7."
   $ DAYASNUM: atomic  17471 17472 17473 17474 17475 17476 17477 17478 17479 17480
    ..- attr(*, "label")= chr "Dayasnumber"

  > label(datxport$X)

  [1] "x has a short description"

  > label(datxport$Y)

  [1] "A very long description that is clearly"

4 Summary

In most cases the use of the xls-format is probably the most convenient. If the dataset is documented with labels the xport format may be the best, even though the variable names is mangled. Finally, the csv-format may be the best for very large datasets that cannot be handled within the limitations of R.

A R-functions

In this section the R functions are presented. They may also be downloaded here.

A.1 readsas.r

  "readsas" <-
  function(dataname,libname=sasdata,dir=getwd(),via=csv){
  
  if (via==csv) {
     expdata<-readsas.csv(dataname,libname=libname,dir=dir)
     } else {
     if (via==xls) {
        expdata<-readsas.xls(dataname,libname=libname,dir=dir)
        } else {
        # via = xport assumed
        expdata<-readsas.xport(dataname,libname=libname,dir=dir)
        }}
  expdata
  }

A.2 readsas.csv.r

  readsas.csv <-
  function(dataname,libname=sasdata,dir=getwd()){
   libnamestr<-paste(libname ,libname, ",dir,";,sep=’’)
   exportstr<- paste(proc export data=,libname,.,dataname,sep=’’)
   outfilestr<-paste(      outfile= ",dataname,.csv",sep=’’)
   dbmsstr<-               DBMS=CSV REPLACE;
  
   sasprogname<-paste(dataname,"exportcsv.sas",sep="")
   sasprog<-cat(libnamestr,exportstr,outfilestr,dbmsstr,
                sep="\n",file=sasprogname)
   ProgramPath<-Sys.getenv(ProgramFiles)
   sascmd=paste(",ProgramPath,\\SAS\\SAS 9.1\\sas.exe",sep="")
   #sascmd="C:\\Program Files\\SAS\\SAS 9.1\\sas.exe"
   system(paste(sascmd, sasprogname))
   read.csv(paste(dataname,.csv,sep=""))
   }

A.3 readsas.xls.r

  "readsas.xls" <-
  function(dataname,libname=sasdata,dir=getwd()){
  
   newdataname <- substr(tempfile(tmpdir=’’),2,13)
   libnamestr <- paste(libname ,libname, ",dir,";,sep=’’)
   outlibnamestr <- paste(libname out excel , ",dir,/,
                        newdataname,.xls";,sep=’’)
   datastepstr <- paste(data out.,dataname,; set ,
                        libname,.,dataname,; run;,sep=’’)
  
   sasprogname <- paste(dataname,"exportxls.sas",sep="")
   sasprog <-ťcat(libnamestr,outlibnamestr,datastepstr,
               sep="\n",file=sasprogname)
   ProgramPath <- Sys.getenv(ProgramFiles)
   sascmd = paste(",ProgramPath,\\SAS\\SAS 9.1\\sas.exe",sep="")
   system(paste(sascmd, sasprogname))
   read.xls(paste(newdataname,.xls,sep=""))
   }

A.4 readsas.xport.r

  readsas.xport <-
  function(dataname,libname=sasdata,dir=getwd(),filetype=sas7bdat){
  
  if (!file.exists(paste(dataname,.,filetype,sep=’’))) {
       print(!warning filetype does not exist)}
  
   #newdataname<-dataname
   newdataname<-substr(tempfile(tmpdir=’’),4,11)
  
  if (nchar(newdataname)>8) {newdataname<-substr(newdataname,1,8)}
  file.copy(paste(dataname,.,filetype,sep=’’),
   paste(newdataname,.,filetype,sep=’’))
  
   optionstr<-paste(options VALIDVARNAME=V6;)
   libnamestr<-paste(libname source , ",dir,";,sep=’’)
   xportlibnamestr<-paste(libname xportout xport ,",dir,"/",
   newdataname,.xpt" ;,sep=’’)
   copystr<- paste(proc copy in=source out=xportout memtype=data;,sep=’’)
   selectstr<-paste(select ,newdataname,; run;,sep=’’)
  
   sasprogname<-paste(dataname,"exportxport.sas",sep="")
   sasprog<-cat(optionstr,libnamestr,xportlibnamestr,copystr,
     selectstr,sep="\n",file=sasprogname)
   ProgramPath<-Sys.getenv(ProgramFiles)
   sascmd=paste(",ProgramPath,\\SAS\\SAS 9.1\\sas.exe",sep="")
   system(paste(sascmd, sasprogname))
   file.copy(paste(newdataname,.xpt,sep=’’),paste(dataname,.xpt,sep=’’))
   read.xport(paste(newdataname,.xpt,sep=""))
   }
counter