The import of data into R is still a bit complicated if the original data are stored in the SAS-data format. Though solutions have existed for a long time, see e.g. the R-documentation concerning Importing from other statistical systems, it is complicated to use these facilities. They are based on functions which need to call the SAS-program from R. The settings for doing this do not always correspond to the settings on the computers we use.
Instead we will focus on the export problem from the SAS point of view. We will present three different transfer formats, and illustrate some of their benefits and short-comings.
As a start we use a short SAS program to generate a dataset that will illustrate our main points. As shown in Table 1 the dataset has a very long name, some of variable are long and they even share the initial character sequence. In addition SAS formats (date and numeric) are specified for some of the variables. Both short and long labels are specified.
libname sasdata "C:/SASImport";
data sasdata.RImportTest ;
do i=1 to 10 ;
if i>5 then y=. ;
label x= ’x has a short description’
y= ’A very long description that is clearly longer than allowed’;
format y 8.1 Weightincreasedate date7. Dayasnumber 8.0;
To read SAS-files in R it is necessary to store them in an intermediate file format. Unfortunately, this cannot be done without some loss of information. The solutions presented here use three different intermediate formats. Each solution has its own benefits.
The most complicated part is to translate the SAS data format into the intermediate format. This should be done within SAS, and thus requires a SAS license. The second major obstacle is to transfer the specialised SAS information from the files, such as formatting (e.g. dates) and labels.
In the following the export to the three different formats is shown. It seems that three different methods have to be used within SAS. We will present each of these in the following section.
Actually, we have defined some R-functions that specify the SAS-program and executes SAS as a batch command automatically directly from R. The functions are presented in appendix A.
This uses the PROC EXPORT in SAS. Actually, this is called by the Export data wizard of SAS. To use the wizard, choose File|Export data from the main menu in SAS, and it guides you through the necessary specification. You can choose to save the SAS-program during the process. Then you have a template that can be used later on. The resulting lines are shown in Table 2.
The resulting rimporttest.csv file, may be read from the read.csv function in R.
In principle and according to the documentation it should be possible to us the method described in section 2.1, if we specify DBMS=EXCEL in the PROC EXPORT. However, it does not seem to work, at least on my machine.
Therefore another method is employed. This method uses the new facility in SAS 9. Here you may specify the database management system directly in the libname statement. The necessary program statements are shown in Table 3.
There are two irritating infelicities in this approach. Firstly, the output file may not exist before. If this is the case the program ends with an error message. Therefore use a new name. Actually the filename used in Table 3 is generated in the R-function readsas.xls (see appendix A) using a random number generator. Secondly, the xls-file is locked by SAS until you release it and cannot be used by other programs. The easiest way of doing this, is to exit SAS.
Finally, the xls dataset is read into R using the read.xls function in the package xlsReadWrite.
This is the most comprehensive transfer option. Originally it was used for copying SAS-files between SAS installed on different operating systems. Unfortunately, the facility has not been upgraded since SAS version 6. Therefore you need to downgrade your data set in two steps. First step is to reduce the name of the dataset to only 8 characters. You can do this using Windows explorer. E.g if your data is in the file longdatasetname.sas7bdat, copy the file and rename the new file to shortdat.sas7bdat. Then you can address the dataset as e.g. data a; set sasdata.shortdat; in SAS. You may also use this facility to copy the dataset, using the following program lines:
data sasdata.shortdat; set sasdata.longdatasetname;.
Second step is to use proc copy to reduce the length of variable names and labels, as shown in Table 4. You need initially to specify that you should restrict the valid names to those that were allowed in version 6 (options VALIDVARNAME=V6;). If the data statement is used for copying to a short dataset name, it is to specify this option after the copying. If it is specified before the dataset with the short name is generated, the copying is not allowed, because the long dataset name is not a valid name in version 6.
Again the strange naming of the dataset is because it is generated by the random generator in R-
Finally, the xpt dataset is read into R using the read.xport function in the package SASxport.
To illustrate the result of the data import, the result is presented of importing the data set via the three different intermediate formats. The function read.sas in appendix A is used. However, the same result would have been obtained by running the three different SAS-programs to generate to files in the three formats, and subsequently, using either read.csv, read.xls or read.xport.
The variable names are correct. The date variable is transferred as a character string and need to be translated into the R date format, in a subsequent step. The labels of the variables are not transferred.
In the xls format the variable names are also correct. The date variable is transferred as a integer value. The date value corresponds to the representation in R’s Date representation, but you need to specify that in a subsequent step. Note that the numeric representation of date values in SAS (variable Dayasnumber) differs from those in R and Excel (variable Weightincreasedate). The labels of the variables are not transferred.
A note of caution: There are some limitations to the size of an excel dataset, both concerning number of variables and number of observations. We have not yet tested whether this problem is circumvented because we do not call excel explicitly.
In the xport format the variable names are no longer correct, but translated into upper case names with a length of 8-characters. If the starting sequence is identical SAS uses digits to distinguish between the names. The date variable is transferred as a integer value but with the ClassDate. Thus they are printed in a readable format, and you can se from data structure that it is a date value. Note that the numeric representation of date values in SAS differs from those in R and Excel. Finally, the labels of the variables are transferred, even though the long label is shortened to a length of 40 characters.
In most cases the use of the xls-format is probably the most convenient. If the dataset is documented with labels the xport format may be the best, even though the variable names is mangled. Finally, the csv-format may be the best for very large datasets that cannot be handled within the limitations of R.
In this section the R functions are presented. They may also be downloaded here.