【SAS ADVANCE】Performing Advanced Queries Using PROC SQL

一、Objectives

display all rows, eliminate duplicate rows, and limit the number of rows displayed
subset rows using other conditional operators and caculated values
enhance the formatting of query output
Using summary functions, such as COUNT, with and without grouping
subset groups of data by using the HAVING clause
subset data by using correlatedd and noncorrelated subqueries
Validate query syntax.

二、SELECT语句的语法

1 SELECT column-1<,...,column-n>                       /*SELECT指定用户需要输出到output内的列*/
2       FROM table-1|view-1<,...,table-n|view-n>       /*FROM指定去查询的table 或者view*/
3       <WHERE expression>                             /*WHERE子句：用表达式来subset或者restrict数据集的条件*/
4       <GROUP BY column-1<,...,column-n>>            /*GROUP BY按后面指定的列将数据集分成若干组*/
5       <HAVING expression>                          /*HAVING子句：在group条件下，用表达式subset或者restrict分组后的数据*/
6       <ORDER BY column-1<,...,column-n>>;         /*ORDER BY：根据其后的变量对查询结果进行排序*/

【备注】：PROC SQL SELECT语句中的子句需要按照上述顺序排列。

三、Displaying All Columns

1. 利用SELECT *：可将所有列呈现出来

1 proc sql;
2      select *
3      from sasuser.staffchanges;

2. FEEDBACK选项（debugging tools：让用户可以清楚的看到what is being submitted to the SQL processor）

当指定SELECT *语句时，PROC SQL中的FEEDBACK选项则会在日志中输出expand list of columns（每一列的详细名称）。例如：

1 proc sql feedback;
2     select *
3         from sasuser.staffchanges;

则日志中会输出：

202 proc sql feedback;
203 select * 
204 from sasuser.staffchages;
NOTE: Statement tranforms to:
    
       select STAFFCHANGES.EmpID,
STAFFCHANGES.LastName, STAFFCHANGES.FirstName,
STAFFCHANGES.City, STAFFCHANGES.State,
STAFFCHANGES.PhoneNumber
       from SASUSER.STAFFCHANGES

【备注】：日志不仅仅将星号（*）展开成详细列表，还会resolves macro variables and places parentheses around expressions to show their order of evaluation.

四、控制输出行的方法

1. 通过OUTOBS=选项来限制 the Number of Rows Displayed

      General form:PROC SQL statment with OUTOBS= option:

      PROC SQL OUTOBS=n;  /*其中n指定了输出的行数，这里的OUTOBS=选项类似于DATA SET选项中的OBS=*/

【备注】：这里的OUTOBS=选项仅仅是限制了display出来的行数，但是没有限制读入的行数。如果用户需要限制读入的行数，则可用INOBS=选项来控制。

【例子】：

1 proc sql outobs=10;
2      select flightnumber,date
3          from sasuser.flightschedule;

日志提示：WARING: Statement terminated early due to OUTOBS=10 option.

2. 通过SELECT语句中的DISTINCT关键词来剔除重复行

【例子】：

/*代码1*/                                                /*代码2*/

proc sql outobs=12;                                     proc sql ;
     select flightnumber, destination                       select distinct  flightnumber, destination  
          from sasuser.internationalflights;                   from sasuser.internationalflights;
                                                                order by 1;/*按照SELECT语句中的第一个变量排序*/

代码2输出的结果，相比代码1，将其重复的Flightnumber-Destination组合删掉了。

3. 利用PROC SQL中的条件运算符subseting rows

PROC SQL中的条件运算符包括三种：comparison（例如：where membertype='GOLD'）

logical（例如：where visits<=3 or status='new'）

concetenation（连接符，例如：where name=trim(last)||','||first）

【例】：

1 proc sql;
2      select ffid, name, state, pointsused
3      from sasuser.frequentflyers
4      where member='GOLD'  and pointsused>0
5      order by pointsused;

【其他的条件运算符】：

Conditional Operator	Tests for...	Example
BETWEEN-AND	valules that occur within an inclusive range	where salary between 70000 and 80000
CONTAINS或者?	values that是否含有某一指定字符串	where name contains ‘ER’ 等价于where name ? 'ER'
IN	values that match列表中的某一个值	where code in ('PT','NA','FA')
IS MISSING或IS NULL	缺失值	where dateofbirth is missing或where dataofbirth is null
LIKE(with %,_)	values that match a specified pattern	where address like '% P%PLACE' （筛选出以字母P开头并以单词PLACE结尾的字符串）
=*	values that sounds like a specified value	*where lastname = 'Smith'**
ANY	values that meet a specified condition with respect to any one of the values returned by a subquery	where dataofbirth <any(select dateofbirth from sasuser.parollmaster where jobcode='FA3')
ALL	values that meet a specified condition with respect to all the values returned by a subquery	where dataofbirth <all(select dateofbirth from sasuser.parollmaster where jobcode='FA3')
EXISTS	the existence of values returned by a subquery	*where exists(select from sasuser.flightschedule where fa.empid=flightschedule.empid)**

【备注】：

上述条件运算符除了ANY, ALL和EXISTS外，其他都可以用于SAS的其他procedure中；
若需要对条件进行否定，则只需在运算符前面加上NOT即可（对ANY和ALL不适用）；
在CONTAIN运算符做匹配时，区分大小写；

在IS MISSING或IS NULL语句中，也可以用where boarded=.（数值型）或where flight=' '（字符型）来判断；

1 例子（供筛选的名字：Schmitt, Smith, Smithson, Smitt, Smythe）：
2 
3 where lastname =* 'Smith';
4 
5 上述供筛选的名字里面，只有Schmitt和Smithson没有被选中。

在LIKE条件运算符中，_代表任意单个字符，而%代表任意sequence of zero or more characters.（注意，有时_和%也会被当做通配符）

/*可供选择的几个名字列表：Diana, Diane, Dianna, Dianthus, Dyan*/
/*LIKE Pattern*/                          /*Name(s) Selected*/
LIKE 'D_an'                                      Dyan
LIKE 'D_an_'                                     Diana, Diane
LIKE 'D_an__'                                    Dianna
LIKE 'D_an%'                                     上述所有名字

=*（Sounds-like条件运算法）：souds-like利用SOUNDEX逻辑一次比较两个单词或表达式的每一列的值，筛选出任意contain a value that sounds like another value that you specify。

五、通过Calculated Values 来subsetting rows

1. PROC SQL运行Calculated Columns的原理

我们通过一个例子来说明PROC SQL运行calculated columns的原理，若在SAS中运行如下代码：

1 proc sql outobs=10;
2      select flightnumber, date, destination, 
3                boarded+transferred+nonrevenue as Total
4             from sasuser.marchflights
5             where total<100;

在这段代码运行后，SAS会在日志中提示错误：ERROR: The following columns were not found in the contributing tables: total.

【原因】：在SQL查询中，WHERE子句会比SELECT子句先执行。这样的话，SQL在表中查找WHERE子句中的变量名，而total此时还没有生成，因此会提示错误。

2. 在WHERE子句中：利用关键词CALCULATED来引用SELECT子句中新定义的列

用户可在WHERE子句中加入CALCULAED关键词，来引用新定义的列。例如：

1 proc sql outobs=10;
2       select flightnumber, date, destination, 
3                 boarded+transferred+nonrevenue as Total
4              from sasuser.marchflights
5              where calculated total<100;

3. 在SELECT子句中：利用关键词CALCULATED来引用SELECT子句中新定义的列

若在SELECT子句中定义了一个类似与上面Total的新列，然后还需要在Total的基础上再定义第二个新列，则也需要引用CALCULATED关键词来实现，例如：

1  proc sql outobs=10;
2        select flightnumber, date, destination, 
3                  boarded+transferred+nonrevenue as Total，
4                  calculated total/2 as Half
5               from sasuser.marchflights
6               where calculated total<100;

六、改善Query输出

1. 指定列的Format及Label

在默认情况下，PROC SQL输出列的格式为查询的table中原有格式（若格式已保存在table中）或系统默认格式（若未保存在table中）。若用户需要控制输出的列的格式，可通过指定类似于LABEL=和FORMAT=等data set选项来实现。当用户在SELECT子句中定义一个新列示，除了分配一个新的名字外，也可以通过分配一个label来实现。

【Data Set Option】: LABEL= (select hiredate label='Date of Hire')/*若用户不希望在输出中出现label，则可以指定NOLABEL系统选项*/

FORMAT=(select hiredate format=date9.)

【例】

1 proc sql outobs=15;
2     select empid labe='Employee ID',
3               jobcode label='Job Code',
4               salary,
5               salary *.10 as Bonus format=dollar12.2
6     from sasuser.payrollmaster
7     where salary>75000
8     order by salary desc;

2. 指定Titles和Footnotes

在PROC SQL查询中，若用户需要指定titles和footnotes，则须将TITLE和FOOTNOTE语句放在PROC SQL语句之前或者放在介于PROC SQL语句和SELECT语句之间。例如：

 1 proc sql outobs=15;
 2 title 'Current Bonus Information';
 3 title2 ‘Employee with Salaries > $75,000’;
 4     select empid label='Employee ID',
 5               jobcode label='Job Code',
 6               salary,
 7               salary*.10 as Bonus format=dollar12.2
 8      from sasuser.payrollmaster
 9      where salary>75000
10      order by salary desc;

3. 在输出中增加Character Constant

在PROC SQL中，用户可在SELECT子句中加入一个由引号括起来的字符串，这样就定义了一个类似于常值字符型的列。例如：

 1   proc sql outobs=15;
 2   title 'Current Bonus Information';
 3   title2 ‘Employee with Salaries > $75,000’;
 4       select empid label='Employee ID',
 5                 jobcode label='Job Code',
 6                 salary,
 7                 ‘bonus is:’,
 8                 salary*.10 format=dollar12.2
 9        from sasuser.payrollmaster
10        where salary>75000
11        order by salary desc;

在运行上述代码后，output中就会在salary后输出两列：一列是常值字符串列（所有的value都为bonus is:），另一列是由salary*.10构成的格式为dollar12.2的列，此时就不需要再为其定义一个别名了。

七、 Summarizing和Grouping Data

我们首先从一个例子出发，展示用summary函数来产生统计结果的方法：

1 proc sql;
2     select membertype, avg(milestraveled) as AvgMilesTraveled
3     from sasuser.frequentflyers
4     group by membertype

【SQL中影响summary函数的四个关键因素】（关键在于搞明白以下四种因素是怎么影响summary函数的作用的）

whether the summary function specifies one or multiple columns as arguments
whether the query contains a GROUP BY clause
if the summary function is specified in a SELECT clause, whether there are additional columns listed that are outside of a summary function
whether the WHERE clause, if there is one, contains only columns that are specified in the SELECT clause.

1. Summary函数及其参数个数

summary函数的参数可以是一个，也可以是多个，并且放在圆括号内。

有的函数，例如AVG、COUNT，是单参数函数。例如：proc sql; select avg(salary) as AvgSalary from sasuser.payrollmaster;

而有一些函数，例如MEAN和N，适用于单个参数，也可适用于多个参数。例如：proc sql outobs=10; select sum(boarded, transferred, nonrevenue) as Total from sasuser.data;

2. 作用在分组数据上的Summary函数

若query子句中包含GROUP BY子句，则SELECT子句中所有没有被summary函数作用的列，都应该在GROUP BY子句中列出来，否则可能会出现意料之外的结果。

【例】

1 proc sql outobs=10;
2      select jobcode, avg(salary) as AvgSalary
3      from sasuser.payrollmaster
4      group by jobcode;

3. SELECT子句与Summary函数的作用方式

在SELECT子句中，除了summary函数作用的列外，还可以列出其余没有被summary函数作用的列。SELECT子句中这些没有被summary函数使得PROC SQL展现出不同的output。下面举两个例子来说明这种不同：

 【例1】proc sql;    
          select avg(salary) as AvgSalary from sasuser.payrollmaster;

例1的SELECT子句中，没有列是outside of summary functions的，因此在其输出结果中：

若没有指定了GROUP BY子句，则通过summary函数输出整张表的一个single value;

若指定了GROUP BY子句，则对每一个组输出summary函数作用的single value。

【例2】 proc sql;  
                 select jobcode, gender, avg(salary) as AvgSalary 
                  from sasuser.payrollmaster
                   group by jobcode,gender ;

若没有指定了GROUP BY子句，则通过summary函数输出整张表的一个single value;

若指定了GROUP BY子句，则对每一个组displays all the rows of output with the single or grouped values repeated。

4. Argument（列）的个数对summary函数作用方式的影响

我们通过下面两个列子来说明列的个数对summary函数作用方式的影响：若参数为1个，则对所有数据集作用

【例1】 proc sql;  
         select avg(salary) as AvgSalary from sasuser.payrollmaster; /*PROC SQL查询结果为所有员工的薪水的平均值*/

【例2】 proc sql outobs=10;
                 select sum(boarded, transferred, nonrevenue) as Total
                 from sasuser.marchflights;
/*由于summary function(sum)包含了多个参数，因此结果中每一行都计算了作用于多列参数的统计量。在这里，输出了一列10行的数据集，每一行是这三列的值的和*/

5. GROUP BY对summary函数作用方式的影响

下面我们通过三种不同的情况下的summary函数的作用方式，来说明这三种情况的不同：

without GROUP BY子句

【例1】/*这里没有summary函数作的参数以外的列，因此仅输出一列：所有新水的平均值*/
 proc sql;  
         select avg(salary) as AvgSalary
          from sasuser.payrollmaster;

with Columns Outside of the Function

【例2】/*这里的输出结果为20行的数据集，第一列jobcode（不是summary函数生成，所以不同）；第二列为所有的salary的和，为single value，每一行都相同*/
 proc sql outobs=20;
        select jobcode, avg(salary) as AvgSalary
          from sasuser.marchflights;

with a GROUP BY子句

【例3】/*这里的输出结果为对每一个jobcode分组的值，分别输出结果*/
 proc sql;
        select jobcode, avg(salary) as AvgSalary format=dollar11.2
          from sasuser.marchflights
          group by jobcode;