从员工表和部门表联合查询的不同方式看CBO对不同SQL的优化

名词解释：

CBO Cost-Based Optimization 基于代价的优化器

往下进入正题：

有一张员工表这样设计：

create table emp(
id int,
name nvarchar2(20),
deptid int,
primary key(id));

可以这样给它塞入三十万条记录：

insert into emp
select rownum,
dbms_random.string('*',dbms_random.value(6,20)),
dbms_random.value(1,10)
from dual
connect by level<300001;

还有一张部门表这样设计：

create table dept(
id int,
name nvarchar2(20),
primary key(id));

这回选择逐条插值：

insert into dept(id,name) values('1','dev');
insert into dept(id,name) values('2','prod');
insert into dept(id,name) values('3','sales');
insert into dept(id,name) values('4','postsales');
insert into dept(id,name) values('5','market');
insert into dept(id,name) values('6','lab');
insert into dept(id,name) values('7','research');
insert into dept(id,name) values('8','adv');
insert into dept(id,name) values('9','hr');
insert into dept(id,name) values('10','mng');

需求出来了，要找员工名以AK打头，属于生产prod、销售sales、售后postsales和市场market四个部的员工。

按我机器上产生的结果，以AK开头的记录是466条，占emp表总数的0.15%；四个部的员工占dept表总数的4成。

按常规理解，应该是先过滤再连接最高效，但既然是实验，就把能出来正确结果的各种SQL都试试，比较一下。

满足这个需求的SQL不少，首先便是：

1.
select emp.name,dept.name from emp,dept where emp.deptid=dept.id and emp.name like 'AK%' and dept.id in (2,3,4,5)

这条SQL是让员工表和部门表内联，然后让表连接条件和两个过滤条件统一写到where从句中。

这是常规解法，效率未纳入考量范围。让我们看看解释计划是怎样的：

SQL> explain plan for select emp.name,dept.name from emp,dept where emp.deptid=dept.id and emp.name like 'AK%' and dept.id in (2,3,4,5);

已解释。

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   483   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("DEPT"."ID"=2 OR "DEPT"."ID"=3 OR "DEPT"."ID"=4 OR
              "DEPT"."ID"=5)
   3 - filter("EMP"."NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   - dynamic sampling used for this statement (level=2)

已选择23行。

从解释计划里，我们发现CBO是让两表先过滤再连接，cost是486. 也就是说这种SQL没考虑效率问题，CBO帮咱们考虑了，并按最优方案进行。

第二种SQL

select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id

这句SQL知道先主动过滤两表再进行连接，表面上是比第一种要快的，让我们看看解释计划里它会如何表现：

SQL> explain plan for select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

已解释。

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   483   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter("NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----
   - dynamic sampling used for this statement (level=2)

已选择22行。

我们发现，CBO知道该先过滤再连接，如果SQL这样做了，它也就顺其自然，于是解释计划和上面一样，Cost也是486.

第三种SQL

select a.name,a.dname from (select emp.name,dept.name as dname,dept.id from emp,dept where emp.deptid=dept.id) a where a.name like 'AK%' and a.id in (2,3,4,5)

这种SQL是不管数据多少故意硬让emp和dept两表连接，形成一个逻辑大表，再在出来的结果集里筛选。

理论上，我们知道这是最费力的方案，但是，Oracle有CBO，它会对SQL进行优化，看解释计划暴露的CBO意图如何？

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   486   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   483   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("DEPT"."ID"=2 OR "DEPT"."ID"=3 OR "DEPT"."ID"=4 OR
              "DEPT"."ID"=5)
   3 - filter("EMP"."NAME" LIKE U'AK%' AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   - dynamic sampling used for this statement (level=2)

已选择23行。

看，即使SQL里强制先连接表再过滤，知道最佳方案是怎样的CBO是不会舍近求远的，它还是选择最佳路径前进，所以cost依旧是486.

看到这里，你会知道过于纠结一些SQL的写法无意义，CBO会知道如何优化，就像java虚拟机对字符串+的优化；另一方面，SQL跑得顺溜也未必是SQL写得正确高效，或许是CBO在背后默默奉献。

这个例子只是简单情况，所以CBO能一直坚持最短路径，但复杂情况下未必合理，我们还是要秉承先过滤后连接的方针不变，就像第二种方案里的SQL那样写。

如果给emp表的name单列加索引会是什么效果呢？cost是增长还是有效降低？。

让我们先加索引，再对第二SQL(先过滤后连接)查看解释计划。

SQL> create index idx_emp_name on emp(name);

索引已创建。

SQL> explain plan for select a.name,b.name from (select * from emp where name like 'AK%') a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

已解释。

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 262175135

--------------------------------------------------------------------------------
-------------

| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU
)| Time     |

--------------------------------------------------------------------------------
-------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |    45 |  3150 |   467   (1
)| 00:00:06 |

|*  1 |  HASH JOIN                   |              |    45 |  3150 |   467   (1
)| 00:00:06 |

|*  2 |   TABLE ACCESS FULL          | DEPT         |     4 |   140 |     3   (0
)| 00:00:01 |

|*  3 |   TABLE ACCESS BY INDEX ROWID| EMP          |   111 |  3885 |   463   (0
)| 00:00:06 |

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

|*  4 |    INDEX RANGE SCAN          | IDX_EMP_NAME |   362 |       |     5   (0
)| 00:00:01 |

--------------------------------------------------------------------------------
-------------


Predicate Information (identified by operation id):
---------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter("EMP"."DEPTID"=2 OR "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR
              "EMP"."DEPTID"=5)
   4 - access("NAME" LIKE U'AK%')
       filter("NAME" LIKE U'AK%')

Note
-----
   - dynamic sampling used for this statement (level=2)

已选择25行。

看来对name单列加索引对半模糊查询的优化效果极其有限。

下面我们用instr函数取代 like ‘AK%’，看是否有惊喜。

SQL：

select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

执行情况：

SQL> create index idx_emp_name on emp(name);

索引已创建。

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

已解释。

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   484   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter(INSTR("NAME",U'AK')=1 AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Cost反而上升了1，看来对name加单列索引，对查询的改变都很有限。

下面我们对emp表的name和deptid字段加联合索引，看是否有改善。

SQL> create index idx_emp_name_deptid on emp(name,deptid);

索引已创建。

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id in (2,3,4,5)) b where a.deptid=b.id;

已解释。

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   487   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   484   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5)
   3 - filter(INSTR("NAME",U'AK')=1 AND ("EMP"."DEPTID"=2 OR
              "EMP"."DEPTID"=3 OR "EMP"."DEPTID"=4 OR "EMP"."DEPTID"=5))

Note
-----
   - dynamic sampling used for this statement (level=2)

已选择22行。

结果和对name加单列索引是一样的，cost还上升了1.

下面我们把dept的in查询改成范围查询，还是用上面创建的联合索引，看看情况如何。

SQL：

select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id>1 and id<6 ) b where a.deptid=b.id;

执行：

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id>1 and id<6 ) b where a.deptid=b.id;

已解释。

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   482   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID">1 AND "ID"<6)
   3 - filter("EMP"."DEPTID">1 AND "EMP"."DEPTID"<6 AND
              INSTR("NAME",U'AK')=1)

Note
-----
   - dynamic sampling used for this statement (level=2)

已选择22行。

从Cost上看，少了1，是聊胜于无的改善。

再换一种, 用between and 取代<>, 注意between and是包括上下边界的。

select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id between 2 and 5 ) b where a.deptid=b.id;

执行情况：

SQL> explain plan for select a.name,b.name from (select * from emp where instr(name,'AK')=1) a ,(select * from dept where id between 2 and 5 ) b where a.deptid=b.id;

已解释。

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 615168685

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  1 |  HASH JOIN         |      |    45 |  3150 |   485   (2)| 00:00:06 |
|*  2 |   TABLE ACCESS FULL| DEPT |     4 |   140 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMP  |   111 |  3885 |   482   (2)| 00:00:06 |
---------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMP"."DEPTID"="DEPT"."ID")
   2 - filter("ID">=2 AND "ID"<=5)
   3 - filter("EMP"."DEPTID">=2 AND "EMP"."DEPTID"<=5 AND
              INSTR("NAME",U'AK')=1)

Note
-----
   - dynamic sampling used for this statement (level=2)

已选择22行。

从Cost上看，这个方案比大于小于方案没有改进，还是485.

就这个简单例子而言，CBO的优化已经做到了极限，以致于改进的空间都极其有限了。

END