sql中的exists

刚开始工作的开发，很容易过度的使用in、not in。其实，在有些时候可以用exists、not exists来代替in和not in，实现查询性能的提升。

exists操作符时候会和in操作符产生混淆。因为他们都是用于检查某个属性是否包含在某个集合中。但是相同点也仅限于此。

exists的真正目的是检查子查询是否至少包含一条记录。例如，下面的查询会返回行1和2：

WITH numbers (nr) AS (
  SELECT 1 AS nr UNION ALL
  SELECT 2 AS nr UNION ALL
  SELECT 3 AS nr
), letters (letter, nr) AS (
  SELECT 'A' AS letter, 1 AS nr UNION ALL 
  SELECT 'B' AS letter, 2 AS nr 
)
SELECT * FROM numbers n WHERE  EXISTS (SELECT nr FROM letters WHERE nr= n.nr);

当然，你也可以改写成in：

WITH numbers (nr) AS (
  SELECT 1 AS nr UNION ALL
  SELECT 2 AS nr UNION ALL
  SELECT 3 AS nr
), letters (letter, nr) AS (
  SELECT 'A' AS letter, 1 AS nr UNION ALL 
  SELECT 'B' AS letter, 2 AS nr 
)
SELECT * FROM numbers n WHERE n.nr IN (SELECT nr FROM letters);

这两种写法，都可以返回相同的记录。区别是exists会更快，因为在得到第一条满足条件的记录之后就会停止，而in会查询所有的记录(如果in返回很多行的话)。但是，要注意另一件事，exists使用的关联子查询，它可以使用left join加上一个NULL检查来替换。

执行计划

下面来看看，示例的执行计划

not in操作符的执行计划：

=# EXPLAIN ANALYZE SELECT * FROM products WHERE id NOT IN (SELECT product_id FROM bought_products);
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on products  (cost=38.25..50.50 rows=90 width=422) (actual time=0.069..0.072 rows=5 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 2
   SubPlan 1
     ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=4) (actual time=0.006..0.009 rows=3 loops=1)
 Planning Time: 0.180 ms
 Execution Time: 0.130 ms
(7 rows)

从执行计划可以看到，因为没有用到任何索引，对表bought_products先走的是全表扫描，然后在参与过滤。

not exists的执行计划：

=# EXPLAIN ANALYZE SELECT * FROM products WHERE NOT EXISTS (SELECT product_id FROM bought_products WHERE id = product_id);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.15..47.80 rows=90 width=422) (actual time=0.038..0.045 rows=5 loops=1)
   ->  Seq Scan on products  (cost=0.00..11.80 rows=180 width=422) (actual time=0.012..0.014 rows=7 loops=1)
   ->  Index Only Scan using bought_products_pkey on bought_products  (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7)
         Index Cond: (product_id = products.id)
         Heap Fetches: 2
 Planning Time: 0.177 ms
 Execution Time: 0.080 ms
(7 rows)

这里走的是，嵌套循环反连接(Nested Loop Anti Join)。

从这两个执行计划来看，最大的区别是执行时间的不同(0.130 ms vs 0.080 ms)，计划时间差不多。

所以，当你有些in查询时，可以看看是否可以使用exists来替换。

再来看看exists的执行计划：

=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products WHERE id = product_id);
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.15..47.80 rows=90 width=422) (actual time=0.030..0.046 rows=2 loops=1)
   ->  Seq Scan on products  (cost=0.00..11.80 rows=180 width=422) (actual time=0.011..0.013 rows=7 loops=1)
   ->  Index Only Scan using bought_products_pkey on bought_products  (cost=0.15..0.61 rows=11 width=4) (actual time=0.003..0.003 rows=0 loops=7)
         Index Cond: (product_id = products.id)
         Heap Fetches: 2
 Planning Time: 0.261 ms
 Execution Time: 0.084 ms
(7 rows)

正如我们所料，执行计划中使用了半连接(SEMI JOIN)。

如果不是关联子查询呢？

=# EXPLAIN ANALYZE SELECT * FROM products WHERE EXISTS (SELECT product_id FROM bought_products);
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.01..11.81 rows=180 width=422) (actual time=0.013..0.015 rows=7 loops=1)
   One-Time Filter: $0
   InitPlan 1 (returns $0)
     ->  Seq Scan on bought_products  (cost=0.00..32.60 rows=2260 width=0) (actual time=0.007..0.007 rows=1 loops=1)
   ->  Seq Scan on products  (cost=0.01..11.81 rows=180 width=422) (actual time=0.002..0.004 rows=7 loops=1)
 Planning Time: 0.077 ms
 Execution Time: 0.036 ms
(7 rows)

这次，子查询使用了顺序扫描，但是计划时间只用了0.077ms。它将只返回1行，因为它是EXISTS的语义。

如果你将这些统计信息和not in进行比较，可能会发现它们都差不多，可能是因为它们都是小的数据集。对于大的数据集，它们的差异会很明显，尤其是not in会返回大量行的场景。(exists只会返回1行记录)

尽管exists和in看起来类似，但是他们是不同的。exists在子查询返回第一条满足条件的记录后就会停止。如果你的查询很慢，将in换成exists可能会提升很多性能。

此外，exists可以用left join加上null检查来取代。