PostgreSQL中的not in、not exists、left join/is null

哪种方法是找出在一张表中，而不在另一张表中的记录的最佳方法呢？

SELECT  l.*
FROM    t_left l
LEFT JOIN
        t_right r
ON      r.value = l.value
WHERE   r.value IS NULL;

SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right r
        );

SELECT l.*
FROM  t_left l
WHERE  NOT EXISTS
    (
    SELECT NULL
    FROM  t_right r
    WHERE  r.value = l.value
    );

我们先把环境准备一下：

postgres 11.9

CREATE TABLE t_left (
    id INT NOT NULL PRIMARY KEY,
    value INT NOT NULL,
    stuffing VARCHAR(200) NOT NULL
);
CREATE TABLE t_right (
    id INT NOT NULL PRIMARY KEY,
    value INT NOT NULL,
    stuffing VARCHAR(200) NOT NULL
);

CREATE INDEX ix_left_value ON t_left (value);
CREATE INDEX ix_right_value ON t_right (value);

INSERT
INTO  t_left
SELECT s, s % 10000, RPAD('Value '|| s || ' ', 200, '*')
FROM generate_series(1, 100000) s;

INSERT
INTO  t_right
SELECT s, s % 10000 + 1, RPAD('Value '|| s || ' ', 200, '*')
FROM  generate_series(1, 1000000) s;

ANALYZE t_left;
ANALYZE t_right;

表t_left包含100000行记录，其中有1000个不同的记录；

只是测试

表t_right包含100000行记录，其中有1000个不同的记录。

t_left表有10条记录不存在与t_right表中。

NOT IN

先看看not in

postgres=# SELECT  l.id, l.value
postgres-# FROM    t_left l
postgres-# WHERE   value NOT IN
postgres-#         (
postgres(#         SELECT  value
postgres(#         FROM    t_right r
postgres(#         );
   id   | value 
--------+-------
  20000 |     0
  10000 |     0
  60000 |     0
  30000 |     0
  40000 |     0
  50000 |     0
  70000 |     0
  80000 |     0
  90000 |     0
 100000 |     0
(10 rows)

postgres=# explain analyze SELECT  l.id, l.value
postgres-# FROM    t_left l
postgres-# WHERE   value NOT IN
postgres-#         (
postgres(#         SELECT  value
postgres(#         FROM    t_right r
postgres(#         );
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1520921531.00 rows=50000 width=8) (actual time=7329.726..40554.033 rows=10 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Parallel Seq Scan on t_left l  (cost=0.00..1520915531.00 rows=29412 width=8) (actual time=5506.368..40405.539 rows=5 loops=2)
         Filter: (NOT (SubPlan 1))
         Rows Removed by Filter: 49995
         SubPlan 1
           ->  Materialize  (cost=0.00..49211.00 rows=1000000 width=4) (actual time=0.001..0.534 rows=4084 loops=100000)
                 ->  Seq Scan on t_right r  (cost=0.00..40304.00 rows=1000000 width=4) (actual time=0.002..0.575 rows=3739 loops=37433)
 Planning Time: 0.155 ms
 Execution Time: 40557.112 ms
(11 rows)

postgres=#

这里之所以先讨论not in，是因为其对待null的方式特殊。

不幸的是，postgresql优化器不能利用到t_right.value被定义为not null的事实。因此，不可以返回null值。(即not in不能返回null值)

这里可以做一个小测试：

postgres=# create table aa(id int,age int);
postgres=# insert into aa values(1,1);
postgres=# insert into aa values(2,2);
postgres=# insert into aa(id) values(3);
postgres=# select * from aa where age not in (1,2);
 id | age 
----+-----
(0 rows)

postgres=# select * from aa where age not in (1);
 id | age 
----+-----
  2 |   2
(1 row)

postgres=#

这就是为什么postgresql对not in查询使用一个特殊的访问方法。

NOT EXISTS

postgres=# SELECT  l.id, l.value
postgres-# FROM    t_left l
postgres-# WHERE   NOT EXISTS
postgres-#         (
postgres(#         SELECT  select
postgres(#         FROM    t_right r
postgres(#         WHERE   r.value = l.value
postgres(#         );
   id   | value 
--------+-------
  10000 |     0
  30000 |     0
  20000 |     0
  40000 |     0
  60000 |     0
  50000 |     0
  80000 |     0
  70000 |     0
  90000 |     0
 100000 |     0
(10 rows)

postgres=# explain analyze SELECT  l.id, l.value
postgres-# FROM    t_left l
postgres-# WHERE   NOT EXISTS
postgres-#         (
postgres(#         SELECT  value
postgres(#         FROM    t_right r
postgres(#         WHERE   r.value = l.value
postgres(#         );
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..34410.22 rows=10 width=8) (actual time=31.555..262.307 rows=10 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Nested Loop Anti Join  (cost=0.42..33409.22 rows=6 width=8) (actual time=43.314..257.792 rows=5 loops=2)
         ->  Parallel Seq Scan on t_left l  (cost=0.00..3619.24 rows=58824 width=8) (actual time=0.007..10.066 rows=50000 loops=2)
         ->  Index Only Scan using ix_right_value on t_right r  (cost=0.42..3.53 rows=100 width=4) (actual time=0.005..0.005 rows=1 loops=100000)
               Index Cond: (value = l.value)
               Heap Fetches: 99990
 Planning Time: 0.328 ms
 Execution Time: 262.348 ms
(10 rows)

postgres=#

LEFT JOIN / IS NULL

postgres=# SELECT  l.id, l.value
postgres-# FROM    t_left l
postgres-# WHERE   NOT EXISTS
postgres-#         (
postgres(#         SELECT  value
postgres(#         FROM    t_right r
postgres(#         WHERE   r.value = l.value
postgres(#         );
   id   | value 
--------+-------
  80000 |     0
  10000 |     0
  20000 |     0
  30000 |     0
  40000 |     0
  50000 |     0
  60000 |     0
  70000 |     0
  90000 |     0
 100000 |     0
(10 rows)

postgres=# explain analyze SELECT  l.id, l.value
postgres-# FROM    t_left l
postgres-# WHERE   NOT EXISTS
postgres-#         (
postgres(#         SELECT  value
postgres(#         FROM    t_right r
postgres(#         WHERE   r.value = l.value
postgres(#         );
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.42..34410.22 rows=10 width=8) (actual time=32.109..259.553 rows=10 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Nested Loop Anti Join  (cost=0.42..33409.22 rows=6 width=8) (actual time=43.744..255.254 rows=5 loops=2)
         ->  Parallel Seq Scan on t_left l  (cost=0.00..3619.24 rows=58824 width=8) (actual time=0.008..9.777 rows=50000 loops=2)
         ->  Index Only Scan using ix_right_value on t_right r  (cost=0.42..3.53 rows=100 width=4) (actual time=0.005..0.005 rows=1 loops=100000)
               Index Cond: (value = l.value)
               Heap Fetches: 99990
 Planning Time: 0.396 ms
 Execution Time: 259.608 ms
(10 rows)

postgres=#

postgresql按照相同的方式对待left join和not exists，使用相同的执行计划(nested loop anti join)。

至于NOT IN，这在语义上是不同的, PostgreSQL试图考虑这一点，并限制自己对子计划使用过滤器。