SQL Deduplication

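I have recently been working on a data import for my company, writing a program that loads Excel data into a database.

After the import finished I found a lot of duplicate data that now has to be removed. The table holds about 8 million rows; some rows are identical in every field, others only in some fields.

After confirming and analyzing the requirements, the deletion rule is: rows whose mobile phone (mobilePhone), office phone (officePhone), and email (email) are all the same count as duplicates.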
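
Before deleting anything, it can help to check how many rows this rule would remove. The following is a minimal sketch, not the author's code: it uses Python's built-in sqlite3 with an in-memory table, and the `name` column, the sample rows, and the helper names `make_sample_db` and `count_redundant_rows` are assumptions made up for illustration. Only the table name customers_1 and the three key columns come from this post.

```python
import sqlite3


def make_sample_db():
    """Build a tiny in-memory stand-in for customers_1 (sample data is invented)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers_1 ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "mobilePhone TEXT, officePhone TEXT, email TEXT)"
    )
    rows = [
        (1, "Ann", "111", "222", "a@x.com"),
        (2, "Ann B", "111", "222", "a@x.com"),  # same three keys as id 1
        (3, "Bob", "333", "444", "b@x.com"),
        (4, "Ann C", "111", "222", "a@x.com"),  # same three keys as id 1
        (5, "Cid", "111", "222", "c@x.com"),    # email differs: not a duplicate
    ]
    conn.executemany("INSERT INTO customers_1 VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def count_redundant_rows(conn):
    """Count the rows that would go away if one row per key group is kept."""
    sql = """
        SELECT COALESCE(SUM(cnt - 1), 0)
        FROM (
            SELECT COUNT(*) AS cnt
            FROM customers_1
            GROUP BY mobilePhone, officePhone, email
            HAVING COUNT(*) > 1
        ) AS dup_groups
    """
    return conn.execute(sql).fetchone()[0]
```

Row 5 shares both phone numbers with row 1 but has a different email, so under the rule above it is kept.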

At first I deduplicated with either of the following two SQL statements:

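-- keep the row with the largest id in each group
delete from customers_1 where id not in (
select max(id) from customers_1 group by mobilePhone,officePhone,email )

-- keep the row with the smallest id in each group
delete from customers_1 where id not in (
select min(id) from customers_1 group by mobilePhone,officePhone,email )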

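Of the two, the second one (the min(id) version) is slightly faster.

For tables under 1 million rows with about 1/5 of the rows duplicated, these statements take anywhere from a few minutes to a few tens of minutes.

Once the table grows past 3 million rows, however, they often fail to finish even after tens of hours, and sometimes they lock the table and run all night without completing.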
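
The delete-with-NOT-IN approach described above can be tried on a toy table as follows. This is only a sketch of the logic under stated assumptions: it runs on Python's built-in sqlite3 rather than the author's database, so it says nothing about the timings reported here, and the `name` column, the sample rows, and the helper names `build_customers` and `dedupe_with_not_in` are invented for illustration.

```python
import sqlite3


def build_customers():
    """Create an in-memory customers_1 table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers_1 ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "mobilePhone TEXT, officePhone TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers_1 VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Ann", "111", "222", "a@x.com"),
            (2, "Ann B", "111", "222", "a@x.com"),  # duplicate of id 1
            (3, "Bob", "333", "444", "b@x.com"),
            (4, "Ann C", "111", "222", "a@x.com"),  # duplicate of id 1
            (5, "Cid", "111", "222", "c@x.com"),    # only the phones match
        ],
    )
    return conn


def dedupe_with_not_in(conn):
    """Delete every row whose id is not the smallest id of its key group."""
    cur = conn.execute(
        "DELETE FROM customers_1 WHERE id NOT IN ("
        " SELECT MIN(id) FROM customers_1"
        " GROUP BY mobilePhone, officePhone, email)"
    )
    conn.commit()
    return cur.rowcount  # number of rows removed


def remaining_ids(conn):
    """Return the ids still present, in ascending order."""
    return [r[0] for r in conn.execute("SELECT id FROM customers_1 ORDER BY id")]
```

Swapping MIN(id) for MAX(id) keeps the newest row of each group instead of the oldest.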

I had no choice but to look for another workable approach, and today I finally got somewhere:

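-- find one id per unique (mobilePhone, officePhone, email) group and store those ids in the temporary table tmp
select min(id) as mid into tmp from customers_1 group by mobilePhone,officePhone,email

-- select the deduplicated rows and insert them into the table finally
insert into finally select (all columns except id) from customers_1 where id in (select mid from tmp)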

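Efficiency comparison: deduplicating 5 million rows (1/2 of them duplicates) with the delete method took about 4 hours.

Deduplicating 5 million rows (1/2 of them duplicates) with the temporary-table insert took under 10 minutes.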
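
The temporary-table steps above can be sketched on a toy table like this. It is an illustration under assumptions, not a benchmark: it uses Python's built-in sqlite3, which has no SELECT ... INTO, so CREATE TEMP TABLE ... AS stands in for that step. The `name` column, the sample rows, and the helper names `build_source_table` and `dedupe_into_new_table` are invented; the table names tmp and finally and the key columns come from this post.

```python
import sqlite3


def build_source_table():
    """Create an in-memory customers_1 table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers_1 ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "mobilePhone TEXT, officePhone TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers_1 VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Ann", "111", "222", "a@x.com"),
            (2, "Ann B", "111", "222", "a@x.com"),  # duplicate of id 1
            (3, "Bob", "333", "444", "b@x.com"),
            (4, "Ann C", "111", "222", "a@x.com"),  # duplicate of id 1
            (5, "Cid", "111", "222", "c@x.com"),    # only the phones match
        ],
    )
    return conn


def dedupe_into_new_table(conn):
    """Copy one row per key group into "finally"; customers_1 is left untouched."""
    # Step 1: collect the smallest id of every key group.
    conn.execute(
        "CREATE TEMP TABLE tmp AS "
        "SELECT MIN(id) AS mid FROM customers_1 "
        "GROUP BY mobilePhone, officePhone, email"
    )
    # Step 2: create the target table without the id column.
    conn.execute(
        'CREATE TABLE "finally" ('
        "name TEXT, mobilePhone TEXT, officePhone TEXT, email TEXT)"
    )
    # Step 3: copy only the rows whose id was collected in step 1.
    conn.execute(
        'INSERT INTO "finally" '
        "SELECT name, mobilePhone, officePhone, email "
        "FROM customers_1 WHERE id IN (SELECT mid FROM tmp)"
    )
    conn.commit()
    return conn.execute('SELECT COUNT(*) FROM "finally"').fetchone()[0]
```

Because nothing is deleted from customers_1, the original rows remain available for checking the result before the old table is dropped.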

Following advice from fellow community member @牛奶哥, I then deduplicated with row_number() over(partition by ...):

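-- number the rows inside each (mobilePhone, officePhone, email) group, smallest id first
with test as
(
select ROW_NUMBER() over(partition by mobilePhone,officePhone,email order by id) as num, *
from customers_1
)
-- keep row number 1 of each group and delete the rest
delete from test
where num != 1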

Test result: 8 million rows, 350,000 of them duplicates, took 4 minutes.
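
The row-numbering idea can also be tried on a toy table. This sketch is an adaptation under assumptions rather than the statement above: it uses Python's built-in sqlite3, which (as far as I know) cannot delete through a CTE, so the numbered rows are selected in a subquery and deleted by id instead. It needs an SQLite build with window-function support (3.25 or later). The `name` column, the sample rows, and the helper names `build_demo_table` and `dedupe_with_row_number` are invented for illustration.

```python
import sqlite3


def build_demo_table():
    """Create an in-memory customers_1 table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers_1 ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "mobilePhone TEXT, officePhone TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers_1 VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Ann", "111", "222", "a@x.com"),
            (2, "Ann B", "111", "222", "a@x.com"),  # duplicate of id 1
            (3, "Bob", "333", "444", "b@x.com"),
            (4, "Ann C", "111", "222", "a@x.com"),  # duplicate of id 1
            (5, "Cid", "111", "222", "c@x.com"),    # only the phones match
        ],
    )
    return conn


def dedupe_with_row_number(conn):
    """Delete every row that is not numbered 1 within its key group."""
    cur = conn.execute(
        """
        DELETE FROM customers_1
        WHERE id IN (
            SELECT id
            FROM (
                SELECT id,
                       ROW_NUMBER() OVER (
                           PARTITION BY mobilePhone, officePhone, email
                           ORDER BY id
                       ) AS num
                FROM customers_1
            ) AS ranked
            WHERE num > 1
        )
        """
    )
    conn.commit()
    return cur.rowcount  # number of rows removed
```

Ordering by id inside each partition means the row with the smallest id is numbered 1 and survives, matching the min(id) variants earlier in this post.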