求一批字符串里出现最多次数的子串

在前作 https://www.cnblogs.com/xiandedanteng/p/12713057.html 里,探讨了相同字符串前缀出现次数最多的问题,其实这是求多字符串子串问题的特例。

本篇准备直接解决求多字符串共同子串的问题,不管子串是出现在开头抑或中间还是结尾:

这批字符串存储在表tb_string里,具体就是将其中出现最多的子串按出现次数降序排列。

tb_string表结构:

create table tb_string(
    id number(4,0) primary key,
    name nvarchar2(20) not null)

测试数据:

insert into tb_string(id,name) values('1','asd_pdf_df3');
insert into tb_string(id,name) values('2','star_asd');
insert into tb_string(id,name) values('3','starasd');
insert into tb_string(id,name) values('4','pdfdf3asd');
insert into tb_string(id,name) values('5','startpdf');

 目测一下,貌似df出现5次,pdf出现3次,asd出现四次...这是出现频率较高的子串.

解决方法:

将取到的字符串用substr函数从头到尾截取子串,比如字符串abc,截取的子串将是a,b,c,ab,bc,abc,一共六个;再比如asdf,截取的子串将是a,s,d,f,as,sd,df,asd,sdf,asdf,一共十个.

然后,将子串统计排序就好。

为了截取子串,需要下表辅助:

create table tb_numsery(
    id number(4,0) primary key,
    startIdx integer not null,
    span integer not null)

其中值:

insert into tb_numsery(id,startIdx,span) values('1','1','1');
insert into tb_numsery(id,startIdx,span) values('2','2','1');
insert into tb_numsery(id,startIdx,span) values('3','3','1');
insert into tb_numsery(id,startIdx,span) values('4','4','1');
insert into tb_numsery(id,startIdx,span) values('5','5','1');
insert into tb_numsery(id,startIdx,span) values('6','6','1');
insert into tb_numsery(id,startIdx,span) values('7','7','1');
insert into tb_numsery(id,startIdx,span) values('8','8','1');
insert into tb_numsery(id,startIdx,span) values('9','9','1');
insert into tb_numsery(id,startIdx,span) values('10','10','1');
insert into tb_numsery(id,startIdx,span) values('11','1','2');
insert into tb_numsery(id,startIdx,span) values('12','2','2');
insert into tb_numsery(id,startIdx,span) values('13','3','2');
insert into tb_numsery(id,startIdx,span) values('14','4','2');
insert into tb_numsery(id,startIdx,span) values('15','5','2');
insert into tb_numsery(id,startIdx,span) values('16','6','2');
insert into tb_numsery(id,startIdx,span) values('17','7','2');
insert into tb_numsery(id,startIdx,span) values('18','8','2');
insert into tb_numsery(id,startIdx,span) values('19','9','2');
insert into tb_numsery(id,startIdx,span) values('20','1','3');
insert into tb_numsery(id,startIdx,span) values('21','2','3');
insert into tb_numsery(id,startIdx,span) values('22','3','3');
insert into tb_numsery(id,startIdx,span) values('23','4','3');
insert into tb_numsery(id,startIdx,span) values('24','5','3');
insert into tb_numsery(id,startIdx,span) values('25','6','3');
insert into tb_numsery(id,startIdx,span) values('26','7','3');
insert into tb_numsery(id,startIdx,span) values('27','8','3');
insert into tb_numsery(id,startIdx,span) values('28','1','4');
insert into tb_numsery(id,startIdx,span) values('29','2','4');
insert into tb_numsery(id,startIdx,span) values('30','3','4');
insert into tb_numsery(id,startIdx,span) values('31','4','4');
insert into tb_numsery(id,startIdx,span) values('32','5','4');
insert into tb_numsery(id,startIdx,span) values('33','6','4');
insert into tb_numsery(id,startIdx,span) values('34','7','4');
insert into tb_numsery(id,startIdx,span) values('35','1','5');
insert into tb_numsery(id,startIdx,span) values('36','2','5');
insert into tb_numsery(id,startIdx,span) values('37','3','5');
insert into tb_numsery(id,startIdx,span) values('38','4','5');
insert into tb_numsery(id,startIdx,span) values('39','5','5');
insert into tb_numsery(id,startIdx,span) values('40','6','5');
insert into tb_numsery(id,startIdx,span) values('41','1','6');
insert into tb_numsery(id,startIdx,span) values('42','2','6');
insert into tb_numsery(id,startIdx,span) values('43','3','6');
insert into tb_numsery(id,startIdx,span) values('44','4','6');
insert into tb_numsery(id,startIdx,span) values('45','5','6');
insert into tb_numsery(id,startIdx,span) values('46','1','7');
insert into tb_numsery(id,startIdx,span) values('47','2','7');
insert into tb_numsery(id,startIdx,span) values('48','3','7');
insert into tb_numsery(id,startIdx,span) values('49','4','7');
insert into tb_numsery(id,startIdx,span) values('50','1','8');
insert into tb_numsery(id,startIdx,span) values('51','2','8');
insert into tb_numsery(id,startIdx,span) values('52','3','8');
insert into tb_numsery(id,startIdx,span) values('53','1','9');
insert into tb_numsery(id,startIdx,span) values('54','2','9');
insert into tb_numsery(id,startIdx,span) values('55','1','10');

这些值里,第一列用来做id,第二列用来指定substr的第二参数起始点,第三列用来指定截取的宽度。用这个表辅助从名称里截取子串方便多了,不像原方案一层层剥去有点费劲.

仔细观察这些值,你会发现它与上面从abc,abcd截取子串的过程暗合,其中道理请自行揣摩。

这些值可以由下面的程序得到:

package sery;

public class SeryMaker {
    public static void main(String[] args) {
        int n=10;
        int level=1;
        int index=1;
        for(int i=n;i>0;i--) {
            for(int j=1;j<=i;j++) {
                System.out.println(getInsert(index,j,level));
                index++;
            }
            
            level++;
        }
    }
    
    private static String getInsert(int index,int j,int level) {
        return String.format("insert into tb_numsery(id,startIdx,span) values('%d','%d','%d');",index,j,level);
    }
}

有了tb_numsery表的帮助,最终的SQL语句就出来了:

select d.* from
(
    select c.child,count(*) as cnt from
    (
        select substr(a.name,b.startIdx,b.span) as child from
        tb_string a,tb_numsery b
        where b.startIdx<=length(a.name)
        and b.span<=length(a.name)
        and b.startIdx+b.span<=length(a.name)+1
    ) c
    group by c.child
) d
where d.cnt>1 and length(d.child)>1
order by d.cnt desc,d.child

解读:最内层让两表做笛卡儿积,目的是利用tb_numsery表里的数据截取tb_string表的子串;中层是按子串进行分组,得到出现次数,最外层则是过滤和排序。

最外层where条件对出现次数等于1和子串长度等于1的数据进行了过滤,你也可以根据实际情况进行调整。

结果:

SQL> select d.* from
  2  (
  3  select c.child,count(*) as cnt from
  4  (
  5  select substr(a.name,b.startIdx,b.span) as child from
  6  tb_string a,tb_numsery b
  7  where b.startIdx<=length(a.name)
  8  and b.span<=length(a.name)
  9  and b.startIdx+b.span<=length(a.name)+1
 10  ) c
 11  group by c.child
 12  ) d
 13  where d.cnt>1 and length(d.child)>1
 14  order by d.cnt desc,d.child;

CHILD                                           CNT
---------------------------------------- ----------
df                                                5
as                                                4
asd                                               4
sd                                                4
ar                                                3
pd                                                3
pdf                                               3
st                                                3
sta                                               3
star                                              3
ta                                                3

CHILD                                           CNT
---------------------------------------- ----------
tar                                               3

已选择12行。

目测一下df和as出现次数,和实际情况是符合的。

在我的T440p机器上运行一下,得到结果不到0.01秒,借助oracle的强大性能,可能比一些Java算法要快了。

--2020年4月20日--

原文地址:https://www.cnblogs.com/heyang78/p/12740440.html