187. Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
题目含义:找出给出字符串里面,连续10个字母出现多次的串。

 1     public List<String> findRepeatedDnaSequences(String s) {
 2 //        http://www.cnblogs.com/grandyang/p/4284205.html
 3 //        A: 0100 0001  C: 0100 0011  G: 0100 0111  T: 0101 0100,目的是利用位来区分字符,当然是越少位越好,通过观察发现,每个字符的后三位都不相同,
 4 //        故而我们可以用末尾三位来区分这四个字符。而题目要求是10个字符长度的串,每个字符用三位来区分,10个字符需要30位,在32位机上也OK。
 5 //        为了提取出后30位,我们还需要用个mask,取值为0x7ffffff,用此mask可取出后27位,再向左平移三位即可。
 6 //        算法的思想是,当取出第十个字符时,将其存在Set里,之后每向左移三位替换一个字符,查找新cur在set中是否出现,
 7 //        如果之前刚好出现过一次,则将当前字符串(从i-10到i)存入返回值的数组,如果从未出现过,则将cur添加到set中。
 8         Set<String> res = new HashSet<>();
 9         if (s.length()<=10) return new ArrayList<>();
10         int mask = 0x7ffffff;
11         Set<Integer> set = new HashSet<>();
12         int cur=0,i=0;
13         while (i<9)
14         {
15             cur = (cur<<3)|(s.charAt(i++)&7);
16         }
17         while (i<s.length())
18         {
19             cur = ((cur & mask) << 3) | (s.charAt(i++) & 7);
20             if (set.contains(cur))
21             {
22                 res.add(s.substring(i-10,i));
23             }
24             else set.add(cur);
25         }
26         return new ArrayList<>(res) ;     
27     }
原文地址:https://www.cnblogs.com/wzj4858/p/7722618.html