Repeated DNA Sequences

题目:

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

思路:

这个题上来打算用hashmap把所有的子序列都放进去,找相同的。是不是有点天真了。。

查了一下,可以用bit mask来做,恍然大悟。bit manipulation总是可以让代码变得很简洁。

直接看代码吧。也没什么巧妙地地方,只是用两个bit代表一个字母。

A:00 

C:01

G:  10

T:  11

 1     public List<String> findRepeatedDnaSequences(String s) {
 2         List<String> res = new ArrayList<String>();
 3         if (s.length() <= 10) {
 4             return res;
 5         }
 6         HashMap<Character, Integer> map = new HashMap<Character, Integer>();
 7         map.put('A', 0);
 8         map.put('C', 1);
 9         map.put('G', 2);
10         map.put('T', 3);
11         
12         int mask = 0;
13         
14         Set<Integer> subSeq = new HashSet<Integer>();
15         Set<Integer> addedSubSeq = new HashSet<Integer>();
16         
17         for (int i = 0; i < s.length(); i++) {
18             if (i < 9) {
19                 mask = mask << 2;
20                 mask += map.get(s.charAt(i));
21             }else{
22                 mask = mask << 2;
23                 mask += map.get(s.charAt(i));
24                 
25                 //we only need 20 bits for 10 letter
26                 mask = mask << 12;
27                 mask = mask >>> 12;
28                 if (subSeq.contains(mask) && !addedSubSeq.contains(mask)) {
29                     res.add(s.substring(i - 9, i + 1));
30                     addedSubSeq.add(mask);
31                 } else {
32                     subSeq.add(mask);
33                 }
34             }
35         }
36         return res;
37     }
原文地址:https://www.cnblogs.com/gonuts/p/4475203.html