Repeated DNA Sequences 【9ms未解决】

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

思路：1.用map来存储字符序列。2.检查序列是否已经存在在map中。如果存在且count=1，就将序列添加到结果中

注意：map<string，int> 会造成 memory limits exceed
详细见：C++中基本类型的字节数。

解决1，将A,C,G,T替换成0,1,2,3. 注意：最长为3333333333> 2³¹-1 (2147483647)  所以不能用int,会溢出。
      选择map<unsigned int, bool>  (因为bool为1个字节，int为4个字节，节省空间)

解法2，用 bit manipulation. 观察A,C,G,T 的ascii码
      ASCII    后三位，   倒数后两位
      A：65，   001        00
      C：67，   011        01
      G：71，   111        11
      T：84，   100        10
（1）标示符取后三位，则长度为10的字符串，可以用30bit来表示。    t<<3 & 0x3FFFFFFF | s[i]&7
（2）标示符取倒数后两位，则长度为10的字符串，可以用20bit来表示。 t<<3 & 0xFFFFF | s[i]>>1&0x3
 因此，采用map<int,bool>

解法3，不采用map来查找，而是用另一种方法。leetcode上一位大牛写了9ms的方法！！！

总结：1.仔细观察特征 2.时时刻刻想着去节省空间。

这是我的解法2

class Solution {
public:
      vector<string> findRepeatedDnaSequences(string s) {
       //check validation
       vector<string> res;
       if(s.empty()) return res;
       //check special case
       int n=s.length();
       if(n<10) return res;
       
       //general case: to reduce the memory
       unordered_map<int,bool> map;
       int t=0;
       for(int i=0;i<n;i++){
           t = (t<<2) & 0xFFFFF|(s[i]>>1 & 0x3);
           if(i>=9){
               if(map.count(t)){
                   if(map[t]==true){
                       res.push_back(s.substr(i-9,10));
                       map[t]=false;
                   }
               }else map[t]=true;
           }
       }
       return res;
    }
};