**Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

解法:

考虑到只有4种字母,ACGT,固定10位字符,所以排列组合数是一定的,仅有4^10 = 1048576种可能。因此,可以让10位字符序列做一个Hash。

A = 00
C = 01
G = 10
T = 11

将他们拼接起来,变成一个数value

如: AAACCC = 00 00 00 01 01 01 (二进制) = 21 (十进制)

然后遍历整个序列,每10个字符生成一个value。因为只需要判断10个字符的值,每个字符占2位,所以我们只需要20位的bit.

code中sum*4相当于左移2位。

/**
 * 本代码由九章算法编辑提供。没有版权欢迎转发。
 * - 九章算法致力于帮助更多中国人找到好的工作,教师团队均来自硅谷和国内的一线大公司在职工程师。
 * - 现有的面试培训课程包括:九章算法班,系统设计班,BAT国内班
 * - 更多详情请见官方网站:http://www.jiuzhang.com/
 */

public class Solution {
    public int encode(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == 'A') {
                sum = sum * 4;
            } else if (s.charAt(i) == 'C') {
                sum = sum * 4 + 1;
            } else if (s.charAt(i) == 'G') {
                sum = sum * 4 + 2;
            } else {
                sum = sum * 4 + 3;
            }
        }
        return sum;
    }
    public List<String> findRepeatedDnaSequences(String s) {
        HashSet<Integer> hash = new HashSet<Integer>();
        HashSet<String> dna = new HashSet<String>();
        for (int i = 9; i < s.length(); i++) {
            String subString = s.substring(i - 9, i + 1);
            int encoded = encode(subString);
            if (hash.contains(encoded)) {
                dna.add(subString);
            } else {
                hash.add(encoded);
            }
        }
        List<String> result = new ArrayList<String>();
        for (String d: dna) {
            result.add(d);
        }
        return result;
    }
}
原文地址:https://www.cnblogs.com/hygeia/p/5093166.html