Algorithms Homework HW15: LeetCode 187 Repeated DNA Sequences

Original

2020-02-21 02:04

Description:

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Note:

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
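To make the expected output concrete, here is a minimal baseline sketch that counts each 10-letter window by its text. It is not the method used in the solution below, and the function name repeatedByCounting is hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical baseline: count every 10-letter window by its text.
std::vector<std::string> repeatedByCounting(const std::string& s) {
    const std::size_t kLen = 10;
    std::vector<std::string> result;
    if (s.size() < kLen) return result;

    std::unordered_map<std::string, int> seen;
    for (std::size_t i = 0; i + kLen <= s.size(); ++i) {
        std::string window = s.substr(i, kLen);
        // Report a window exactly once, on its second occurrence.
        if (++seen[window] == 2) result.push_back(window);
    }
    return result;
}
```

For the example string above, this sketch reports "AAAAACCCCC" first and "CCCCCAAAAA" second, since that is the order in which their second occurrences appear.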

Solution:

Analysis and Thinking:

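The problem asks for a function that finds every repeated 10-character substring, that is, every one that occurs more than once in the input. To make the input easier to process, each character is mapped to a binary number: the four characters A, C, G and T, which stand for the four base types, are replaced by 00, 01, 10 and 11. A 10-character substring thus becomes a 20-bit integer, which is easier to handle, and storing integers rather than strings in the containers also reduces space usage to some extent.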
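The mapping can be sketched as follows. The helper names baseCode and encodeWindow are hypothetical, and the sketch assumes the input contains only the letters A, C, G and T.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code for one base (A=00, C=01, G=10, T=11).
std::uint32_t baseCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // assumes the only remaining input is 'T'
    }
}

// Packs a 10-letter window into the low 20 bits of an integer.
std::uint32_t encodeWindow(const std::string& window) {
    assert(window.size() == 10);
    std::uint32_t code = 0;
    for (char base : window) {
        code = (code << 2) | baseCode(base);  // make room, then append two bits
    }
    return code;
}
```

Under this mapping "AAAAACCCCC" becomes 0x155 (five 00 pairs followed by five 01 pairs) and "TTTTTTTTTT" becomes 0xFFFFF, the largest 20-bit value.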

The problem can be solved by building containers of integers and collecting all the answers while scanning the string, as follows:

Steps:

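1. Check whether the input string is at least 10 characters long; if it is shorter, return an empty result.

2. Build a map that stores the mapping from A, C, G and T to their binary codes.

3. Use the substring formed by the first ten bases to construct the initial value of rstTemp (loop ten times, each time shifting rstTemp left by two bits and adding the code of the next base).

4. While the input DNA sequence (the string) has not been fully traversed, slide the window forward by one base. With the hexadecimal mask 0x3ffff, AND rstTemp with the mask to clear its two highest bits (the oldest base), shift the result left by two bits, and add the code of the next base, s[current iteration + 10]. rstTemp then holds the encoding of the next 10-character substring.

5. Define two sets, record and helper. record stores the encoding of every 10-character substring seen so far; helper stores the encodings of substrings that have already been added to the result.

6. Use find() to look up the current encoding in record and then in helper. If it is in record but not yet in helper, append the current 10-character substring to the result container and insert the encoding into helper. If it is not in record, insert it there.

7. Once the whole string has been traversed, return the result container.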
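The window update in step 4 can be sketched in isolation as below. The names codeOf and slideWindow are hypothetical, and the sketch assumes the code passed in already holds a full 20-bit window.

```cpp
#include <cstdint>

// Hypothetical helper: 2-bit code for one base (A=00, C=01, G=10, T=11).
std::uint32_t codeOf(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // assumes the only remaining input is 'T'
    }
}

// Drops the oldest base of a 20-bit window and appends the next base.
std::uint32_t slideWindow(std::uint32_t code, char next) {
    const std::uint32_t kEraser = 0x3ffff;  // keeps the low 18 bits
    // Mask first so the shifted value never grows beyond 20 bits.
    return ((code & kEraser) << 2) | codeOf(next);
}
```

For instance, sliding 0x155 (the code of "AAAAACCCCC") over a following 'C' gives 0x555, the code of "AAAACCCCCC". Masking before the shift is one possible order; shifting first and masking with 0xfffff would give the same result.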

Codes:

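#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>
using namespace std;

class Solution
{
    // Mask that keeps the low 18 bits, dropping the oldest base of a 20-bit window.
    static constexpr unsigned int eraser = 0x3ffff;
public:
    vector<string> findRepeatedDnaSequences(string s)
    {
        vector<string> result;
        unsigned int rstTemp = 0;

        if(s.size()<10)
            return result;

        set<unsigned int> record;   // encodings of all windows seen so far
        set<unsigned int> helper;   // encodings already added to result
        map<char,int> mapper{{'A',0},{'C',1},{'G',2},{'T',3}};

        // Encode the first ten bases, two bits per base.
        for(int i=0;i!=10;++i)
            rstTemp=(rstTemp<<2)+mapper[s[i]];
        record.insert(rstTemp);

        // Slide the window one base at a time; i is the index of its last base.
        for(size_t i=10;i!=s.size();++i)
        {
            rstTemp=((rstTemp&eraser)<<2)+mapper[s[i]];
            set<unsigned int>::iterator j=record.find(rstTemp);

            if(j!=record.end()){
                // Seen before: report it only the first time it repeats.
                j=helper.find(rstTemp);
                if(j==helper.end()){
                    result.push_back(string(s,i-9,10));
                    helper.insert(rstTemp);
                }
            }
            else{
                record.insert(rstTemp);
            }
        }
        return result;
    }
};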

Results: