Description:
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Note:
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Solution:
Analysis and Thinking:
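The problem asks for a function that finds every 10-letter substring that repeats, that is, occurs more than once in the input. To make the input easier to process, each character is mapped to a two-bit number: the four characters A, C, G, and T, which stand for the different bases, are replaced by 00, 01, 10, and 11. A substring of length 10 then becomes a 20-bit integer, which is easier to handle, and storing integers instead of strings also reduces the space cost somewhat.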
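As a minimal sketch of the character-to-bits mapping described above, the helpers below pack ten bases into the low 20 bits of an integer. The names `baseCode` and `encodeWindow` are hypothetical and do not appear in the solution; the sketch also assumes the input contains only A, C, G, and T.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: maps one base to its 2-bit code.
inline std::uint32_t baseCode(char c) {
    switch (c) {
        case 'A': return 0;  // 00
        case 'C': return 1;  // 01
        case 'G': return 2;  // 10
        default:  return 3;  // 11, assuming the only other input is 'T'
    }
}

// Hypothetical helper: packs the 10 bases starting at `start` into the
// low 20 bits of an integer. Assumes start + 10 <= s.size().
inline std::uint32_t encodeWindow(const std::string& s, std::size_t start) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 10; ++i) {
        value = (value << 2) | baseCode(s[start + i]);  // append two bits per base
    }
    return value;
}
```

For instance, `encodeWindow("AAAAACCCCC", 0)` yields `0x155`: five `00` pairs followed by five `01` pairs.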
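The problem can be solved by building a container of integers that records the encoded substrings as they are seen and collecting the answers along the way, as follows: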
Steps:
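1. Check the length of the input string; if it is shorter than 10, return an empty result.
2. Build a map that stores the mapping from A, C, G, T to their binary codes.
3. Use the substring formed by the first ten bases to construct the initial value of rstTemp (loop ten times; on each pass shift rstTemp left by two bits and add the code of the current base).
4. While the input DNA sequence (the string) has not been fully traversed, use the hexadecimal mask 0x3ffff: on each pass, AND rstTemp with the mask to clear its highest two bits (the oldest base), shift the result left by two bits, and add the code of the next base, so that rstTemp holds the binary value of the current 10-base window.
5. Define two sets, record and helper: record stores every encoded window seen so far, and helper stores the windows that have already been added to the result.
6. Use find() to look up the current window in record and then in helper. If it is in record but not yet in helper, add the current substring to the result container and insert the window into helper; if it is not in record, insert it into record.
7. Once the whole string has been traversed, return the result container.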
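The mask-and-shift update in steps 3 and 4 can be sketched on its own as follows. This is an illustration under the assumption that the input contains only A, C, G, and T; the names `codeOf`, `slideWindow`, and `allWindowCodes` are hypothetical and are not part of the solution.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: 2-bit code per base, assuming only A, C, G, T appear.
inline std::uint32_t codeOf(char c) {
    return c == 'A' ? 0u : c == 'C' ? 1u : c == 'G' ? 2u : 3u;
}

// Drops the oldest base from a 20-bit window and appends `next`.
inline std::uint32_t slideWindow(std::uint32_t window, char next) {
    const std::uint32_t kEraser = 0x3ffff;  // low 18 bits = the 9 newest bases
    return ((window & kEraser) << 2) | codeOf(next);
}

// Returns the 20-bit code of every 10-base window of s, in order.
std::vector<std::uint32_t> allWindowCodes(const std::string& s) {
    std::vector<std::uint32_t> codes;
    if (s.size() < 10) return codes;  // no complete window
    std::uint32_t window = 0;
    for (std::size_t i = 0; i < 10; ++i) {
        window = (window << 2) | codeOf(s[i]);  // initial window, as in step 3
    }
    codes.push_back(window);
    for (std::size_t i = 10; i < s.size(); ++i) {
        window = slideWindow(window, s[i]);  // rolling update, as in step 4
        codes.push_back(window);
    }
    return codes;
}
```

Because the mask keeps only 18 bits before the shift, each update costs a constant number of operations and the value never grows beyond 20 bits, however long the input is.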
Code:
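#include <map>
#include <set>
#include <string>
#include <vector>
using namespace std;

class Solution
{
#define eraser 0x3ffff  // low 18 bits: keeps the 9 most recent bases
public:
    vector<string> findRepeatedDnaSequences(string s)
    {
        vector<string> result;
        unsigned int rstTemp = 0;
        if (s.size() < 10)
            return result;
        set<unsigned int> record;  // every encoded window seen so far
        set<unsigned int> helper;  // windows already added to result
        map<char, int> mapper{{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
        // Encode the first ten bases into a 20-bit value.
        for (int i = 0; i != 10; ++i)
            rstTemp = (rstTemp << 2) + mapper[s[i]];
        record.insert(rstTemp);
        // Slide the window one base at a time; i is the index of its last base.
        for (size_t i = 10; i != s.size(); ++i)
        {
            // Drop the oldest base, then append the new one.
            rstTemp = ((rstTemp & eraser) << 2) + mapper[s[i]];
            set<unsigned int>::iterator j = record.find(rstTemp);
            if (j != record.end()) {
                // Seen before: report it only the first time it repeats.
                j = helper.find(rstTemp);
                if (j == helper.end()) {
                    result.push_back(string(s, i - 9, 10));
                    helper.insert(rstTemp);
                }
            }
            else {
                record.insert(rstTemp);
            }
        }
        return result;
    }
};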
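For cross-checking, a simpler baseline can count the substrings directly with a hash map. This is a swapped-in technique and not the bit-encoded method described in this post; it stores whole strings and so uses more memory. The function name `findRepeatedBaseline` is hypothetical.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Baseline sketch: counts every 10-letter window as a string and reports
// a window the moment it is seen for the second time.
std::vector<std::string> findRepeatedBaseline(const std::string& s) {
    std::vector<std::string> result;
    std::unordered_map<std::string, int> seen;  // window -> occurrences so far
    for (std::size_t i = 0; i + 10 <= s.size(); ++i) {
        std::string window = s.substr(i, 10);
        if (++seen[window] == 2) {  // exactly the second occurrence
            result.push_back(window);
        }
    }
    return result;
}
```

On the example input "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" this baseline returns "AAAAACCCCC" and "CCCCCAAAAA", which matches the expected answer in the problem statement.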
Results: