Getting and Cleaning Data: Week 2

2.1 Reading data from MySQL

First, a brief introduction to MySQL.



Connecting and listing databases

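library(RMySQL)  # provides MySQL() and attaches the DBI functions used below
ucscDb <- dbConnect(MySQL(),user="genome", 
                    host="genome-mysql.cse.ucsc.edu")
# list the databases on the server, then close the connection
result <- dbGetQuery(ucscDb,"show databases;"); dbDisconnect(ucscDb);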
[1] TRUE
result
              Database
1   information_schema
2              ailMel1
3              allMis1
4              anoCar1
5              anoCar2
6              anoGam1
7              apiMel1
8              apiMel2
9              aplCal1
10             bosTau2
11             bosTau3
12             bosTau4
13             bosTau5
14             bosTau6
15             bosTau7
16           bosTauMd3
17             braFlo1
18             caeJap1
19              caePb1
20              caePb2

Connecting to hg19 and listing tables

hg19 <- dbConnect(MySQL(),user="genome", db="hg19",
                    host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)
affyU133Plus2 is a microarray, a measurement technology used to measure something about the genome.

Here the name refers to the corresponding table in hg19.

Get dimensions of a specific table

dbListFields(hg19,"affyU133Plus2")
 [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"      "qNumInsert" 
 [7] "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"      "qName"       "qSize"      
[13] "qStart"      "qEnd"        "tName"       "tSize"       "tStart"      "tEnd"       
[19] "blockCount"  "blockSizes"  "qStarts"     "tStarts"    
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
  count(*)
1    58463

Read from the table

affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
  bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert tBaseInsert strand
1 585     530          4          0     23          3          41          3         898      -
2 585    3355         17          0    109          9          67          9       11621      -
3 585    4156         14          0     83         16          18          2          93      -
4 585    4667          9          0     68         21          42          3        5743      -
5 585    5180         14          0    167         10          38          1          29      -
6 585     468          5          0     14          0           0          0           0      -
         qName qSize qStart qEnd tName     tSize tStart  tEnd blockCount
1  225995_x_at   637      5  603  chr1 249250621  14361 15816          5
2  225035_x_at  3635      0 3548  chr1 249250621  14381 29483         17
3  226340_x_at  4318      3 4274  chr1 249250621  14399 18745         18
4 1557034_s_at  4834     48 4834  chr1 249250621  14406 24893         23
5    231811_at  5399      0 5399  chr1 249250621  19688 25078         11
6    236841_at   487      0  487  chr1 249250621  27542 28029          1
                                                                  blockSizes
1                                                          93,144,229,70,21,
2              73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
3                 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
5                                       131,26,1300,6,4,11,4,7,358,3359,155,
6                                                                       487,
                                                                                                 qStarts
1                                                                                    34,132,278,541,611,
2                        87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
3                   44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
5                                                     0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
6                                                                                                     0,
                                                                                                                                     tStarts
1                                                                                                             14361,14454,14599,14968,15795,
2                                     14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
3                               14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
5                                                                         19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
6                                                                                                                        

Select a specific subset

query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)
  0%  25%  50%  75% 100% 
   1    1    2    2    3 
affyMisSmall <- fetch(query,n=10); dbClearResult(query);
[1] TRUE
dim(affyMisSmall)
[1] 10 22
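The send / fetch / clear pattern above can be wrapped in a small function so the row limit stays explicit and the result set is always released. This is only a sketch: `fetch_mismatch_subset` is a hypothetical helper of mine, not part of RMySQL or DBI, and it assumes a table named affyU133Plus2 with a misMatches column as in the transcript.

```r
library(DBI)

# Hypothetical helper: return at most `n` rows whose misMatches value
# lies between `lo` and `hi` (inclusive).
fetch_mismatch_subset <- function(con, lo, hi, n = 10) {
  sql <- sprintf(
    "select * from affyU133Plus2 where misMatches between %d and %d",
    as.integer(lo), as.integer(hi)
  )
  res <- dbSendQuery(con, sql)
  # Release the result set even if fetching fails or stops early.
  on.exit(dbClearResult(res), add = TRUE)
  dbFetch(res, n = n)
}

# Usage sketch against the server from these notes (needs RMySQL and network access):
# hg19 <- dbConnect(RMySQL::MySQL(), user = "genome", db = "hg19",
#                   host = "genome-mysql.cse.ucsc.edu")
# affyMisSmall <- fetch_mismatch_subset(hg19, 1, 3, n = 10)
# dbDisconnect(hg19)
```

Because the helper only uses generic DBI calls, it can also be tried offline against any DBI backend that has a table with the same name.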

Don't forget to close the connection!

dbDisconnect(hg19)
[1] TRUE
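Since it is easy to forget the disconnect, one option is to let R do it automatically. A minimal sketch, assuming a hypothetical wrapper `with_connection` (my own name, not a DBI function) that opens the connection, runs a function on it, and disconnects on exit:

```r
library(DBI)

# Hypothetical helper: open a connection, run f(con), and always disconnect,
# even if f() raises an error.
with_connection <- function(drv, ..., f) {
  con <- dbConnect(drv, ...)
  on.exit(dbDisconnect(con), add = TRUE)
  f(con)
}

# Usage sketch with the connection details from these notes
# (needs RMySQL and network access):
# n_tables <- with_connection(
#   RMySQL::MySQL(), user = "genome", db = "hg19",
#   host = "genome-mysql.cse.ucsc.edu",
#   f = function(con) length(dbListTables(con))
# )
```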

Further resources


2.2 Reading data from HDF5



http://www.bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.pdf
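As a reminder of what the basic workflow looks like, here is a small sketch using the rhdf5 package from Bioconductor: create a file, add a group, write a matrix, and read it back. The group name `foo` and the matrix `A` are placeholders chosen for illustration, and the file is written to a temporary path.

```r
library(rhdf5)  # Bioconductor package

path <- tempfile(fileext = ".h5")   # temporary file, so nothing is left behind
h5createFile(path)                  # create an empty HDF5 file
h5createGroup(path, "foo")          # groups work like folders inside the file

A <- matrix(1:10, nrow = 5, ncol = 2)
h5write(A, path, "foo/A")           # store the matrix as dataset "A" in group "foo"

readA <- h5read(path, "foo/A")      # read the dataset back into R
h5ls(path)                          # list the groups and datasets in the file
```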







2.3 Reading data from the web

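There are several ways to do this. One is readLines, but its output is not well formatted: the page comes back as raw lines of HTML.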

Parsing the page with the XML package covered earlier gives better results.


First create a connection, then read from it, and close it when you are done, just as when reading from a database.
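The open / read / close steps can be sketched with base R alone. `read_page` is a hypothetical helper name, and the URL in the usage comment is only a placeholder, not the page used in the course.

```r
# Hypothetical helper: read a page line by line through a connection.
read_page <- function(location) {
  con <- url(location)              # step 1: create the connection
  on.exit(close(con), add = TRUE)   # step 3: close it when the function exits
  readLines(con, warn = FALSE)      # step 2: read the raw lines
}

# Usage sketch (needs network access; placeholder URL):
# htmlCode <- read_page("https://www.r-project.org/")
# head(htmlCode)
```

The result is a character vector with one element per line of HTML, which is why it is awkward to work with directly.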



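Here the httr package is used, and GET fetches the page. The second step converts the fetched content, html2, to text; at this point it looks much like what readLines returns. The third step uses a function from the XML package, so the two packages work together.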

The next two steps are the same as the usual XML operations.
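Put together, the steps look roughly like this. The sketch assumes httr for the request and htmlParse / xpathSApply from XML for the parsing; `parse_title` and `fetch_title` are hypothetical helper names, and the URL in the usage comment is a placeholder.

```r
library(httr)
library(XML)

# Parse HTML text and pull out the contents of every <title> node.
parse_title <- function(html_text) {
  parsed <- htmlParse(html_text, asText = TRUE)
  xpathSApply(parsed, "//title", xmlValue)
}

# Fetch a page with GET, convert the response body to text, then parse it.
fetch_title <- function(page_url) {
  html2 <- GET(page_url)                                       # step 1: fetch
  content2 <- content(html2, as = "text", encoding = "UTF-8")  # step 2: to text
  parse_title(content2)                                        # step 3: XML package
}

# Usage sketch (needs network access; placeholder URL):
# fetch_title("https://www.r-project.org/")
```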


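So what is the advantage of GET? It can handle authentication. A status of 200 means the request succeeded and we have access; a status of 401, which is quite common, means we are not authorized to access the page.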
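A small sketch of checking the status with and without credentials. It uses authenticate and status_code from httr; the helper names are hypothetical, and the httpbin.org test endpoint in the usage comment is an assumption of mine rather than something taken from these notes.

```r
library(httr)

# Request a page, optionally with basic authentication, and return the HTTP status.
fetch_status <- function(page_url, user = NULL, password = NULL) {
  resp <- if (is.null(user)) {
    GET(page_url)
  } else {
    GET(page_url, authenticate(user, password))
  }
  status_code(resp)
}

# Translate the two status codes discussed above into words.
describe_status <- function(code) {
  if (code == 200) {
    "ok: access granted"
  } else if (code == 401) {
    "unauthorized: credentials missing or wrong"
  } else {
    paste("other status:", code)
  }
}

# Usage sketch (needs network access; assumed public test endpoint):
# describe_status(fetch_status("http://httpbin.org/basic-auth/user/passwd"))
# describe_status(fetch_status("http://httpbin.org/basic-auth/user/passwd",
#                              "user", "passwd"))
```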


With the approach shown in the figure below, there is no need to go through the login step on every request.
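The screenshot itself did not survive, so the following is my best guess at the idea rather than a copy of it. One mechanism httr offers for reusing a session is a handle: requests made through the same handle share cookies, so a login done once carries over. The host below is a placeholder.

```r
library(httr)

# A handle ties requests to one host; cookies set by earlier requests
# are sent again on later requests made through the same handle.
site <- handle("http://httpbin.org")  # placeholder host, an assumption

# Both requests would go through the same handle (they need network access):
# pg1 <- GET(handle = site, path = "/")
# pg2 <- GET(handle = site, path = "get")
```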




2.4 Reading data from APIs

API stands for application programming interface.
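For a feel of what reading from an API involves, here is a minimal sketch for a JSON API that needs no login. It assumes httr for the request and jsonlite for the parsing; the helper names are hypothetical, and the GitHub URL in the usage comment is just an example endpoint. APIs that require OAuth need extra steps not shown here.

```r
library(httr)
library(jsonlite)

# Turn the JSON text of an API response into an R object
# (an array of records becomes a data frame).
parse_api_json <- function(json_text) {
  fromJSON(json_text)
}

# Request a URL, stop on HTTP errors, and parse the JSON body.
get_json <- function(api_url, ...) {
  resp <- GET(api_url, ...)
  stop_for_status(resp)
  parse_api_json(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage sketch (needs network access; example endpoint):
# repos <- get_json("https://api.github.com/users/jtleek/repos")
# head(repos$name)
```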




This section is a bit difficult; I will have to come back and work through it another day.

2.5 Reading data from other sources

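If you need to get data from somewhere else, a Google search of the form "mysql R package" (the name of the data source followed by "R package") is very helpful.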








There is a lot in this course that deserves careful study. Some points to watch out for:


