2.1 Getting data from MySQL
First, a brief introduction to MySQL.
Connecting and listing databases
library(RMySQL)  # provides MySQL(), dbConnect(), dbGetQuery(), ...
ucscDb <- dbConnect(MySQL(),user="genome",
                    host="genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb,"show databases;"); dbDisconnect(ucscDb);
[1] TRUE
result
Database
1 information_schema
2 ailMel1
3 allMis1
4 anoCar1
5 anoCar2
6 anoGam1
7 apiMel1
8 apiMel2
9 aplCal1
10 bosTau2
11 bosTau3
12 bosTau4
13 bosTau5
14 bosTau6
15 bosTau7
16 bosTauMd3
17 braFlo1
18 caeJap1
19 caePb1
20 caePb2
Connecting to hg19 and listing tables
hg19 <- dbConnect(MySQL(),user="genome", db="hg19",
host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)
Here hg19 is a database on the server, and allTables holds the names of the tables inside it.
Get dimensions of a specific table
dbListFields(hg19,"affyU133Plus2")
[1] "bin" "matches" "misMatches" "repMatches" "nCount" "qNumInsert"
[7] "qBaseInsert" "tNumInsert" "tBaseInsert" "strand" "qName" "qSize"
[13] "qStart" "qEnd" "tName" "tSize" "tStart" "tEnd"
[19] "blockCount" "blockSizes" "qStarts" "tStarts"
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
count(*)
1 58463
Read from the table
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert tBaseInsert strand
1 585 530 4 0 23 3 41 3 898 -
2 585 3355 17 0 109 9 67 9 11621 -
3 585 4156 14 0 83 16 18 2 93 -
4 585 4667 9 0 68 21 42 3 5743 -
5 585 5180 14 0 167 10 38 1 29 -
6 585 468 5 0 14 0 0 0 0 -
qName qSize qStart qEnd tName tSize tStart tEnd blockCount
1 225995_x_at 637 5 603 chr1 249250621 14361 15816 5
2 225035_x_at 3635 0 3548 chr1 249250621 14381 29483 17
3 226340_x_at 4318 3 4274 chr1 249250621 14399 18745 18
4 1557034_s_at 4834 48 4834 chr1 249250621 14406 24893 23
5 231811_at 5399 0 5399 chr1 249250621 19688 25078 11
6 236841_at 487 0 487 chr1 249250621 27542 28029 1
blockSizes
1 93,144,229,70,21,
2 73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
3 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
5 131,26,1300,6,4,11,4,7,358,3359,155,
6 487,
qStarts
1 34,132,278,541,611,
2 87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
3 44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
5 0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
6 0,
tStarts
1 14361,14454,14599,14968,15795,
2 14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
3 14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
5 19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
6
Select a specific subset
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)
0% 25% 50% 75% 100%
1 1 2 2 3
affyMisSmall <- fetch(query,n=10); dbClearResult(query);
[1] TRUE
dim(affyMisSmall)
[1] 10 22
Don't forget to close the connection!
dbDisconnect(hg19)
[1] TRUE
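One way to make sure the connection is always closed, even when a query fails halfway, is to register the disconnect with on.exit inside a small wrapper. The sketch below is only an illustration: with_connection is a hypothetical helper (it is not part of RMySQL or DBI), and it is exercised with a dummy connection so that it runs without a database.

```r
# Hypothetical helper (not part of RMySQL/DBI): open a connection, run an
# action on it, and always close it afterwards, even if the action fails.
with_connection <- function(connect, disconnect, action) {
  con <- connect()
  on.exit(disconnect(con), add = TRUE)
  action(con)
}

# Dummy stand-ins so the sketch runs offline. With RMySQL one would pass
#   connect    = function() dbConnect(MySQL(), user = "genome", db = "hg19",
#                                     host = "genome-mysql.cse.ucsc.edu")
#   disconnect = dbDisconnect
# instead of the closures below.
closed <- FALSE
out <- with_connection(
  connect    = function() list(name = "dummy"),
  disconnect = function(con) { closed <<- TRUE },
  action     = function(con) toupper(con$name)
)
print(out)
print(closed)
```

The design choice here is that the caller never has to remember dbDisconnect: the cleanup is tied to the function exit rather than to the last line of a script.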
Further resources
- RMySQL vignette http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf
- List of commands http://www.pantz.org/software/mysql/mysqlcommands.html
- Do not, do not, delete, add or join things from ensembl. Only select.
- In general be careful with mysql commands
- A nice blog post summarizing some other commands http://www.r-bloggers.com/mysql-and-r/
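The "only select" advice above can be backed by a small check before a statement is sent to a shared server. This is a minimal sketch under stated assumptions: is_read_only is a hypothetical helper, the list of allowed keywords (select, show, describe) is my own choice, and the check only looks at the first keyword, so it is a guard against accidents rather than a security measure.

```r
# Hypothetical guard: TRUE only when the statement starts with a read-only
# keyword. It inspects the first keyword only and does not parse SQL.
is_read_only <- function(sql) {
  grepl("^\\s*(select|show|describe)\\b", sql, ignore.case = TRUE, perl = TRUE)
}

print(is_read_only("select count(*) from affyU133Plus2"))
print(is_read_only("show databases;"))
print(is_read_only("DROP TABLE affyU133Plus2"))
```

In a script, one could call stopifnot(is_read_only(sql)) before passing sql to dbGetQuery or dbSendQuery.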
2.2 Getting data from HDF5
http://www.bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.pdf
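2.3 Reading data from the web
There are several ways to do this. One is readLines, but its output is not well formatted.
The XML package covered earlier gives better results.
Create a connection first, read from it, and close it when done, the same as when reading from a database.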
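The open, read, close pattern can be sketched in base R as follows. To keep the example runnable offline, a textConnection stands in for the web page; read_all_lines is a hypothetical helper name, and the address mentioned in the comment is only a placeholder.

```r
# Read every line from a connection and close it afterwards, even on error.
read_all_lines <- function(con) {
  on.exit(close(con), add = TRUE)
  readLines(con)
}

# Offline stand-in for a web page. For a real page one would pass
# url("http://example.com/") (placeholder address) instead.
page <- textConnection(c("<html>", "<body>Hello</body>", "</html>"))
html_lines <- read_all_lines(page)
print(html_lines)
```

The result is a character vector with one element per line of raw HTML, which is why the output of readLines is awkward to work with directly.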
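The httr package instead uses GET to fetch the web page. In the second step, the fetched object html2 is converted to text, which at that point looks much like what readLines returns. The third step parses that text with a function from the XML package, so the two packages are used together.
The next two steps are the same as ordinary XML processing.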
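The parsing half of that pipeline can be sketched as below. This is an illustration under assumptions: the fetch step is replaced by a literal HTML string so the code runs offline (in the live case the string would come from something like content(GET(url), as = "text") in httr), the page content is made up, and the parsing only runs if the XML package is installed.

```r
# Stand-in for the text returned by the fetch step (made-up page content).
html_text <- paste0(
  "<html><head><title>Example page</title></head>",
  "<body><p>Hello</p></body></html>"
)

# Parse the text and pull out the <title> node with an XPath query.
# Guarded so the script still runs when the XML package is not installed.
if (requireNamespace("XML", quietly = TRUE)) {
  parsed <- XML::htmlParse(html_text, asText = TRUE)
  page_title <- XML::xpathSApply(parsed, "//title", XML::xmlValue)
  print(page_title)
}
```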
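So what is the advantage of GET? It can pass authentication. A status of 200 means the request succeeded and we can access the page; a status of 401, which is quite common, means access was denied because credentials were missing or rejected.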
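The status handling described above might look like the following sketch. describe_status and get_with_login are hypothetical helper names; the second one uses httr's GET, authenticate, and status_code, and is only defined here, not called, because it needs the httr package and network access.

```r
# Hypothetical helper: turn an HTTP status code into a short description.
describe_status <- function(code) {
  if (code == 200) {
    "OK: request succeeded"
  } else if (code == 401) {
    "Unauthorized: credentials missing or rejected"
  } else {
    paste("Other status:", code)
  }
}

# Sketch of an authenticated request with httr (defined only, not called).
get_with_login <- function(url, user, password) {
  resp <- httr::GET(url, httr::authenticate(user, password))
  describe_status(httr::status_code(resp))
}

print(describe_status(200))
print(describe_status(401))
```

With network access, a call such as get_with_login("http://httpbin.org/basic-auth/user/passwd", "user", "passwd") would be one way to try it; that test endpoint is my assumption, not something stated in these notes.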
With the approach shown in the figure below, there is no need to go through the trouble of logging in on every request.
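2.4 Reading data from APIs
API stands for application programming interface.
This section is fairly difficult; I need to come back to it another day.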
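2.5 Getting data from other sources
If you need to get data from some other source, a Google search along the lines of "mysql R package" is very helpful.
There is a lot in this course worth studying carefully. Some points to note: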