Programming Collective Intelligence (2): Discovering Groups

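Blog: http://andyheart.me. Posts appear on my own blog first and are then mirrored to CSDN.

If you spot any mistakes, please send me a private message or leave a comment. Thanks.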

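Concepts

Data clustering: a method for finding closely related things, people, or ideas and visualizing them. The goal is to collect data and then identify the distinct groups within it.
Supervised learning: techniques that use example inputs and expected outputs to learn how to make predictions. Examples include neural networks, decision trees, support vector machines, and Bayesian filtering.
Unsupervised learning: looking for structure within a set of data, where no piece of the data is itself the answer being sought. Clustering is one example.

Hierarchical clustering: builds a hierarchy of groups by repeatedly merging the two most similar groups. Each group starts out as a single item.
K-means clustering: first place K centroids at random, then assign every data item to its nearest centroid. Once the assignment is complete, each centroid moves to the average position of all the items assigned to it, and the assignment starts over. This repeats until the assignments no longer change.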

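Main topics

Building the data that the algorithms need from a variety of sources;
Two different clustering algorithms (hierarchical clustering and K-means clustering);
More about distance metrics;
Simple graphical visualization code for viewing the generated groups;
Projecting very complex datasets into two dimensions.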

Examples

Grouping bloggers
Clustering blogs by how often words appear in them reveals which people frequently write about similar topics.

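(1) Counting the words in a feed

An RSS feed is a simple XML document containing information about a blog and all of its article entries. To count words, the feeds must first be parsed, which can be done with Universal Feed Parser.

About the code: this part builds the dataset that will be processed later. It is implemented in Python in the file generatefeedvector. The flow is: use Universal Feed Parser to parse, one by one, the RSS feeds at the addresses listed in feedlist.txt; take the title and summary of each entry; split them into words; and count the words.

The Python code is as follows: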

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}

  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description

    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1
  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  # (a second ^ inside the class would be treated as a literal caret)
  words=re.compile(r'[^A-Za-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']


apcount={}
wordcounts={}
# Strip the trailing newline from each URL and skip blank lines
feedlist=[line.strip() for line in open('feedlist.txt') if line.strip()]
for feedurl in feedlist:
  try:
    title,wc=getwordcounts(feedurl)
    wordcounts[title]=wc
    for word,count in wc.items():
      apcount.setdefault(word,0)
      if count>1:
        apcount[word]+=1
  except Exception:
    print('Failed to parse feed %s' % feedurl)

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5:
    wordlist.append(w)

# Write the matrix: one row per blog, one column per word
with open('blogdata1.txt','w') as out:
  out.write('Blog')
  for word in wordlist: out.write('\t%s' % word)
  out.write('\n')
  for blog,wc in wordcounts.items():
    print(blog)
    out.write(blog)
    for word in wordlist:
      if word in wc: out.write('\t%d' % wc[word])
      else: out.write('\t0')
    out.write('\n')

A few of the URLs in feedlist.txt are listed below:

http://gofugyourself.typepad.com/go_fug_yourself/index.rdf
http://googleblog.blogspot.com/rss.xml
http://feeds.feedburner.com/GoogleOperatingSystem
http://headrush.typepad.com/creating_passionate_users/index.rdf
http://feeds.feedburner.com/instapundit/main
http://jeremy.zawodny.com/blog/rss2.xml
http://joi.ito.com/index.rdf
http://feeds.feedburner.com/Mashable
http://michellemalkin.com/index.rdf
http://moblogsmoproblems.blogspot.com/rss.xml
http://newsbusters.org/node/feed
http://beta.blogger.com/feeds/27154654/posts/full?alt=rss
http://feeds.feedburner.com/paulstamatiou
http://powerlineblog.com/index.rdf

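(2) Hierarchical clustering of the data

This part computes the Pearson correlation coefficient between the word vectors in the dataset, which measures how closely related two blogs are. Repeatedly merging the most closely related pair then yields the grouping of the blogs. The code for this part lives in clusters.py.
It mainly consists of readfile(), which loads the data file; pearson(v1,v2), which returns the Pearson correlation coefficient of two lists; and hcluster(), the main hierarchical clustering function.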
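The post names these three functions without listing them, so here is a minimal Python 3 sketch of what they might look like. The function names come from the text; the `bicluster` class, its fields, and the choice to have `pearson` return 1 minus the correlation (so that a smaller value means "closer", which is how `kcluster` below uses its `distance` argument) are assumptions made for illustration, not necessarily the exact contents of clusters.py.

```python
from math import sqrt

def readfile(filename):
    """Read a tab-separated matrix: a header row of words, then one row per blog."""
    with open(filename) as f:
        lines = [line.rstrip('\n') for line in f if line.strip()]
    colnames = lines[0].split('\t')[1:]
    rownames, data = [], []
    for line in lines[1:]:
        parts = line.split('\t')
        rownames.append(parts[0])
        data.append([float(x) for x in parts[1:]])
    return rownames, colnames, data

def pearson(v1, v2):
    """Return 1 - Pearson correlation (assumption: used as a distance)."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(x * x for x in v1)
    sum2sq = sum(x * x for x in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - sum1 * sum2 / n
    # max(..., 0.0) guards against tiny negative values from rounding
    den = sqrt(max((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n), 0.0))
    if den == 0:
        return 0.0
    return 1.0 - num / den

class bicluster:
    """Hypothetical tree node: a leaf holds one row, a branch holds two children."""
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec = vec
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id

def hcluster(rows, distance=pearson):
    distances = {}          # cache of pairwise distances, keyed by cluster ids
    currentclustid = -1     # merged clusters get negative ids
    clust = [bicluster(rows[i], id=i) for i in range(len(rows))]

    while len(clust) > 1:
        lowestpair = (0, 1)
        closest = distance(clust[0].vec, clust[1].vec)
        # Look at every pair and remember the closest one
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                key = (clust[i].id, clust[j].id)
                if key not in distances:
                    distances[key] = distance(clust[i].vec, clust[j].vec)
                if distances[key] < closest:
                    closest = distances[key]
                    lowestpair = (i, j)
        a, b = clust[lowestpair[0]], clust[lowestpair[1]]
        # The merged cluster sits at the average of its two children
        mergevec = [(x + y) / 2.0 for x, y in zip(a.vec, b.vec)]
        newcluster = bicluster(mergevec, left=a, right=b,
                               distance=closest, id=currentclustid)
        currentclustid -= 1
        del clust[lowestpair[1]]   # delete the higher index first
        del clust[lowestpair[0]]
        clust.append(newcluster)
    return clust[0]
```

With the file produced in part (1), usage would be along the lines of `blognames, words, data = readfile('blogdata1.txt')` followed by `tree = hcluster(data)`; leaves of the returned tree carry the row index of a blog in `id`.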
(3) Visualizing hierarchical clusters (drawing a dendrogram)
The dendrogram of the clusters is drawn mainly with Python's PIL package.
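Before anything is drawn, the image size and horizontal scale have to be derived from the cluster tree. The sketch below covers only that layout step and deliberately makes no PIL calls so it runs without extra dependencies. The `Node` class, the `layout` helper, and the pixel constants are assumptions for illustration; a real drawing routine would pass the results on to PIL.

```python
class Node:
    """Hypothetical stand-in for a cluster-tree node (a leaf has no children)."""
    def __init__(self, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id

def getheight(clust):
    # A leaf takes up one row; a branch is as tall as its children combined
    if clust.left is None and clust.right is None:
        return 1
    return getheight(clust.left) + getheight(clust.right)

def getdepth(clust):
    # A leaf has depth 0; a branch adds its own merge distance
    # to the deeper of its two children
    if clust.left is None and clust.right is None:
        return 0.0
    return max(getdepth(clust.left), getdepth(clust.right)) + clust.distance

def layout(clust, pixels_per_leaf=20, width=1200, margin=150):
    """Return (image height in pixels, pixels per unit of distance)."""
    height = getheight(clust) * pixels_per_leaf
    depth = getdepth(clust)
    scaling = float(width - margin) / depth if depth > 0 else 0.0
    return height, scaling
```

The idea is that each leaf gets a fixed vertical slot, while horizontal line lengths are proportional to the merge distance, so tightly related blogs are joined by short lines.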
(4) K-means clustering of the data
The Python code is as follows:

import random

# pearson is the distance function from clusters.py
def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
          for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
             for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print('Iteration %d' % t)
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches

Finally, the chapter covers clustering of preferences, which is helpful when writing a recommendation engine.
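Preference data is usually a 0/1 matrix (a person either wants an item or does not), and Pearson correlation is a poor fit for it. One distance commonly used in that situation is the Tanimoto coefficient, the ratio of shared items to all items in either set. The function below is a minimal sketch under that assumption; returning 0.0 for two empty vectors is a choice made here to avoid dividing by zero.

```python
def tanimoto(v1, v2):
    """Return 1 - (shared items / items in either vector) for 0/1 vectors."""
    c1 = c2 = shared = 0
    for a, b in zip(v1, v2):
        if a != 0:
            c1 += 1          # item present in v1
        if b != 0:
            c2 += 1          # item present in v2
        if a != 0 and b != 0:
            shared += 1      # item present in both
    union = c1 + c2 - shared
    if union == 0:
        return 0.0           # assumption: two empty vectors count as identical
    return 1.0 - float(shared) / union
```

Because it takes two vectors and returns a value where smaller means closer, a function like this could be passed as the `distance` argument of a clustering routine in place of `pearson`.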
