Implementing K-means on Hadoop with Mahout

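Mahout mainly implements three families of algorithms: collaborative filtering, clustering, and classification. Here we use Mahout to run the classic K-means clustering algorithm.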

First, download Hadoop and Mahout. Many of Mahout's implementations run on Hadoop, so install Hadoop first.

How do you install it? Briefly:

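1. Install SSH first

ufw disable    disable the firewall

sudo apt-get install openssh-server    install the SSH server

sudo service ssh start    start the SSH service

ssh-keygen -t rsa    generate an SSH key pair; this creates ~/.ssh if it does not exist yet

cd ~/.ssh/    enter the .ssh directory

cp id_rsa.pub authorized_keys    copy the public key so that localhost accepts it

ssh localhost    test the connection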

2. Unpack Hadoop

tar -zxvf hadoop-1.1.2.tar.gz    unpack the tar.gz archive

3. Add environment variables

export JAVA_HOME=/usr/local/jdk7    set JAVA_HOME

export PATH=.:$JAVA_HOME/bin:$PATH    put the JDK on the PATH

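4. For a single-machine setup, at least four configuration files need to be modified (for Hadoop 1.x these are typically conf/hadoop-env.sh, core-site.xml, hdfs-site.xml and mapred-site.xml)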

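5. Other commands

hadoop namenode -format    format the Hadoop NameNode; the DataNode does not need formatting

start-all.sh    start all Hadoop services

stop-all.sh    stop all Hadoop services

start-dfs.sh    start HDFS only

stop-dfs.sh    stop HDFS only

start-mapred.sh    start the two MapReduce services (JobTracker and TaskTracker)

hadoop-daemon.sh start [daemon name]    start a single daemon

jps    list the running Java processes

ps -e | grep ssh    check whether the SSH service is running

ifconfig -a | grep inet    show the network addresses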

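6. Installing Mahout is similar

Unpack it, set the environment variables, then run the mahout command. If it lists the available algorithms, the installation succeeded.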

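http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz

You can download the Reuters21578 text corpus from the URL above, or prepare your own dataset. I used my own dataset for this walkthrough.

I collected information on 1,000 songs:

[Figure: sample of the collected song records]

I stored these records in a MongoDB database because I will need them again later, though that step is optional. Java code then reads them back and generates one txt file per song. Along the way, each tag value is given a different weight, and the lyrics are segmented into words.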

Map<String,Object> outmap = new HashMap<String, Object>();
        outmap.put("flag",false);
        List<Song> list = songRepository.findAll();
        // Check for null before touching the list (the size was previously read first)
        if (list != null){
            int size = list.size() ;
            String[] strs = new String[size];
            // Loop over every song
            for (int i = 0; i < size; i++) {
                Song song = list.get(i);
                // Weighted tags: each tag is repeated as many times as its weight
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < 8; j++) {
                    sb.append(song.getArtist()).append(" ");
                }
                for (int j = 0; j < 2; j++) {
                    sb.append(song.getAlbum()).append(" ");
                }
                for (int j = 0; j < 5; j++) {
                    sb.append(song.getType()).append(" ");
                }
                for (int j = 0; j < 3; j++) {
                    sb.append(song.getDistrict()).append(" ");
                }
                for (int j = 0; j < 6; j++) {
                    sb.append(song.getYears()).append(" ");
                }
                for (int j = 0; j < 3; j++) {
                    sb.append(song.getRhythm()).append(" ");
                }
                for (int j = 0; j < 4; j++) {
                    sb.append(song.getMood()).append(" ");
                }
                // Lyrics carry no extra weight
                String strLrc = song.getLrc() ;
                // Segment the lyrics into space-separated words
                strLrc = SplitWord.splitWordBySpace(strLrc);
                sb.append(strLrc);
                strs[i] = sb.toString() ;
            }
            // Write one txt file per song
            WriteLines.writeStrBecomeTxts("C:\\Users\\xin\\Desktop\\大論文\\Scala","utf-8",strs);
            outmap.put("flag",true);
            return outmap ;
        } else {
            return outmap ;
        }
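The weighting above works purely by repetition: a tag that appears 8 times in the text ends up with 8 times the term frequency of a tag that appears once. The following is a minimal, self-contained sketch of that idea. The class WeightedDoc and its methods are hypothetical names introduced for illustration and are not part of the original project; the weights 8 and 5 mirror the artist and type weights used above, and the tag values are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original project): builds the
// "weighted" document text by repeating each tag weight-many times.
final class WeightedDoc {

    // Appends `tag` followed by a space, `weight` times.
    static void repeat(StringBuilder sb, String tag, int weight) {
        for (int i = 0; i < weight; i++) {
            sb.append(tag).append(' ');
        }
    }

    // Tags are emitted in insertion order; the segmented lyrics are
    // appended once at the end, i.e. without extra weight.
    static String build(Map<String, Integer> tagsWithWeights, String segmentedLyrics) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : tagsWithWeights.entrySet()) {
            repeat(sb, e.getKey(), e.getValue());
        }
        return sb.append(segmentedLyrics).toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> tags = new LinkedHashMap<>();
        tags.put("someArtist", 8); // artist weight used in the post
        tags.put("classic", 5);    // type weight used in the post
        System.out.println(build(tags, "word1 word2 word3"));
    }
}
```

Collapsing the seven near-identical loops into one helper like this would also make it easier to tune the weights later.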

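The generated files look like this:

[Figure: listing of the generated txt files]

Pack these files into a single file in SequenceFile format, which Hadoop can read: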

mahout seqdirectory -i file:/usr/song-input -o file:/usr/song-output -c UTF-8 -chunk 64 -xm sequential

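The file: prefix means the path is looked up on the local file system rather than on HDFS. -xm sequential means the job runs locally instead of as a MapReduce job.

-chunk 64 writes output chunks of up to 64 MB each, which matches the default HDFS block size of 64 MB in Hadoop 1.x.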

Next, convert the SequenceFile into vectors. Put the file generated in the previous step onto HDFS, then run:

hadoop fs -mkdir input
hadoop fs -put /usr/song-output/chunk-0 input
mahout seq2sparse -i input -o output -ow --weight tfidf --maxDFPercent 95 --namedVector -a org.apache.lucene.analysis.core.WhitespaceAnalyzer

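-i    input directory

-o    output directory

-ow    overwrite the output directory if it already exists

--weight    weighting scheme (tf or tfidf)

--maxDFPercent    drop very common terms, here those that occur in more than 95% of the documents

--namedVector    keep each document's name in its vector

-a    the analyzer class; the lyrics were already segmented with IK earlier, so splitting on whitespace is enough here

[Figure: screenshot of the full parameter list]

Generated directory: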

root@xin:~# hadoop fs -ls output
Warning: $HADOOP_HOME is deprecated.

Found 7 items
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/df-count
-rw-r--r--   1 root supergroup      48768 2015-03-30 14:10 /user/root/output/dictionary.file-0
-rw-r--r--   1 root supergroup      51433 2015-03-30 14:11 /user/root/output/frequency.file-0
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/tf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:12 /user/root/output/tfidf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:09 /user/root/output/tokenized-documents
drwxr-xr-x   - root supergroup          0 2015-03-30 14:10 /user/root/output/wordcount

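· dictionary.file-0: maps term text -> term id (int). Replacing terms with ids is common practice.

· frequency.file-0: term id -> document frequency (df), stored as a chunked copy of df-count.

· wordcount (directory): term text -> collection frequency (cf); this appears to be the information before any filtering.

· df-count (directory): term id -> document frequency (df).

· tf-vectors, tfidf-vectors (both directories): the document vectors, one per document, in the format {id:value}, where the value is the tf or tfidf weight. They are stored as the built-in VectorWritable type, so use "mahout vectordump -i <path>" to view them.

· tokenized-documents: the documents after tokenization.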
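To make the relationship between tf, df and the tfidf vectors concrete, here is a small sketch in plain Java. It uses the textbook weight tf * ln(N / df) and a maxDFPercent-style cutoff; this is an assumption made for illustration, and Mahout's own weighting may differ in detail. The class TfIdfSketch and its method names are hypothetical, and terms are keyed by text rather than by dictionary id to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of TF-IDF weighting with a max-DF cutoff.
final class TfIdfSketch {

    // Document frequency: the number of documents that contain each term.
    static Map<String, Integer> documentFrequency(List<String[]> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            for (String term : new HashSet<String>(Arrays.asList(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Textbook weight tf * ln(N / df). Terms that occur in more than
    // maxDfPercent percent of the N documents are dropped.
    static List<Map<String, Double>> vectorize(List<String[]> docs, int maxDfPercent) {
        Map<String, Integer> df = documentFrequency(docs);
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (String[] doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                int d = df.get(e.getKey());
                if (d * 100 > maxDfPercent * n) {
                    continue; // too common to be informative
                }
                vec.put(e.getKey(), e.getValue() * Math.log((double) n / d));
            }
            vectors.add(vec);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String[]> docs = new ArrayList<>();
        docs.add("love song love".split(" "));
        docs.add("rock song".split(" "));
        // "song" occurs in every document, so the 95% cutoff removes it.
        System.out.println(vectorize(docs, 95));
    }
}
```

Note that with this weighting a term that occurs in every document would get weight 0 anyway; the cutoff additionally removes it from the vectors.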

Now we can run K-means!

mahout kmeans -i output/tfidf-vectors -c output/kmeans-clusters -o output/kmeans -k 10 -x 200 -ow --clustering

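The parameters are:

    -i: the input, i.e. the tfidf vectors produced above.

    -o: the output directory; the result of every iteration is written here.

    -k: the number of clusters.

    -c: the directory of initial cluster centers, and the trickiest option. If -k is not set, the points already in this directory are used as the initial centers. If -k is set, k points are sampled at random from the input as the initial centers and written to this directory.

    -dm: the distance measure; cosine distance is recommended for text. The command above does not pass -dm, so Mahout's default measure (squared Euclidean) is used.

    -x: the maximum number of iterations.

    --clustering: after the iterations finish, assign every input point to a cluster, which produces clusteredPoints. MapReduce versus local execution is chosen with -xm, not with this flag.

    --convergenceDelta: the convergence threshold, 0.5 by default, which is rather large for cosine distance.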
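To see why 0.5 is large for cosine distance, recall that cosine distance is 1 minus the cosine of the angle between two vectors. For tf-idf vectors, whose components are never negative, it therefore lies between 0 and 1, so a threshold of 0.5 spans half of the possible range. Below is a minimal sketch over sparse vectors stored as maps from term id to weight. The class CosineSketch is a hypothetical name, and returning 1.0 for an all-zero vector is a convention chosen for this sketch only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of cosine distance on sparse vectors
// represented as term id -> weight maps.
final class CosineSketch {

    // Dot product over the term ids the two vectors share.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // 1 - cos(angle between a and b).
    static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        double norms = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        if (norms == 0.0) {
            return 1.0; // convention for this sketch: a zero vector is maximally distant
        }
        return 1.0 - dot(a, b) / norms;
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = new HashMap<>();
        a.put(1, 1.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(1, 2.0); // same direction as a, different length
        Map<Integer, Double> c = new HashMap<>();
        c.put(2, 1.0); // shares no term with a
        System.out.println(distance(a, b));
        System.out.println(distance(a, c));
    }
}
```

Vectors that point in the same direction have distance 0 regardless of document length, which is the main reason this measure suits text.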

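In the output, clusters-N (clusters-N-final for the last iteration) holds the cluster centers after iteration N, 10 of them with -k 10.

clusteredPoints stores the mapping cluster id -> document id.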
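The roles of the initial centers, the iteration limit, the convergence threshold and the final assignment step can be sketched in a few lines of plain Java. This is a single-machine illustration under simplifying assumptions, not Mahout's implementation: it uses dense arrays, Euclidean distance, caller-supplied initial centers, and treats the run as converged once no center moves by more than delta. The class KMeansSketch and its method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical single-machine K-means illustration.
final class KMeansSketch {

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Index of the center closest to p (ties go to the lower index).
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (dist(p, centers[c]) < dist(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Updates `centers` in place and returns the cluster index of each point.
    static int[] cluster(double[][] points, double[][] centers, int maxIter, double delta) {
        int dim = points[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old center
                }
                for (int i = 0; i < dim; i++) {
                    sums[c][i] /= counts[c];
                }
                if (dist(sums[c], centers[c]) > delta) {
                    converged = false;
                }
                centers[c] = sums[c];
            }
            if (converged) {
                break;
            }
        }
        // Final assignment pass, analogous to what --clustering adds.
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignment[i] = nearest(points[i], centers);
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, centers, 200, 0.001)));
    }
}
```

The returned array plays the role of clusteredPoints, while the updated centers array corresponds to the contents of the final clusters directory.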

 

 

It is easiest to copy the result directory kmeans out of HDFS before inspecting it.

hadoop fs -get output/kmeans/* /usr/song-kmeans/
Warning: $HADOOP_HOME is deprecated.

hadoop fs -get output/dictionary.file-0 /usr/song-kmeans
Warning: $HADOOP_HOME is deprecated.

mahout clusterdump -i file:///usr/song-kmeans/clusters-5-final  -d file:///usr/song-kmeans/dictionary.file-0 -dt sequencefile -o /usr/song-result/result  -n 20

mahout seqdumper -i file:///usr/song-kmeans/clusteredPoints  -o /usr/song-result/all

clusteredPoints is itself just a SequenceFile.

Contents of the result file:

[Figure: excerpt of the clusterdump result file]


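As you can see, there are too many uninformative words. The word segmentation is not good enough, and these words need to be filtered out!

The leading 26 is the cluster ID, and n=7 means the cluster contains 7 documents. c is the cluster's center vector, in the format term:weight (the coordinates of the point). r is the cluster's radius vector, in the format term:radius.

The Top Terms listed below it are the characteristic terms selected for the cluster.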


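Contents of the all file:

[Figure: excerpt of the seqdumper output]

The Key is the cluster ID, as explained for clusterdump above.

The Value is the clustering result for one document. wt is the probability that the document belongs to the cluster, which is always 1.0 for kmeans. /1.txt is the document identifier; this is where the --namedVector (-nv) option passed to seq2sparse earlier pays off. What follows are the term ids and weights of that point.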
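Once the cluster ID and document name of each record have been extracted from the dump, grouping them shows at a glance how many documents each cluster received. The sketch below assumes that extraction has already happened and starts from a map of document name to cluster ID; the class ClusterSizes and the sample entries are hypothetical and only illustrate the grouping step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: inverts document -> cluster into cluster -> documents.
final class ClusterSizes {

    // Clusters are returned sorted by ID; documents keep their input order.
    static Map<Integer, List<String>> group(Map<String, Integer> docToCluster) {
        Map<Integer, List<String>> byCluster = new TreeMap<>();
        for (Map.Entry<String, Integer> e : docToCluster.entrySet()) {
            byCluster.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Made-up sample assignments, not taken from the real run.
        Map<String, Integer> docToCluster = new LinkedHashMap<>();
        docToCluster.put("/1.txt", 26);
        docToCluster.put("/2.txt", 26);
        docToCluster.put("/3.txt", 4);
        for (Map.Entry<Integer, List<String>> e : group(docToCluster).entrySet()) {
            System.out.println(e.getKey() + " n=" + e.getValue().size() + " " + e.getValue());
        }
    }
}
```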

 

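One cluster holds noticeably more documents than the others, and the documents are not spread evenly across the clusters, so the clustering quality is not very good. The quality of the documents still needs improving!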
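One straightforward way to improve the documents is to remove stop words from the segmented lyrics before the txt files are written. In the sample data, function words such as 也, 不 and 是 would be natural candidates. The sketch below shows the filtering step on whitespace-separated tokens; the class StopWordFilter is a hypothetical name, and the English tokens and stop-word list are placeholders rather than a recommended list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical pre-processing step: drops stop words from segmented text.
final class StopWordFilter {

    // `segmented` holds whitespace-separated tokens; the result is joined
    // with single spaces and contains only tokens not in `stopWords`.
    static String filter(String segmented, Set<String> stopWords) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : segmented.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("also", "not", "is"));
        System.out.println(filter("read you thousand times also not tired", stopWords));
    }
}
```

Applied right after the segmentation call, this keeps the vocabulary smaller and lets the tag words and distinctive lyric words dominate the tfidf vectors.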

The whole session:

root@xin:~# vi /etc/profile
root@xin:~# start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-namenode-xin.out
xin: starting datanode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-datanode-xin.out
xin: starting secondarynamenode, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-secondarynamenode-xin.out
starting jobtracker, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-jobtracker-xin.out
xin: starting tasktracker, logging to /usr/local/hadoop-1.1.2/libexec/../logs/hadoop-root-tasktracker-xin.out
root@xin:~# jps
3149 NameNode
3541 SecondaryNameNode
3782 TaskTracker
3937 Jps
3632 JobTracker
3382 DataNode


=============================




root@xin:~# mahout seqdirectory -i file:/usr/song-input/ -o file:/usr/song-output/ -c UTF-8 -chunk 64 -xm sequential 
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 13:57:11 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[file:/usr/song-input/], --keyPrefix=[], --method=[sequential], --output=[file:/usr/song-output/], --startPhase=[0], --tempDir=[temp]}
15/03/30 13:57:11 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/30 13:57:11 INFO driver.MahoutDriver: Program took 411 ms (Minutes: 0.00685)



====================================================


root@xin:~# hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - root supergroup          0 2015-03-29 19:49 /user/root/input
drwxr-xr-x   - root supergroup          0 2015-03-29 22:31 /user/root/look
drwxr-xr-x   - root supergroup          0 2015-03-29 20:05 /user/root/output
root@xin:~# hadoop fs -rmr input
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://xin:9000/user/root/input
root@xin:~# hadoop fs -rmr look
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://xin:9000/user/root/look
root@xin:~# hadoop fs -rmr output
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://xin:9000/user/root/output
root@xin:~# hadoop fs -mkdir input
Warning: $HADOOP_HOME is deprecated.

root@xin:~# hadoop fs -put /usr/song-output/chunk-0 input
Warning: $HADOOP_HOME is deprecated.

==========================

root@xin:~# hadoop fs -text input/chunk-0

/58.txt	蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴 蔡琴情歌 蔡琴情歌 經典 經典 經典 經典 經典 臺灣 臺灣 臺灣 70s 70s 70s 70s 70s 70s 慢板 慢板 慢板 祝福 祝福 祝福 祝福 讀你 千遍 也 不 厭倦 讀你 感覺 像 三月 浪漫 季節 醉人 詩篇 唔 讀你 千遍 也 不 厭倦 讀你 感覺 象 春天 喜悅 經典 美麗 句點 唔 眉目之間 鎖 着 愛憐 脣齒 之間 留着 誓言 一切 移動 左右 視線 是 詩篇 讀你 千遍 也 不 厭倦 讀你 千遍 也 不 厭倦 讀你 感覺 像 三月 浪漫 季節 醉人 詩篇 唔 讀你 千遍 也 不 厭倦 讀你 感覺 象 春天 喜悅 經典 美麗 句點 唔 眉目之間 鎖 着 愛憐 脣齒 之間 留着 誓言 一切 移動 左右 視線 是 詩篇 讀你 千遍 也 不 厭倦 眉目之間 鎖 着 愛憐 脣齒 之間 留着 誓言 一切 移動 左右 視線 是 詩篇 讀你 千遍 也 不 厭倦 讀你 千遍 也 不 厭倦 讀你 千遍 也 不 厭倦 讀你 

================================


root@xin:~# mahout seq2sparse -i input -o output -ow --weight tfidf --maxDFPercent  95 --namedVector -a org.apache.lucene.analysis.core.WhitespaceAnalyzer
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
15/03/30 14:09:51 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in input
15/03/30 14:09:51 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:09:52 INFO mapred.JobClient: Running job: job_201503301351_0001
15/03/30 14:09:53 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:00 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:00 INFO mapred.JobClient: Job complete: job_201503301351_0001
15/03/30 14:10:00 INFO mapred.JobClient: Counters: 19
15/03/30 14:10:00 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:00 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4936
15/03/30 14:10:00 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:00 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:00 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:00 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:00 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/03/30 14:10:00 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:00 INFO mapred.JobClient:     Bytes Written=131137
15/03/30 14:10:00 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:00 INFO mapred.JobClient:     HDFS_BYTES_READ=131227
15/03/30 14:10:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=53968
15/03/30 14:10:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=131137
15/03/30 14:10:00 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:00 INFO mapred.JobClient:     Bytes Read=131123
15/03/30 14:10:00 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:00 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:00 INFO mapred.JobClient:     Physical memory (bytes) snapshot=89587712
15/03/30 14:10:00 INFO mapred.JobClient:     Spilled Records=0
15/03/30 14:10:00 INFO mapred.JobClient:     CPU time spent (ms)=590
15/03/30 14:10:00 INFO mapred.JobClient:     Total committed heap usage (bytes)=120061952
15/03/30 14:10:00 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=675536896
15/03/30 14:10:00 INFO mapred.JobClient:     Map output records=149
15/03/30 14:10:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=104
15/03/30 14:10:00 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
15/03/30 14:10:00 INFO vectorizer.DictionaryVectorizer: Creating dictionary from output/tokenized-documents and saving at output/wordcount
15/03/30 14:10:00 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:00 INFO mapred.JobClient: Running job: job_201503301351_0002
15/03/30 14:10:01 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:06 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:13 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:10:15 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:10:15 INFO mapred.JobClient: Job complete: job_201503301351_0002
15/03/30 14:10:15 INFO mapred.JobClient: Counters: 29
15/03/30 14:10:15 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:15 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:10:15 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4037
15/03/30 14:10:15 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:15 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:15 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:15 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:15 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8693
15/03/30 14:10:15 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:15 INFO mapred.JobClient:     Bytes Written=59037
15/03/30 14:10:15 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:15 INFO mapred.JobClient:     FILE_BYTES_READ=69108
15/03/30 14:10:15 INFO mapred.JobClient:     HDFS_BYTES_READ=131267
15/03/30 14:10:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=247350
15/03/30 14:10:15 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=59037
15/03/30 14:10:15 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:15 INFO mapred.JobClient:     Bytes Read=131137
15/03/30 14:10:15 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:15 INFO mapred.JobClient:     Map output materialized bytes=69108
15/03/30 14:10:15 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce shuffle bytes=69108
15/03/30 14:10:15 INFO mapred.JobClient:     Spilled Records=8116
15/03/30 14:10:15 INFO mapred.JobClient:     Map output bytes=117804
15/03/30 14:10:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:10:15 INFO mapred.JobClient:     CPU time spent (ms)=2850
15/03/30 14:10:15 INFO mapred.JobClient:     Combine input records=8090
15/03/30 14:10:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce input records=4058
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce input groups=4058
15/03/30 14:10:15 INFO mapred.JobClient:     Combine output records=4058
15/03/30 14:10:15 INFO mapred.JobClient:     Physical memory (bytes) snapshot=310415360
15/03/30 14:10:15 INFO mapred.JobClient:     Reduce output records=2542
15/03/30 14:10:15 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1367658496
15/03/30 14:10:15 INFO mapred.JobClient:     Map output records=8090
15/03/30 14:10:15 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:15 INFO mapred.JobClient: Running job: job_201503301351_0003
15/03/30 14:10:16 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:21 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:29 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:10:31 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:10:31 INFO mapred.JobClient: Job complete: job_201503301351_0003
15/03/30 14:10:31 INFO mapred.JobClient: Counters: 29
15/03/30 14:10:31 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:31 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:10:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3865
15/03/30 14:10:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:31 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:31 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8558
15/03/30 14:10:31 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:31 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:10:31 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:31 INFO mapred.JobClient:     FILE_BYTES_READ=178553
15/03/30 14:10:31 INFO mapred.JobClient:     HDFS_BYTES_READ=131267
15/03/30 14:10:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=371870
15/03/30 14:10:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:10:31 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:31 INFO mapred.JobClient:     Bytes Read=131137
15/03/30 14:10:31 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:31 INFO mapred.JobClient:     Map output materialized bytes=129393
15/03/30 14:10:31 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce shuffle bytes=129393
15/03/30 14:10:31 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:10:31 INFO mapred.JobClient:     Map output bytes=128796
15/03/30 14:10:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:10:31 INFO mapred.JobClient:     CPU time spent (ms)=2200
15/03/30 14:10:31 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:10:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:10:31 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:10:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=290947072
15/03/30 14:10:31 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:10:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1365016576
15/03/30 14:10:31 INFO mapred.JobClient:     Map output records=149
15/03/30 14:10:31 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:31 INFO mapred.JobClient: Running job: job_201503301351_0004
15/03/30 14:10:32 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:37 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:44 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:10:45 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:10:46 INFO mapred.JobClient: Job complete: job_201503301351_0004
15/03/30 14:10:46 INFO mapred.JobClient: Counters: 29
15/03/30 14:10:46 INFO mapred.JobClient:   Job Counters 
15/03/30 14:10:46 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:10:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3819
15/03/30 14:10:46 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:10:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:10:46 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:10:46 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:10:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8497
15/03/30 14:10:46 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:10:46 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:10:46 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:10:46 INFO mapred.JobClient:     FILE_BYTES_READ=69087
15/03/30 14:10:46 INFO mapred.JobClient:     HDFS_BYTES_READ=70499
15/03/30 14:10:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=248304
15/03/30 14:10:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:10:46 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:10:46 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:10:46 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:10:46 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:10:46 INFO mapred.JobClient:     Map input records=149
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:10:46 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:10:46 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:10:46 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:10:46 INFO mapred.JobClient:     CPU time spent (ms)=1850
15/03/30 14:10:46 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:10:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=128
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:10:46 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:10:46 INFO mapred.JobClient:     Physical memory (bytes) snapshot=296898560
15/03/30 14:10:46 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:10:46 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1361080320
15/03/30 14:10:46 INFO mapred.JobClient:     Map output records=149
15/03/30 14:10:46 INFO common.HadoopUtil: Deleting output/partial-vectors-0
15/03/30 14:10:46 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
15/03/30 14:10:46 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:10:46 INFO mapred.JobClient: Running job: job_201503301351_0005
15/03/30 14:10:47 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:10:52 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:10:59 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:00 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:01 INFO mapred.JobClient: Job complete: job_201503301351_0005
15/03/30 14:11:01 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:01 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:01 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:01 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3979
15/03/30 14:11:01 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:01 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:01 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:01 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:01 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8495
15/03/30 14:11:01 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:01 INFO mapred.JobClient:     Bytes Written=51453
15/03/30 14:11:01 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:01 INFO mapred.JobClient:     FILE_BYTES_READ=35608
15/03/30 14:11:01 INFO mapred.JobClient:     HDFS_BYTES_READ=70500
15/03/30 14:11:01 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=180070
15/03/30 14:11:01 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=51453
15/03/30 14:11:01 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:01 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:01 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:01 INFO mapred.JobClient:     Map output materialized bytes=35608
15/03/30 14:11:01 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce shuffle bytes=35608
15/03/30 14:11:01 INFO mapred.JobClient:     Spilled Records=5086
15/03/30 14:11:01 INFO mapred.JobClient:     Map output bytes=80676
15/03/30 14:11:01 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:01 INFO mapred.JobClient:     CPU time spent (ms)=2120
15/03/30 14:11:01 INFO mapred.JobClient:     Combine input records=6723
15/03/30 14:11:01 INFO mapred.JobClient:     SPLIT_RAW_BYTES=129
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce input records=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce input groups=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Combine output records=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Physical memory (bytes) snapshot=289153024
15/03/30 14:11:01 INFO mapred.JobClient:     Reduce output records=2543
15/03/30 14:11:01 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1364127744
15/03/30 14:11:01 INFO mapred.JobClient:     Map output records=6723
15/03/30 14:11:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
15/03/30 14:11:01 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:02 INFO mapred.JobClient: Running job: job_201503301351_0006
15/03/30 14:11:03 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:08 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:15 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:16 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:17 INFO mapred.JobClient: Job complete: job_201503301351_0006
15/03/30 14:11:17 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:17 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:17 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:17 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3775
15/03/30 14:11:17 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:17 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:17 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:17 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:17 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8512
15/03/30 14:11:17 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:17 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:11:17 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:17 INFO mapred.JobClient:     FILE_BYTES_READ=70763
15/03/30 14:11:17 INFO mapred.JobClient:     HDFS_BYTES_READ=70500
15/03/30 14:11:17 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=149132
15/03/30 14:11:17 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:11:17 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:17 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:17 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:17 INFO mapred.JobClient:     Map output materialized bytes=18910
15/03/30 14:11:17 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce shuffle bytes=18910
15/03/30 14:11:17 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:11:17 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:11:17 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:17 INFO mapred.JobClient:     CPU time spent (ms)=1710
15/03/30 14:11:17 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:11:17 INFO mapred.JobClient:     SPLIT_RAW_BYTES=129
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:11:17 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:11:17 INFO mapred.JobClient:     Physical memory (bytes) snapshot=288608256
15/03/30 14:11:17 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:11:17 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1364774912
15/03/30 14:11:17 INFO mapred.JobClient:     Map output records=149
15/03/30 14:11:17 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:17 INFO mapred.JobClient: Running job: job_201503301351_0007
15/03/30 14:11:18 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:22 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:30 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:31 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:31 INFO mapred.JobClient: Job complete: job_201503301351_0007
15/03/30 14:11:31 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:31 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:31 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3756
15/03/30 14:11:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:31 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:31 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8400
15/03/30 14:11:31 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:31 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:11:31 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:31 INFO mapred.JobClient:     FILE_BYTES_READ=69087
15/03/30 14:11:31 INFO mapred.JobClient:     HDFS_BYTES_READ=70510
15/03/30 14:11:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=247208
15/03/30 14:11:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:11:31 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:31 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:31 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:31 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:11:31 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:11:31 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:11:31 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:11:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:31 INFO mapred.JobClient:     CPU time spent (ms)=1530
15/03/30 14:11:31 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:11:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=139
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:11:31 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:11:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=288825344
15/03/30 14:11:31 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:11:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1367453696
15/03/30 14:11:31 INFO mapred.JobClient:     Map output records=149
15/03/30 14:11:31 INFO common.HadoopUtil: Deleting output/tf-vectors-partial
15/03/30 14:11:31 INFO common.HadoopUtil: Deleting output/tf-vectors-toprune
15/03/30 14:11:31 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:31 INFO mapred.JobClient: Running job: job_201503301351_0008
15/03/30 14:11:32 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:37 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:44 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:11:45 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:11:46 INFO mapred.JobClient: Job complete: job_201503301351_0008
15/03/30 14:11:46 INFO mapred.JobClient: Counters: 29
15/03/30 14:11:46 INFO mapred.JobClient:   Job Counters 
15/03/30 14:11:46 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:11:46 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3788
15/03/30 14:11:46 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:11:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:11:46 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:11:46 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:11:46 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8494
15/03/30 14:11:46 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:11:46 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:11:46 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:11:46 INFO mapred.JobClient:     FILE_BYTES_READ=120932
15/03/30 14:11:46 INFO mapred.JobClient:     HDFS_BYTES_READ=70492
15/03/30 14:11:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250986
15/03/30 14:11:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:11:46 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:11:46 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:11:46 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:11:46 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:11:46 INFO mapred.JobClient:     Map input records=149
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:11:46 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:11:46 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:11:46 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:11:46 INFO mapred.JobClient:     CPU time spent (ms)=1570
15/03/30 14:11:46 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:11:46 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:11:46 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:11:46 INFO mapred.JobClient:     Physical memory (bytes) snapshot=289206272
15/03/30 14:11:46 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:11:46 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1363017728
15/03/30 14:11:46 INFO mapred.JobClient:     Map output records=149
15/03/30 14:11:46 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:11:47 INFO mapred.JobClient: Running job: job_201503301351_0009
15/03/30 14:11:48 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:11:52 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:11:59 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:12:01 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:12:01 INFO mapred.JobClient: Job complete: job_201503301351_0009
15/03/30 14:12:01 INFO mapred.JobClient: Counters: 29
15/03/30 14:12:01 INFO mapred.JobClient:   Job Counters 
15/03/30 14:12:01 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:12:01 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3728
15/03/30 14:12:01 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:12:01 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:12:01 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:12:01 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:12:01 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8502
15/03/30 14:12:01 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:12:01 INFO mapred.JobClient:     Bytes Written=70371
15/03/30 14:12:01 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:12:01 INFO mapred.JobClient:     FILE_BYTES_READ=69087
15/03/30 14:12:01 INFO mapred.JobClient:     HDFS_BYTES_READ=70499
15/03/30 14:12:01 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=248294
15/03/30 14:12:01 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=70371
15/03/30 14:12:01 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:12:01 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:12:01 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:12:01 INFO mapred.JobClient:     Map output materialized bytes=69087
15/03/30 14:12:01 INFO mapred.JobClient:     Map input records=149
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce shuffle bytes=69087
15/03/30 14:12:01 INFO mapred.JobClient:     Spilled Records=298
15/03/30 14:12:01 INFO mapred.JobClient:     Map output bytes=68509
15/03/30 14:12:01 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:12:01 INFO mapred.JobClient:     CPU time spent (ms)=2130
15/03/30 14:12:01 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:12:01 INFO mapred.JobClient:     SPLIT_RAW_BYTES=128
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce input records=149
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce input groups=149
15/03/30 14:12:01 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:12:01 INFO mapred.JobClient:     Physical memory (bytes) snapshot=301170688
15/03/30 14:12:01 INFO mapred.JobClient:     Reduce output records=149
15/03/30 14:12:01 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1368752128
15/03/30 14:12:01 INFO mapred.JobClient:     Map output records=149
15/03/30 14:12:01 INFO common.HadoopUtil: Deleting output/partial-vectors-0
15/03/30 14:12:01 INFO driver.MahoutDriver: Program took 130017 ms (Minutes: 2.16695)


====================================

root@xin:~# hadoop fs -ls output
Warning: $HADOOP_HOME is deprecated.

Found 7 items
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/df-count
-rw-r--r--   1 root supergroup      48768 2015-03-30 14:10 /user/root/output/dictionary.file-0
-rw-r--r--   1 root supergroup      51433 2015-03-30 14:11 /user/root/output/frequency.file-0
drwxr-xr-x   - root supergroup          0 2015-03-30 14:11 /user/root/output/tf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:12 /user/root/output/tfidf-vectors
drwxr-xr-x   - root supergroup          0 2015-03-30 14:09 /user/root/output/tokenized-documents
drwxr-xr-x   - root supergroup          0 2015-03-30 14:10 /user/root/output/wordcount
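
The listing contains both a tf-vectors and a tfidf-vectors directory; the kmeans command in this transcript reads output/tf-vectors. As a rough illustration of how the two weightings differ, here is a minimal sketch using the textbook formula tf × ln(N / df). This is an assumption for illustration: the exact weighting Mahout applies may differ, and the class name `TfIdfSketch` and the sample tokens are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {

    // Term frequency: raw count of each term in one tokenized document.
    static Map<String, Integer> termFrequency(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Document frequency: number of documents that contain each term.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String t : new HashSet<>(doc)) {
                df.merge(t, 1, Integer::sum);
            }
        }
        return df;
    }

    // Textbook TF-IDF weight (assumed formula, not necessarily Mahout's).
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // Hypothetical tokenized documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("rock", "rock", "love"),
                Arrays.asList("love", "ballad"));
        Map<String, Integer> df = documentFrequency(docs);
        Map<String, Integer> tf = termFrequency(docs.get(0));
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double w = tfIdf(e.getValue(), df.get(e.getKey()), docs.size());
            System.out.println(e.getKey() + " tf=" + e.getValue() + " tfidf=" + w);
        }
    }
}
```

A term that occurs in every document gets weight 0 under this formula, so it no longer influences the distance between documents, whereas in plain TF vectors it still does.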


==========================================



root@xin:~# mahout kmeans -i output/tf-vectors -c output/kmeans-clusters -o output/kmeans -k 10 -x 200 -ow --clustering
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:17:04 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[output/kmeans-clusters], --convergenceDelta=[0.5], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[output/tf-vectors], --maxIter=[200], --method=[mapreduce], --numClusters=[10], --output=[output/kmeans], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
15/03/30 14:17:04 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/30 14:17:04 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/30 14:17:04 INFO compress.CodecPool: Got brand-new compressor
15/03/30 14:17:04 INFO kmeans.RandomSeedGenerator: Wrote 10 Klusters to output/kmeans-clusters/part-randomSeed
15/03/30 14:17:04 INFO kmeans.KMeansDriver: Input: output/tf-vectors Clusters In: output/kmeans-clusters/part-randomSeed Out: output/kmeans
15/03/30 14:17:04 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 200
15/03/30 14:17:04 INFO compress.CodecPool: Got brand-new decompressor
15/03/30 14:17:05 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:05 INFO mapred.JobClient: Running job: job_201503301351_0010
15/03/30 14:17:06 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:11 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:17:18 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:17:19 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:17:20 INFO mapred.JobClient: Job complete: job_201503301351_0010
15/03/30 14:17:20 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:20 INFO mapred.JobClient:   Job Counters 
15/03/30 14:17:20 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:17:20 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4154
15/03/30 14:17:20 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:20 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:20 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:17:20 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:17:20 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8558
15/03/30 14:17:20 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:17:20 INFO mapred.JobClient:     Bytes Written=64996
15/03/30 14:17:20 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:17:20 INFO mapred.JobClient:     FILE_BYTES_READ=70419
15/03/30 14:17:20 INFO mapred.JobClient:     HDFS_BYTES_READ=96550
15/03/30 14:17:20 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250490
15/03/30 14:17:20 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=64996
15/03/30 14:17:20 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:17:20 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:17:20 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:17:20 INFO mapred.JobClient:     Map output materialized bytes=70419
15/03/30 14:17:20 INFO mapred.JobClient:     Map input records=149
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce shuffle bytes=70419
15/03/30 14:17:20 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:17:20 INFO mapred.JobClient:     Map output bytes=70373
15/03/30 14:17:20 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:17:20 INFO mapred.JobClient:     CPU time spent (ms)=2950
15/03/30 14:17:20 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:17:20 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:17:20 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:17:20 INFO mapred.JobClient:     Physical memory (bytes) snapshot=306675712
15/03/30 14:17:20 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:17:20 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1367547904
15/03/30 14:17:20 INFO mapred.JobClient:     Map output records=10
15/03/30 14:17:20 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:20 INFO mapred.JobClient: Running job: job_201503301351_0011
15/03/30 14:17:21 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:26 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:17:34 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:17:36 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:17:36 INFO mapred.JobClient: Job complete: job_201503301351_0011
15/03/30 14:17:36 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:36 INFO mapred.JobClient:   Job Counters 
15/03/30 14:17:36 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:17:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4041
15/03/30 14:17:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:36 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:17:36 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:17:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8708
15/03/30 14:17:36 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:17:36 INFO mapred.JobClient:     Bytes Written=64018
15/03/30 14:17:36 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:17:36 INFO mapred.JobClient:     FILE_BYTES_READ=128966
15/03/30 14:17:36 INFO mapred.JobClient:     HDFS_BYTES_READ=200872
15/03/30 14:17:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=367584
15/03/30 14:17:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=64018
15/03/30 14:17:36 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:17:36 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:17:36 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:17:36 INFO mapred.JobClient:     Map output materialized bytes=128966
15/03/30 14:17:36 INFO mapred.JobClient:     Map input records=149
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce shuffle bytes=128966
15/03/30 14:17:36 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:17:36 INFO mapred.JobClient:     Map output bytes=128919
15/03/30 14:17:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:17:36 INFO mapred.JobClient:     CPU time spent (ms)=3050
15/03/30 14:17:36 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:17:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:17:36 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:17:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=301654016
15/03/30 14:17:36 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:17:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1368375296
15/03/30 14:17:36 INFO mapred.JobClient:     Map output records=10
15/03/30 14:17:36 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:36 INFO mapred.JobClient: Running job: job_201503301351_0012
15/03/30 14:17:37 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:42 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:17:49 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:17:50 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:17:51 INFO mapred.JobClient: Job complete: job_201503301351_0012
15/03/30 14:17:51 INFO mapred.JobClient: Counters: 29
15/03/30 14:17:51 INFO mapred.JobClient:   Job Counters 
15/03/30 14:17:51 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:17:51 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4081
15/03/30 14:17:51 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:17:51 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:17:51 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:17:51 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:17:51 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8601
15/03/30 14:17:51 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:17:51 INFO mapred.JobClient:     Bytes Written=61455
15/03/30 14:17:51 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:17:51 INFO mapred.JobClient:     FILE_BYTES_READ=125434
15/03/30 14:17:51 INFO mapred.JobClient:     HDFS_BYTES_READ=198916
15/03/30 14:17:51 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=360520
15/03/30 14:17:51 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=61455
15/03/30 14:17:51 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:17:51 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:17:51 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:17:51 INFO mapred.JobClient:     Map output materialized bytes=125434
15/03/30 14:17:51 INFO mapred.JobClient:     Map input records=149
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce shuffle bytes=125434
15/03/30 14:17:51 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:17:51 INFO mapred.JobClient:     Map output bytes=125387
15/03/30 14:17:51 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:17:51 INFO mapred.JobClient:     CPU time spent (ms)=2850
15/03/30 14:17:51 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:17:51 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:17:51 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:17:51 INFO mapred.JobClient:     Physical memory (bytes) snapshot=298000384
15/03/30 14:17:51 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:17:51 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1369067520
15/03/30 14:17:51 INFO mapred.JobClient:     Map output records=10
15/03/30 14:17:51 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:17:51 INFO mapred.JobClient: Running job: job_201503301351_0013
15/03/30 14:17:52 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:17:57 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:18:04 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:18:06 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:18:06 INFO mapred.JobClient: Job complete: job_201503301351_0013
15/03/30 14:18:06 INFO mapred.JobClient: Counters: 29
15/03/30 14:18:06 INFO mapred.JobClient:   Job Counters 
15/03/30 14:18:06 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:18:06 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4191
15/03/30 14:18:06 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:06 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:06 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:18:06 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:18:06 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8661
15/03/30 14:18:06 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:18:06 INFO mapred.JobClient:     Bytes Written=61248
15/03/30 14:18:06 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:18:06 INFO mapred.JobClient:     FILE_BYTES_READ=121841
15/03/30 14:18:06 INFO mapred.JobClient:     HDFS_BYTES_READ=193790
15/03/30 14:18:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=353334
15/03/30 14:18:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=61248
15/03/30 14:18:06 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:18:06 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:18:06 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:18:06 INFO mapred.JobClient:     Map output materialized bytes=121841
15/03/30 14:18:06 INFO mapred.JobClient:     Map input records=149
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce shuffle bytes=121841
15/03/30 14:18:06 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:18:06 INFO mapred.JobClient:     Map output bytes=121794
15/03/30 14:18:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:18:06 INFO mapred.JobClient:     CPU time spent (ms)=3380
15/03/30 14:18:06 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:18:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:18:06 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:18:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=306253824
15/03/30 14:18:06 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:18:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1372364800
15/03/30 14:18:06 INFO mapred.JobClient:     Map output records=10
15/03/30 14:18:06 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:18:06 INFO mapred.JobClient: Running job: job_201503301351_0014
15/03/30 14:18:07 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:18:12 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:18:19 INFO mapred.JobClient:  map 100% reduce 33%
15/03/30 14:18:21 INFO mapred.JobClient:  map 100% reduce 100%
15/03/30 14:18:21 INFO mapred.JobClient: Job complete: job_201503301351_0014
15/03/30 14:18:21 INFO mapred.JobClient: Counters: 29
15/03/30 14:18:21 INFO mapred.JobClient:   Job Counters 
15/03/30 14:18:21 INFO mapred.JobClient:     Launched reduce tasks=1
15/03/30 14:18:21 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4242
15/03/30 14:18:21 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:21 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:21 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:18:21 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:18:21 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8624
15/03/30 14:18:21 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:18:21 INFO mapred.JobClient:     Bytes Written=61248
15/03/30 14:18:21 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:18:21 INFO mapred.JobClient:     FILE_BYTES_READ=121634
15/03/30 14:18:21 INFO mapred.JobClient:     HDFS_BYTES_READ=193376
15/03/30 14:18:21 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=352920
15/03/30 14:18:21 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=61248
15/03/30 14:18:21 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:18:21 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:18:21 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:18:21 INFO mapred.JobClient:     Map output materialized bytes=121634
15/03/30 14:18:21 INFO mapred.JobClient:     Map input records=149
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce shuffle bytes=121634
15/03/30 14:18:21 INFO mapred.JobClient:     Spilled Records=20
15/03/30 14:18:21 INFO mapred.JobClient:     Map output bytes=121587
15/03/30 14:18:21 INFO mapred.JobClient:     Total committed heap usage (bytes)=296222720
15/03/30 14:18:21 INFO mapred.JobClient:     CPU time spent (ms)=3060
15/03/30 14:18:21 INFO mapred.JobClient:     Combine input records=0
15/03/30 14:18:21 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce input records=10
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce input groups=10
15/03/30 14:18:21 INFO mapred.JobClient:     Combine output records=0
15/03/30 14:18:21 INFO mapred.JobClient:     Physical memory (bytes) snapshot=295936000
15/03/30 14:18:21 INFO mapred.JobClient:     Reduce output records=10
15/03/30 14:18:21 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1362153472
15/03/30 14:18:21 INFO mapred.JobClient:     Map output records=10
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Clustering data
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Running Clustering
15/03/30 14:18:21 INFO kmeans.KMeansDriver: Input: output/tf-vectors Clusters In: output/kmeans Out: output/kmeans
15/03/30 14:18:22 INFO input.FileInputFormat: Total input paths to process : 1
15/03/30 14:18:22 INFO mapred.JobClient: Running job: job_201503301351_0015
15/03/30 14:18:23 INFO mapred.JobClient:  map 0% reduce 0%
15/03/30 14:18:29 INFO mapred.JobClient:  map 100% reduce 0%
15/03/30 14:18:30 INFO mapred.JobClient: Job complete: job_201503301351_0015
15/03/30 14:18:30 INFO mapred.JobClient: Counters: 19
15/03/30 14:18:30 INFO mapred.JobClient:   Job Counters 
15/03/30 14:18:30 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5264
15/03/30 14:18:30 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/03/30 14:18:30 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/03/30 14:18:30 INFO mapred.JobClient:     Launched map tasks=1
15/03/30 14:18:30 INFO mapred.JobClient:     Data-local map tasks=1
15/03/30 14:18:30 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/03/30 14:18:30 INFO mapred.JobClient:   File Output Format Counters 
15/03/30 14:18:30 INFO mapred.JobClient:     Bytes Written=75851
15/03/30 14:18:30 INFO mapred.JobClient:   FileSystemCounters
15/03/30 14:18:30 INFO mapred.JobClient:     HDFS_BYTES_READ=131934
15/03/30 14:18:30 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54540
15/03/30 14:18:30 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=75851
15/03/30 14:18:30 INFO mapred.JobClient:   File Input Format Counters 
15/03/30 14:18:30 INFO mapred.JobClient:     Bytes Read=70371
15/03/30 14:18:30 INFO mapred.JobClient:   Map-Reduce Framework
15/03/30 14:18:30 INFO mapred.JobClient:     Map input records=149
15/03/30 14:18:30 INFO mapred.JobClient:     Physical memory (bytes) snapshot=113307648
15/03/30 14:18:30 INFO mapred.JobClient:     Spilled Records=0
15/03/30 14:18:30 INFO mapred.JobClient:     CPU time spent (ms)=1620
15/03/30 14:18:30 INFO mapred.JobClient:     Total committed heap usage (bytes)=120061952
15/03/30 14:18:30 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=680222720
15/03/30 14:18:30 INFO mapred.JobClient:     Map output records=149
15/03/30 14:18:30 INFO mapred.JobClient:     SPLIT_RAW_BYTES=121
15/03/30 14:18:30 INFO driver.MahoutDriver: Program took 86159 ms (Minutes: 1.4359833333333334)


======================================

root@xin:~# hadoop fs -ls output/kmeans
Warning: $HADOOP_HOME is deprecated.

Found 8 items
-rw-r--r--   1 root supergroup        194 2015-03-30 14:18 /user/root/output/kmeans/_policy
drwxr-xr-x   - root supergroup          0 2015-03-30 14:18 /user/root/output/kmeans/clusteredPoints
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-0
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-1
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-2
drwxr-xr-x   - root supergroup          0 2015-03-30 14:17 /user/root/output/kmeans/clusters-3
drwxr-xr-x   - root supergroup          0 2015-03-30 14:18 /user/root/output/kmeans/clusters-4
drwxr-xr-x   - root supergroup          0 2015-03-30 14:18 /user/root/output/kmeans/clusters-5-final
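
The listing ends at clusters-5-final, so the run stopped after five iterations, far below the limit of 200 set with -x, presumably because every centroid had moved by no more than the convergence delta (0.5 in the log). The loop can be sketched on a single machine as follows. This is not Mahout's MapReduce implementation: the class name `KMeansSketch` and the sample points are hypothetical, and comparing the squared centroid shift with the delta is an assumption about how the threshold is applied. The distance is squared Euclidean, the measure named in the log.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the point (ties go to the lower index).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Updates centroids in place; returns the number of iterations performed.
    static int cluster(double[][] points, double[][] centroids, double delta, int maxIter) {
        int dim = points[0].length;
        for (int iter = 1; iter <= maxIter; iter++) {
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            // Assignment step: add each point to its nearest centroid's running sum.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each centroid to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double[] updated = new double[dim];
                for (int i = 0; i < dim; i++) {
                    updated[i] = sums[c][i] / counts[c];
                }
                if (squaredDistance(updated, centroids[c]) > delta) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            if (converged) {
                return iter;
            }
        }
        return maxIter;
    }

    public static void main(String[] args) {
        // Hypothetical 2-D points forming two obvious groups.
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {9, 9}, {9, 10}, {10, 9}};
        double[][] centroids = {{1, 1}, {2, 1}};
        int iterations = cluster(points, centroids, 0.5, 200);
        System.out.println("iterations=" + iterations
                + " centroids=" + Arrays.deepToString(centroids));
    }
}
```

As in the Mahout run, the iteration limit only acts as a safety net here: the loop exits as soon as no centroid moves by more than the delta.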


======================================

root@xin:~# hadoop fs -get output/kmeans/* /usr/song-kmeans/
Warning: $HADOOP_HOME is deprecated.

root@xin:~# hadoop fs -get output/dictionary.file-0 /usr/song-kmeans
Warning: $HADOOP_HOME is deprecated.

root@xin:~# mahout clusterdump -i file:///usr/song-kmeans/clusters-5-final  -d file:///usr/song-kmeans/dictionary.file-0 -dt sequencefile -o /usr/song-result/result  -n 20
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:34:08 INFO common.AbstractJob: Command line arguments: {--dictionary=[file:///usr/song-kmeans/dictionary.file-0], --dictionaryType=[sequencefile], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[file:///usr/song-kmeans/clusters-5-final], --numWords=[20], --output=[/usr/song-result/result], --outputFormat=[TEXT], --startPhase=[0], --tempDir=[temp]}
15/03/30 14:34:09 INFO clustering.ClusterDumper: Wrote 10 clusters
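15/03/30 14:34:09 INFO driver.MahoutDriver: Program took 716 ms (Minutes: 0.011933333333333334)


Note: the -o argument of clusterdump must name a file. If it points at an existing directory such as /usr/song-result, the command fails with the following exception:
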
Exception in thread "main" java.io.FileNotFoundException: /usr/song-result (Is a directory)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
	at com.google.common.io.Files.newWriter(Files.java:103)
	at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:187)
	at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:157)
	at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:101)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
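
The trace above bottoms out in `java.io.FileOutputStream`, which throws `FileNotFoundException` when the path it is given is an existing directory rather than a file. The behaviour can be reproduced in isolation with a minimal sketch; the class name `OutputPathCheck` and the temporary directory are hypothetical, standing in for /usr/song-result and /usr/song-result/result.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class OutputPathCheck {

    // Returns true if the path can be opened for writing as a regular file.
    static boolean canOpenForWriting(File target) throws IOException {
        try (FileOutputStream out = new FileOutputStream(target)) {
            return true;
        } catch (FileNotFoundException e) {
            // Raised when target is an existing directory, as in the trace above.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary directory playing the role of the output directory.
        File dir = Files.createTempDirectory("song-result").toFile();
        // File inside it playing the role of the output file.
        File file = new File(dir, "result");
        System.out.println("directory opened as file: " + canOpenForWriting(dir));
        System.out.println("file inside directory opened: " + canOpenForWriting(file));
        file.delete();
        dir.delete();
    }
}
```

This matches the working clusterdump invocation earlier in the transcript, which passes -o /usr/song-result/result, a file inside the directory.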

===========================================



root@xin:~# mahout seqdumper -i file:///usr/song-kmeans/clusteredPoints  -o /usr/song-result/all 
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /usr/local/hadoop-1.1.2/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

15/03/30 14:44:28 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[file:///usr/song-kmeans/clusteredPoints], --output=[/usr/song-result/all], --startPhase=[0], --tempDir=[temp]}
15/03/30 14:44:29 INFO driver.MahoutDriver: Program took 634 ms (Minutes: 0.010566666666666667)

