Feature Extractors 2: Word2Vec

Introduction

Word2Vec is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a single vector by averaging the vectors of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.

Code

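    import org.apache.spark.ml.feature.Word2Vec

    // Assumes an active SparkSession named `spark`, as provided by spark-shell.
    // Input data: each row is a bag of words from a sentence or document.
    val documentDF = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    // Learn a mapping from words to Vectors.
    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)  // dimension of each word vector
      .setMinCount(0)    // keep every word, even those that appear only once
    val model = word2Vec.fit(documentDF)

    // Each document becomes the average of its word vectors.
    val result = model.transform(documentDF)

    // Print full column contents without truncation.
    result.show(false)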
+------------------------------------------+-----------------------------------------------------------------+
|text                                      |result                                                           |
+------------------------------------------+-----------------------------------------------------------------+
|[Hi, I, heard, about, Spark]              |[-0.028139343485236168,0.04554025698453188,-0.013317196490243079]|
|[I, wish, Java, could, use, case, classes]|[0.06872416580361979,-0.02604914902310286,0.02165239889706884]   |
|[Logistic, regression, models, are, neat] |[0.023467857390642166,0.027799883112311366,0.0331136979162693]   |
+------------------------------------------+-----------------------------------------------------------------+

King, queen, man, woman

+-------+----------------------------------------------------------------+
|text   |result                                                          |
+-------+----------------------------------------------------------------+
|[king] |[0.11598825454711914,0.10368897765874863,-0.06978098303079605]  |
|[queen]|[-0.019123395904898643,-0.13107778131961823,0.14307855069637299]|
|[man]  |[-0.12674053013324738,0.09846510738134384,-0.10375533252954483] |
|[women]|[-0.1633371263742447,-0.14517612755298615,0.11354436725378036]  |
+-------+----------------------------------------------------------------+

C(king)−C(queen)≈C(man)−C(woman)
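
One way to check how closely this relation holds for the vectors printed above is to compare the two difference vectors by cosine similarity. The sketch below is plain Scala with no Spark dependency; the object name `AnalogyCheck` and its helper functions are made up for illustration, and the fourth vector is the one labelled `women` in the table. With only three dimensions the comparison is illustrative rather than conclusive.

```scala
object AnalogyCheck {
  type Vec = Array[Double]

  // Element-wise difference a - b.
  def minus(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  // Dot product of two vectors of equal length.
  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // Cosine similarity: 1.0 means the vectors point in the same direction.
  def cosine(a: Vec, b: Vec): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Vectors copied from the table above.
  val king: Vec  = Array(0.11598825454711914, 0.10368897765874863, -0.06978098303079605)
  val queen: Vec = Array(-0.019123395904898643, -0.13107778131961823, 0.14307855069637299)
  val man: Vec   = Array(-0.12674053013324738, 0.09846510738134384, -0.10375533252954483)
  val women: Vec = Array(-0.1633371263742447, -0.14517612755298615, 0.11354436725378036)

  // Similarity between C(king) - C(queen) and C(man) - C(women).
  val similarity: Double = cosine(minus(king, queen), minus(man, women))

  def main(args: Array[String]): Unit =
    println(f"cosine similarity of the two differences: $similarity%.4f")
}
```

A value close to 1 means the two differences point in nearly the same direction, which is what the approximate equality above expresses.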

Average

+---------------+---------------------------------------------------------------+
|text           |result                                                         |
+---------------+---------------------------------------------------------------+
|[I, like, java]|[-0.10301876937349637,-0.05928801248470942,0.05098938445250193]|
|[I]            |[-0.01903184875845909,-0.13106320798397064,0.14307551085948944]|
|[like]         |[-0.16329725086688995,-0.14520978927612305,0.11358748376369476]|
|[java]         |[-0.12672720849514008,0.09840895980596542,-0.1036948412656784] |
+---------------+---------------------------------------------------------------+