Feature Extractors 2: Word2Vec
Introduction
Word2Vec is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector by averaging the vectors of all the words in the document; this vector can then be used as a feature vector for prediction, document similarity calculations, and so on. Refer to the MLlib user guide on Word2Vec for more details.
Code
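import org.apache.spark.ml.feature.Word2Vec

// Assumes a SparkSession named `spark`, as provided by spark-shell.
// Input data: each row is the list of words from one sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)   // dimension of each word vector
  .setMinCount(0)     // keep every word, however rare

val model = word2Vec.fit(documentDF)

// Each document becomes the average of its word vectors.
val result = model.transform(documentDF)
result.show(false)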
+------------------------------------------+-----------------------------------------------------------------+
|text                                      |result                                                           |
+------------------------------------------+-----------------------------------------------------------------+
|[Hi, I, heard, about, Spark]              |[-0.028139343485236168,0.04554025698453188,-0.013317196490243079]|
|[I, wish, Java, could, use, case, classes]|[0.06872416580361979,-0.02604914902310286,0.02165239889706884]   |
|[Logistic, regression, models, are, neat] |[0.023467857390642166,0.027799883112311366,0.0331136979162693]   |
+------------------------------------------+-----------------------------------------------------------------+
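Since the model maps each word to its own vector, it can be useful to look at those per-word vectors directly rather than only at the document averages. The following is a minimal sketch, assuming a local Spark installation; the object name InspectWord2Vec and the helper train are illustrative. getVectors and findSynonyms are methods of Word2VecModel in spark.ml.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SparkSession

object InspectWord2Vec {
  // Train the same tiny model as in the example above.
  def train(spark: SparkSession): Word2VecModel = {
    val documentDF = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(documentDF)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("InspectWord2Vec")
      .getOrCreate()

    val model = train(spark)

    // One row per vocabulary word, with columns "word" and "vector".
    model.getVectors.show(false)

    // The 2 words closest to "Spark" by cosine similarity
    // (columns "word" and "similarity").
    model.findSynonyms("Spark", 2).show(false)

    spark.stop()
  }
}
```

With a corpus this small the similarities carry little meaning; the point is only to show where the per-word vectors live.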
King, queen, man, woman
+-------+----------------------------------------------------------------+
|text   |result                                                          |
+-------+----------------------------------------------------------------+
|[king] |[0.11598825454711914,0.10368897765874863,-0.06978098303079605]  |
|[queen]|[-0.019123395904898643,-0.13107778131961823,0.14307855069637299]|
|[man]  |[-0.12674053013324738,0.09846510738134384,-0.10375533252954483] |
|[women]|[-0.1633371263742447,-0.14517612755298615,0.11354436725378036]  |
+-------+----------------------------------------------------------------+
C(king) − C(queen) ≈ C(man) − C(woman), where C(w) is the vector of word w
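One rough way to see how closely the single-word vectors in the table above follow this relation is to compare the two offset vectors by cosine similarity. The sketch below is plain Scala with the values copied from the table (which lists "women" rather than "woman"); the object and helper names are illustrative and not part of Spark. With a three-dimensional model trained on so little text, the printed value should not be read as evidence for or against the relation.

```scala
object AnalogyCheck {
  type Vec = Seq[Double]

  // Element-wise difference a - b.
  def minus(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  // Dot product of two equally sized vectors.
  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // Cosine of the angle between a and b (1.0 means same direction).
  def cosine(a: Vec, b: Vec): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Values copied from the table above.
  val king: Vec  = Seq(0.11598825454711914, 0.10368897765874863, -0.06978098303079605)
  val queen: Vec = Seq(-0.019123395904898643, -0.13107778131961823, 0.14307855069637299)
  val man: Vec   = Seq(-0.12674053013324738, 0.09846510738134384, -0.10375533252954483)
  val women: Vec = Seq(-0.1633371263742447, -0.14517612755298615, 0.11354436725378036)

  // Similarity between C(king) - C(queen) and C(man) - C(women).
  def offsetSimilarity: Double = cosine(minus(king, queen), minus(man, women))

  def main(args: Array[String]): Unit =
    println(f"cosine similarity of the two offsets: $offsetSimilarity%.4f")
}
```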
Averaging
+---------------+---------------------------------------------------------------+
|text           |result                                                         |
+---------------+---------------------------------------------------------------+
|[I, like, java]|[-0.10301876937349637,-0.05928801248470942,0.05098938445250193]|
|[I]            |[-0.01903184875845909,-0.13106320798397064,0.14307551085948944]|
|[like]         |[-0.16329725086688995,-0.14520978927612305,0.11358748376369476]|
|[java]         |[-0.12672720849514008,0.09840895980596542,-0.1036948412656784] |
+---------------+---------------------------------------------------------------+
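The table above can be checked by hand: the vector for [I, like, java] should be the element-wise mean of the vectors for [I], [like], and [java]. The following plain-Scala sketch does that arithmetic with the values copied from the table; the object and helper names are illustrative and not part of Spark.

```scala
object AverageCheck {
  // Element-wise mean of equally sized vectors.
  def mean(vectors: Seq[Seq[Double]]): Seq[Double] = {
    require(vectors.nonEmpty, "need at least one vector")
    val n = vectors.size.toDouble
    vectors.transpose.map(_.sum / n)
  }

  // Largest absolute per-component difference between two vectors.
  def maxAbsDiff(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max

  // Single-word rows copied from the table above: [I], [like], [java].
  val wordVectors: Seq[Seq[Double]] = Seq(
    Seq(-0.01903184875845909, -0.13106320798397064, 0.14307551085948944),
    Seq(-0.16329725086688995, -0.14520978927612305, 0.11358748376369476),
    Seq(-0.12672720849514008, 0.09840895980596542, -0.1036948412656784)
  )

  // The [I, like, java] row copied from the table above.
  val documentVector: Seq[Double] =
    Seq(-0.10301876937349637, -0.05928801248470942, 0.05098938445250193)

  def main(args: Array[String]): Unit = {
    val averaged = mean(wordVectors)
    println(s"mean of word vectors: ${averaged.mkString("[", ",", "]")}")
    println(s"max abs difference:   ${maxAbsDiff(averaged, documentVector)}")
  }
}
```

The difference between the hand-computed mean and the document vector is far below 1e-6, which matches the description of transform as an average over word vectors.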