Principle
https://blog.csdn.net/u011955252/article/details/50503408
LSH hashes the data so that similar points land in the same bucket with high probability, turning a search over massive data into a comparison against a small candidate set.
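The core idea of a bucketed random-projection hash can be sketched as follows. This is a toy illustration, not Spark's internal code: the point is projected onto a random vector (here a hypothetical, hand-picked unit vector) and the projection line is cut into buckets of width `bucketLength`, so nearby points tend to fall into the same bucket.

```scala
// Toy sketch of one bucketed random-projection hash function.
// `v` would normally be a randomly sampled unit vector; we fix one here.
object LshSketch {
  def hash(x: Array[Double], v: Array[Double], bucketLength: Double): Long = {
    // Project x onto v, then assign the bucket index along the projection line.
    val dot = x.zip(v).map { case (a, b) => a * b }.sum
    math.floor(dot / bucketLength).toLong
  }

  def main(args: Array[String]): Unit = {
    val v = Array(0.6, 0.8)        // assumed random unit vector
    val a = Array(-1.0, 1.0)
    val b = Array(-1.0, 1.1)       // close to a
    val c = Array(55.0, 33.0)      // far from a
    println(hash(a, v, 10.0))      // 0 -- nearby points ...
    println(hash(b, v, 10.0))      // 0 -- ... share a bucket
    println(hash(c, v, 10.0))      // 5 -- distant point lands elsewhere
  }
}
```

Only points sharing a bucket in some hash table become candidates, which is why the pairwise comparison shrinks to a small set.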
Code
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val dfA = spark.createDataFrame(Seq(
(0, Vectors.dense(1.0, 1.0)),
(1, Vectors.dense(1.0, -1.0)),
(2, Vectors.dense(-1.0, -1.0)),
(3, Vectors.dense(-1.0, 1.0)),
(4, Vectors.dense(-1.0, 1.1)),
(5, Vectors.dense(-1.0, 2.0)),
(6, Vectors.dense(2.0, 22.0)),
(7, Vectors.dense(11.0, 44.0)),
(8, Vectors.dense(55.0, 33.0)),
(9, Vectors.dense(33.0, 22.0))
)).toDF("id", "features")
val key = Vectors.dense(-1.0, 1.0)
val brp = new BucketedRandomProjectionLSH()
.setBucketLength(10)
.setNumHashTables(4)
.setInputCol("features")
.setOutputCol("hashes")
val model = brp.fit(dfA)
// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show(false)
// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 2)`
println("Approximately searching dfA for the 4 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 4).show(false)
+---+-----------+-------------------------------+
|id |features |hashes |
+---+-----------+-------------------------------+
|0 |[1.0,1.0] |[[0.0], [0.0], [-1.0], [-1.0]] |
|1 |[1.0,-1.0] |[[-1.0], [-1.0], [0.0], [-1.0]]|
|2 |[-1.0,-1.0]|[[-1.0], [-1.0], [0.0], [0.0]] |
|3 |[-1.0,1.0] |[[0.0], [0.0], [-1.0], [0.0]] |
|4 |[-1.0,1.1] |[[0.0], [0.0], [-1.0], [0.0]] |
|5 |[-1.0,2.0] |[[0.0], [0.0], [-1.0], [0.0]] |
|6 |[2.0,22.0] |[[2.0], [1.0], [-2.0], [-1.0]] |
|7 |[11.0,44.0]|[[4.0], [3.0], [-5.0], [-2.0]] |
|8 |[55.0,33.0]|[[4.0], [0.0], [-6.0], [-6.0]] |
|9 |[33.0,22.0]|[[2.0], [0.0], [-4.0], [-4.0]] |
+---+-----------+-------------------------------+
+---+----------+------------------------------+-------------------+
|id |features |hashes |distCol |
+---+----------+------------------------------+-------------------+
|3 |[-1.0,1.0]|[[0.0], [0.0], [-1.0], [0.0]] |0.0 |
|4 |[-1.0,1.1]|[[0.0], [0.0], [-1.0], [0.0]] |0.10000000000000009|
|5 |[-1.0,2.0]|[[0.0], [0.0], [-1.0], [0.0]] |1.0 |
|0 |[1.0,1.0] |[[0.0], [0.0], [-1.0], [-1.0]]|2.0 |
+---+----------+------------------------------+-------------------+
setBucketLength sets the bucket width; the larger the value, the higher the probability that nearby points are hashed into the same bucket (at the cost of more false positives per bucket).
setNumHashTables sets how many hash tables (independent random projections) are used to hash the data; more tables increase recall but also computation.
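Why more tables help can be sketched with a small back-of-the-envelope calculation (the per-table collision rate below is an assumed number, not something Spark reports): a candidate pair survives if it collides in any one table, so the miss probability shrinks geometrically with the table count.

```scala
// Toy OR-amplification calculation for LSH with multiple hash tables.
object Amplification {
  // p = probability that two similar points collide in one table.
  def hitProbability(p: Double, numTables: Int): Double =
    1.0 - math.pow(1.0 - p, numTables)

  def main(args: Array[String]): Unit = {
    val p = 0.5                    // hypothetical per-table collision rate
    for (k <- Seq(1, 2, 4, 8))
      println(f"$k tables -> hit probability ${hitProbability(p, k)}%.4f")
  }
}
```

With p = 0.5, going from 1 table to 4 tables raises the chance of finding a similar pair from 0.5 to about 0.94, which is the trade-off setNumHashTables(4) is making in the example above.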
model.approxNearestNeighbors(dfA, key, 4).show(false)
returns the 4 rows of dfA most similar to the key, i.e. closest in Euclidean distance, as shown in the distCol column of the second table above.