LSH Algorithms 1: Bucketed Random Projection for Euclidean Distance

Principle

https://blog.csdn.net/u011955252/article/details/50503408
LSH hashes the data so that similar points land in the same bucket with high probability. Instead of comparing a query against the entire massive dataset, we only need to compare it against the small number of points in the matching buckets.
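
As a rough illustration (not from the original post), the bucketed random projection family hashes a vector x to h(x) = floor(x . v / r), where v is a random unit vector and r is the bucket length. A minimal sketch in plain Scala; randomUnitVector and brpHash are hypothetical helper names, and Spark's own implementation differs in details:

import scala.util.Random

// Draw a random direction from a Gaussian and normalize it to unit length.
def randomUnitVector(dim: Int, rng: Random): Array[Double] = {
  val g = Array.fill(dim)(rng.nextGaussian())
  val norm = math.sqrt(g.map(x => x * x).sum)
  g.map(_ / norm)
}

// h(x) = floor(x . v / bucketLength): points whose projections onto v are close
// end up in the same integer bucket.
def brpHash(x: Array[Double], v: Array[Double], bucketLength: Double): Long = {
  val projection = x.zip(v).map { case (a, b) => a * b }.sum
  math.floor(projection / bucketLength).toLong
}

val rng = new Random(42)
val v   = randomUnitVector(2, rng)
println(brpHash(Array(-1.0, 1.0), v, 10.0))  // nearby points...
println(brpHash(Array(-1.0, 1.1), v, 10.0))  // ...usually land in the same bucket
println(brpHash(Array(55.0, 33.0), v, 10.0)) // a far-away point usually lands elsewhere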

Code

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0)),
  (4, Vectors.dense(-1.0, 1.1)),
  (5, Vectors.dense(-1.0, 2.0)),
  (6, Vectors.dense(2.0, 22.0)),
  (7, Vectors.dense(11.0, 44.0)),
  (8, Vectors.dense(55.0, 33.0)),
  (9, Vectors.dense(33.0, 22.0))
)).toDF("id", "features")

// Query point for the nearest-neighbor search
val key = Vectors.dense(-1.0, 1.0)

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(10)
  .setNumHashTables(4)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(dfA)

// Feature transformation: append the LSH hash values for each row
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show(false)

// Compute the locality sensitive hashes for the input rows, then perform approximate
// nearest neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 4)`
println("Approximately searching dfA for 4 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 4).show(false)

Output of the two show(false) calls:
+---+-----------+-------------------------------+
|id |features   |hashes                         |
+---+-----------+-------------------------------+
|0  |[1.0,1.0]  |[[0.0], [0.0], [-1.0], [-1.0]] |
|1  |[1.0,-1.0] |[[-1.0], [-1.0], [0.0], [-1.0]]|
|2  |[-1.0,-1.0]|[[-1.0], [-1.0], [0.0], [0.0]] |
|3  |[-1.0,1.0] |[[0.0], [0.0], [-1.0], [0.0]]  |
|4  |[-1.0,1.1] |[[0.0], [0.0], [-1.0], [0.0]]  |
|5  |[-1.0,2.0] |[[0.0], [0.0], [-1.0], [0.0]]  |
|6  |[2.0,22.0] |[[2.0], [1.0], [-2.0], [-1.0]] |
|7  |[11.0,44.0]|[[4.0], [3.0], [-5.0], [-2.0]] |
|8  |[55.0,33.0]|[[4.0], [0.0], [-6.0], [-6.0]] |
|9  |[33.0,22.0]|[[2.0], [0.0], [-4.0], [-4.0]] |
+---+-----------+-------------------------------+
+---+----------+------------------------------+-------------------+
|id |features  |hashes                        |distCol            |
+---+----------+------------------------------+-------------------+
|3  |[-1.0,1.0]|[[0.0], [0.0], [-1.0], [0.0]] |0.0                |
|4  |[-1.0,1.1]|[[0.0], [0.0], [-1.0], [0.0]] |0.10000000000000009|
|5  |[-1.0,2.0]|[[0.0], [0.0], [-1.0], [0.0]] |1.0                |
|0  |[1.0,1.0] |[[0.0], [0.0], [-1.0], [-1.0]]|2.0                |
+---+----------+------------------------------+-------------------+

setBucketLength sets the width of each hash bucket: the larger the value, the more likely nearby points fall into the same bucket (at the cost of more candidates to check per bucket).
setNumHashTables sets how many hash tables (independent hash functions) are used to hash the data; more tables reduce the chance of missing true neighbors but increase computation.
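
To see the effect of the bucket length, one can refit with narrower buckets; a hedged sketch (values below are not from the original post), where near-duplicate points become less likely to collide in each table:

val brpNarrow = new BucketedRandomProjectionLSH()
  .setBucketLength(1.0)      // narrower buckets: projections are split more finely
  .setNumHashTables(4)
  .setInputCol("features")
  .setOutputCol("hashes")

brpNarrow.fit(dfA).transform(dfA).show(false)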

model.approxNearestNeighbors(dfA, key, 4).show(false)
returns the rows of dfA that are approximately closest to the query key; the third argument is the maximum number of neighbors to return, and distCol holds the Euclidean distance to the key.
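
As the comment in the code above notes, the hashes for dfA do not have to be recomputed on every query; a minimal variant of the same call, reusing the names from the example, passes the already-transformed dataset instead:

// Reuse the hashed dataset so approxNearestNeighbors skips re-hashing dfA.
val transformedA = model.transform(dfA)
model.approxNearestNeighbors(transformedA, key, 4).show(false)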
