Spark MLlib LinearRegression: Linear Regression Algorithm Source Code Analysis

Linear Regression

  • Simple (univariate) linear regression
    • $h_\theta(x) = \theta_0 + \theta_1 x$  (1)
  • Multivariate linear regression
    • $h_\theta(x) = \sum_{i=1}^{m}\theta_i x_i = \theta^T X$  (2)
  • Loss function

    • $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$  (3)
    • The factor 1/2 makes the coefficient 1 after differentiation; the square is taken of the difference between the predicted and true values
    • Our goal is to minimize this function (a small numeric sketch follows this list)
  • The closed-form least squares solution is demanding: inverting the matrix is slow and requires X to be full rank
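  • As a quick numeric check of equations (1)–(3), here is a minimal Scala sketch (plain arrays and a tiny hypothetical dataset, not MLlib code):

    object LossDemo {
      // h_theta(x) = theta^T x, where x(0) = 1.0 carries the intercept theta_0
      def hypothesis(theta: Array[Double], x: Array[Double]): Double =
        theta.zip(x).map { case (t, xi) => t * xi }.sum

      // J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2, equation (3)
      def loss(theta: Array[Double], xs: Array[Array[Double]], ys: Array[Double]): Double =
        xs.zip(ys).map { case (x, y) => math.pow(hypothesis(theta, x) - y, 2) }.sum / 2.0

      def main(args: Array[String]): Unit = {
        val xs = Array(Array(1.0, 1.0), Array(1.0, 2.0), Array(1.0, 3.0)) // leading 1.0 is the bias column
        val ys = Array(2.0, 3.0, 4.0)                                     // generated by y = 1 + x
        println(loss(Array(1.0, 1.0), xs, ys)) // 0.0 at the true parameters
        println(loss(Array(0.0, 1.0), xs, ys)) // 1.5 away from them
      }
    }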

Gradient Descent

  • Gradient descent
    • Our goal is to find the minimum of $J(\theta)$
    • The gradient direction is given by the partial derivatives $\frac{\partial J(\theta)}{\partial \theta}$; since we are seeking a minimum, we move opposite to the gradient. If the function is convex on the interval, this means that when updating $\theta$, $\theta$ increases where the gradient is negative
    • $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$  (4)
    • where $\alpha$ is the learning rate: too large and it may overshoot the minimum, too small and it needs many iterations and converges slowly
    • With a single training sample:
    • $\frac{\partial}{\partial \theta_j} J(\theta) = \left(h_\theta(x) - y\right)\frac{\partial}{\partial \theta_j}\left(\sum_{i=0}^{m}\theta_i x_i - y\right) = \left(h_\theta(x) - y\right)x_j$  (5)
    • With more than one sample, substituting (5) into the partial derivative of (3), each parameter $\theta_j$ moves along the gradient direction as follows:
    • $\theta_j := \theta_j + \alpha\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$  (6)
    • Here m denotes the total number of samples
  • Stochastic gradient descent
    • Batch gradient descent processes every sample before updating the weights, so each iteration costs O(mn); when the sample size is large we use stochastic gradient descent instead
    • $\theta^T$ is updated after each individual sample is read
    • Although faster, it oscillates around the minimum and may never fully converge
    • To limit the computation, a threshold (convergence tolerance) is set on how much $\theta^T$ changes between iterations (both update rules are sketched below)
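  • The two update strategies can be contrasted in a small, self-contained Scala sketch (plain arrays and hypothetical data; the real MLlib implementation is analyzed below):

    object GradientDescentDemo {
      // one pass of batch gradient descent, equation (6):
      // theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij
      def batchStep(theta: Array[Double], xs: Array[Array[Double]], ys: Array[Double],
                    alpha: Double): Array[Double] = {
        val grad = Array.fill(theta.length)(0.0)
        for ((x, y) <- xs.zip(ys)) {
          val err = y - x.zip(theta).map { case (xi, t) => xi * t }.sum
          for (j <- theta.indices) grad(j) += err * x(j)
        }
        theta.zip(grad).map { case (t, g) => t + alpha * g }
      }

      // stochastic gradient descent updates theta after every single sample instead
      def sgdStep(theta: Array[Double], x: Array[Double], y: Double, alpha: Double): Array[Double] = {
        val err = y - x.zip(theta).map { case (xi, t) => xi * t }.sum
        theta.zip(x).map { case (t, xi) => t + alpha * err * xi }
      }

      def main(args: Array[String]): Unit = {
        val xs = Array(Array(1.0, 1.0), Array(1.0, 2.0), Array(1.0, 3.0))
        val ys = Array(2.0, 3.0, 4.0) // y = 1 + x
        var theta = Array(0.0, 0.0)
        for (_ <- 1 to 500) theta = batchStep(theta, xs, ys, alpha = 0.05)
        println(theta.mkString(", ")) // close to 1.0, 1.0
      }
    }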

Source Code Analysis

  • MLlib source code analysis

    • Constructing the linear regression model

    • org/apache/spark/mllib/regression/LinearRegression.scala

    object LinearRegressionWithSGD // companion object of LinearRegressionWithSGD (think of it as a singleton)
    // It mainly defines the train methods, which pass the training parameters and the RDD on to run();
    // it is the entry point of the LinearRegressionWithSGD class

    // Linear regression model trained with stochastic gradient descent
    // Loss function: f(weights) = 1/n ||A weights - y||^2
    class LinearRegressionWithSGD private[mllib] ( // constructor visible within the mllib package
        private var stepSize: Double,          // step size for each iteration
        private var numIterations: Int,        // number of iterations
        private var regParam: Double,          // regularization parameter
        private var miniBatchFraction: Double) // fraction of the samples used in each iteration
      extends GeneralizedLinearAlgorithm[LinearRegressionModel] with Serializable {
      // gradient of the least-squares loss
      private val gradient = new LeastSquaresGradient()
      // simple gradient update, no regularization
      private val updater = new SimpleUpdater()
      @Since("0.8.0")
      // gradient-descent optimizer built from the gradient and updater above
      override val optimizer = new GradientDescent(gradient, updater)
        .setStepSize(stepSize)
        .setNumIterations(numIterations)
        .setRegParam(regParam)
        .setMiniBatchFraction(miniBatchFraction)
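    • For context, a minimal usage sketch of this class through the companion object's train method (the local SparkContext setup and the tiny in-memory dataset are illustrative assumptions, not part of the source being analyzed):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    object LinearRegressionWithSGDExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lr-demo").setMaster("local[*]"))
        // hypothetical training data: label plus a one-dimensional feature vector (y = 1 + x)
        val training = sc.parallelize(Seq(
          LabeledPoint(2.0, Vectors.dense(1.0)),
          LabeledPoint(3.0, Vectors.dense(2.0)),
          LabeledPoint(4.0, Vectors.dense(3.0))
        )).cache() // run() warns when the input RDD is not cached

        // train(input, numIterations, stepSize) constructs the class above and calls run()
        val model = LinearRegressionWithSGD.train(training, 100, 0.1)
        println(s"weights = ${model.weights}, intercept = ${model.intercept}")
        sc.stop()
      }
    }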
    • The run method used for training

    • org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala

      /**
       * Run the algorithm with the configured parameters on an input RDD
       * of LabeledPoint entries starting from the initial weights provided.
       *
       */
      @Since("1.0.0")
      def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {
        // feature dimension: if not set, take the size of the features vector
        // of the first record in the RDD
        if (numFeatures < 0) {
          numFeatures = input.map(_.features.size).first()
        }
        // warn if the input RDD is not cached
        if (input.getStorageLevel == StorageLevel.NONE) {
          logWarning("The input data is not directly cached, which may hurt performance if its"
            + " parent RDDs are also uncached.")
        }
    
        // Check the data properties before running the optimizer
        if (validateData && !validators.forall(func => func(input))) {
          throw new SparkException("Input validation failed.")
        }
    
        /**
         * Feature scaling (standardization to unit variance): the convergence rate of the
         * optimizer depends on the condition number of the training data, and scaling the
         * columns improves it. Currently it is only enabled for logistic regression.
         */
        val scaler = if (useFeatureScaling) {
          new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
        } else {
          null
        }
    
        // Prepend an extra variable consisting of all 1.0's for the intercept.
        // append the bias (intercept) term, i.e. the constant θ0
        // TODO: Apply feature scaling to the weight vector instead of input data.
        val data =
          if (addIntercept) {
            if (useFeatureScaling) {
              input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
            } else {
              // append the bias after the features and cache the result in memory
              input.map(lp => (lp.label, appendBias(lp.features))).cache()
            }
          } else {
            if (useFeatureScaling) {
              input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
            } else {
              input.map(lp => (lp.label, lp.features))
            }
          }
    
        /**
         * TODO: For better convergence, in logistic regression, the intercepts should be computed
         * from the prior probability distribution of the outcomes; for linear regression,
         * the intercept should be set as the average of response.
         * The initial weights, with an extra slot appended for the intercept when applicable
         */
        val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
          appendBias(initialWeights)
        } else {
          /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */
          initialWeights
        }
        // optimize the weights by gradient descent and return the optimized weight vector
        val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)
        // extract the intercept (the appended last element)
        val intercept = if (addIntercept && numOfLinearPredictor == 1) {
          weightsWithIntercept(weightsWithIntercept.size - 1)
        } else {
          0.0
        }
        // extract the weights (everything except the appended intercept)
        var weights = if (addIntercept && numOfLinearPredictor == 1) {
          Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1))
        } else {
          weightsWithIntercept
        }
    
        /**
         * The weights and intercept are trained in the scaled space; we're converting them back to
         * the original scale.
         * (i.e. when feature scaling was used, the weights must be restored)
         * Math shows that if we only perform standardization without subtracting means, the intercept
         * will not be changed. w_i = w_i' / v_i where w_i' is the coefficient in the scaled space, w_i
         * is the coefficient in the original space, and v_i is the variance of the column i.
         */
        if (useFeatureScaling) {
          if (numOfLinearPredictor == 1) {
            weights = scaler.transform(weights)
          } else {
            /**
             * For `numOfLinearPredictor > 1`, we have to transform the weights back to the original
             * scale for each set of linear predictor. Note that the intercepts have to be explicitly
             * excluded when `addIntercept == true` since the intercepts are part of weights now.
             */
            var i = 0
            val n = weights.size / numOfLinearPredictor
            val weightsArray = weights.toArray
            while (i < numOfLinearPredictor) {
              val start = i * n
              val end = (i + 1) * n - { if (addIntercept) 1 else 0 }
    
              val partialWeightsArray = scaler.transform(
                Vectors.dense(weightsArray.slice(start, end))).toArray
    
              System.arraycopy(partialWeightsArray, 0, weightsArray, start, partialWeightsArray.length)
              i += 1
            }
            weights = Vectors.dense(weightsArray)
          }
        }
    
        // Warn at the end of the run as well, for increased visibility.
        if (input.getStorageLevel == StorageLevel.NONE) {
          logWarning("The input data was not directly cached, which may hurt performance if its"
            + " parent RDDs are also uncached.")
        }
    
        // Unpersist cached data
        if (data.getStorageLevel != StorageLevel.NONE) {
          data.unpersist(false)
        }
        // build and return the trained model
        createModel(weights, intercept)
      }
    }
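    • The intercept handling above relies on appendBias (from org.apache.spark.mllib.util.MLUtils). A simplified re-implementation to show the idea (the helper and values below are illustrative, not the real MLUtils code):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    object InterceptDemo {
      // simplified version of appendBias: add a trailing 1.0 so that the last weight
      // learned by the optimizer plays the role of the intercept θ0
      def appendBias(v: Vector): Vector = Vectors.dense(v.toArray :+ 1.0)

      def main(args: Array[String]): Unit = {
        val withBias = appendBias(Vectors.dense(0.5, 2.0)) // [0.5, 2.0, 1.0]

        // after optimization, run() splits the trained vector back apart,
        // exactly as in the intercept / weights extraction above
        val weightsWithIntercept = Vectors.dense(1.5, -0.3, 0.7)
        val intercept = weightsWithIntercept(weightsWithIntercept.size - 1)    // 0.7
        val weights = Vectors.dense(weightsWithIntercept.toArray.dropRight(1)) // [1.5, -0.3]
        println(s"$withBias -> weights $weights, intercept $intercept")
      }
    }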
    • Solving for the weights by gradient descent

    • org/apache/spark/mllib/optimization/GradientDescent.scala

    • optimizer.optimize->GradientDescent.optimize->GradientDescent.runMiniBatchSGD

      def runMiniBatchSGD(
          data: RDD[(Double, Vector)],
          gradient: Gradient,
          updater: Updater,
          stepSize: Double,
          numIterations: Int,
          regParam: Double,
          miniBatchFraction: Double,
          initialWeights: Vector,
          convergenceTol: Double): (Vector, Array[Double]) = {
        // history of the stochastic losses across iterations
        val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
        // Record previous weight and current one to calculate solution vector difference
    
        var previousWeights: Option[Vector] = None
        var currentWeights: Option[Vector] = None
    
        // number of training samples m
        val numExamples = data.count()
    
        // if no data, return initial weights to avoid NaNs
        if (numExamples == 0) {
          logWarning("GradientDescent.runMiniBatchSGD returning initial weights, no data found")
          return (initialWeights, stochasticLossHistory.toArray)
        }
    
        if (numExamples * miniBatchFraction < 1) {
          logWarning("The miniBatchFraction is too small")
        }
    
        // Initialize weights as a column vector
        var weights = Vectors.dense(initialWeights.toArray)
        val n = weights.size
    
        /**
         * For the first iteration, the regVal will be initialized as sum of weight squares
         * if it's L2 updater; for L1 updater, the same logic is followed.
         */
        // this compute call (with a zero gradient) just initializes regVal,
        // e.g. the sum of squared weights for the L2 updater
        var regVal = updater.compute(
          weights, Vectors.zeros(weights.size), 0, 1, regParam)._2
    
        var converged = false // indicates whether converged based on convergenceTol
        var i = 1
        // iterative weight updates
        while (!converged && i <= numIterations) {
          // broadcast the weights so that every executor receives the current values
          val bcWeights = data.context.broadcast(weights)
          // Sample a subset (fraction miniBatchFraction) of the total data
          // randomly sample a subset of the data and aggregate it with the RDD's treeAggregate:
          // compute the gradient and loss of every sampled record, then sum them up (one map-reduce)
          // compute and sum up the subgradients on this subset (this is one map-reduce)
          val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
            .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))( // initial value: zero gradient, zero loss, zero count
              // fold a sample v into the accumulator c
              seqOp = (c, v) => {
                // c: (grad, loss, count), v: (label, features)
                // compute returns the loss; the gradient is accumulated in place into c._1
                val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
                (c._1, c._2 + l, c._3 + 1)
              },
              // merge two accumulators
              combOp = (c1, c2) => {
                // c: (grad, loss, count)
                (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
              })
              })
          bcWeights.destroy(blocking = false)
    
          if (miniBatchSize > 0) {
            /**
             * lossSum is computed using the weights from the previous iteration
             * and regVal is the regularization value computed in the previous iteration as well.
             */
            // record the loss, then update the weights and regularization value
            stochasticLossHistory += lossSum / miniBatchSize + regVal
            val update = updater.compute(
              weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
              stepSize, i, regParam)
            weights = update._1
            regVal = update._2
    
            previousWeights = currentWeights
            currentWeights = Some(weights)
            if (previousWeights != None && currentWeights != None) {
              converged = isConverged(previousWeights.get,
                currentWeights.get, convergenceTol)
            }
          } else {
            logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
          }
          i += 1
        }
    
        logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format(
          stochasticLossHistory.takeRight(10).mkString(", ")))
        // return the final weights and the loss history
        (weights, stochasticLossHistory.toArray)
    
      }
    
      // check whether the weights have converged
      private def isConverged(
          previousWeights: Vector,
          currentWeights: Vector,
          convergenceTol: Double): Boolean = {
        // To compare with convergence tolerance.
        val previousBDV = previousWeights.asBreeze.toDenseVector
        val currentBDV = currentWeights.asBreeze.toDenseVector
    
        // This represents the difference of updated weights in the iteration.
        val solutionVecDiff: Double = norm(previousBDV - currentBDV)
    
        solutionVecDiff < convergenceTol * Math.max(norm(currentBDV), 1.0)
      }
    
    }
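    • The convergence test compares the change of the weight vector against convergenceTol, relative to the weight norm. A standalone sketch of the same check (Breeze vectors, hypothetical numbers):

    import breeze.linalg.{norm, DenseVector}

    object ConvergenceCheckDemo {
      // same test as isConverged above: the step is small relative to the weight norm
      def converged(prev: DenseVector[Double], curr: DenseVector[Double], tol: Double): Boolean =
        norm(prev - curr) < tol * math.max(norm(curr), 1.0)

      def main(args: Array[String]): Unit = {
        val prev = DenseVector(1.000, 2.000)
        val curr = DenseVector(1.001, 2.001)
        // ||prev - curr|| ≈ 0.0014 and tol * ||curr|| ≈ 0.0022, so this reports convergence
        println(converged(prev, curr, tol = 0.001)) // true
      }
    }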
    • Gradient computation

    • gradient.compute computes the gradient and loss for each sample; linear regression uses LeastSquaresGradient (least squares)

    • org/apache/spark/mllib/optimization/Gradient.scala

    /**
     * :: DeveloperApi ::
     * Compute gradient and loss for a Least-squared loss function, as used in linear regression.
     * This is correct for the averaged least squares loss function (mean squared error)
     *              L = 1/2n ||A weights-y||^2
     * See also the documentation for the precise formulation.
     */
    @DeveloperApi
    class LeastSquaresGradient extends Gradient {
      override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
        // diff = hθ(x) - y
        val diff = dot(data, weights) - label
    
        val loss = diff * diff / 2.0
        val gradient = data.copy
        scal(diff, gradient) // gradient = (hθ(x) - y) * x
        (gradient, loss)
      }
    
      override def compute(
          data: Vector,
          label: Double,
          weights: Vector,
          cumGradient: Vector): Double = {
        val diff = dot(data, weights) - label // diff = hθ(x) - y
        axpy(diff, data, cumGradient) // cumGradient += (hθ(x) - y) * x
        diff * diff / 2.0
      }
    }
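    • A quick numeric check of the formulas in LeastSquaresGradient, re-implemented on plain arrays (the input values are hypothetical):

    object LeastSquaresGradientCheck {
      // same math as LeastSquaresGradient.compute:
      // diff = w·x - y, loss = diff² / 2, gradient = diff * x
      def compute(x: Array[Double], y: Double, w: Array[Double]): (Array[Double], Double) = {
        val diff = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
        (x.map(_ * diff), diff * diff / 2.0)
      }

      def main(args: Array[String]): Unit = {
        val (grad, loss) = compute(Array(1.0, 2.0), y = 5.0, w = Array(1.0, 1.0))
        // w·x = 3, diff = -2, so loss = 2.0 and grad = (-2.0, -4.0)
        println(s"loss = $loss, grad = ${grad.mkString(", ")}")
      }
    }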
    • Weight update

    • org/apache/spark/mllib/optimization/Updater.scala

    /**
     * :: DeveloperApi ::
     * A simple updater for gradient descent *without* any regularization.
     * Uses a step-size decreasing with the square root of the number of iterations.
     */
    @DeveloperApi
    class SimpleUpdater extends Updater {
      override def compute(
          weightsOld: Vector,
          gradient: Vector,
          stepSize: Double,
          iter: Int,
          regParam: Double): (Vector, Double) = {
        // the step size decays with the square root of the current iteration number
        val thisIterStepSize = stepSize / math.sqrt(iter)
        val brzWeights: BV[Double] = weightsOld.asBreeze.toDenseVector
        // weights := weights - thisIterStepSize * gradient
        brzAxpy(-thisIterStepSize, gradient.asBreeze, brzWeights)

        // no regularization, so the regularization value returned is 0
        (Vectors.fromBreeze(brzWeights), 0)
      }
    }
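    • To see the decaying step size in action, a small sketch using the same Breeze axpy update as SimpleUpdater (the weight and gradient values are hypothetical):

    import breeze.linalg.{axpy, DenseVector}

    object SimpleUpdaterDemo {
      // same update as SimpleUpdater: w := w - (stepSize / sqrt(iter)) * gradient, no regularization
      def step(w: DenseVector[Double], grad: DenseVector[Double],
               stepSize: Double, iter: Int): DenseVector[Double] = {
        val out = w.copy
        axpy(-stepSize / math.sqrt(iter), grad, out) // out += factor * grad, in place
        out
      }

      def main(args: Array[String]): Unit = {
        val w = DenseVector(1.0, 1.0)
        val g = DenseVector(0.5, -0.5)
        // the effective step shrinks with the iteration count: 1.0, 0.707, 0.577, ...
        for (iter <- 1 to 3) println(s"iter $iter: ${step(w, g, stepSize = 1.0, iter)}")
      }
    }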