第四範式OpenMLDB: 拓展Spark源碼實現高性能Join

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark是目前最流行的分佈式大數據批處理框架,使用Spark可以輕易地實現上百G甚至T級別數據的SQL運算,例如單行特徵計算或者多表的Join拼接。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四範式OpenMLDB是針對AI場景優化的機器學習開源數據庫項目,實現了數據與計算一致性的離線MPP場景和在線OLTP場景計算引擎。其實MPP引擎可基於Spark實現,並通過拓展Spark源碼實現數倍性能提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/2e\/42\/2ec3a663baaf5109d2f103268e8d1342.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark本身實現也非常高效,基於Antlr實現的了標準ANSI SQL的詞法解析、語法分析,還有在Catalyst模塊中實現大量SQL靜態優化,然後轉成分佈式RDD計算,底層數據結構是使用了Java Unsafe API來自定義內存分佈的UnsafeRow,還依賴Janino JIT編譯器爲計算方法動態生成優化後的JVM bytecode。但在拓展性上仍有改進空間,尤其針對機器學習計算場景的需求雖能滿足但不高效,本文以LastJoin爲例介紹OpenMLDB如何通過拓展Spark源碼來實現數倍甚至數十倍性能提升。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"機器學習場景LastJoin"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"LastJoin是一種AI場景引入的特殊拼表類型,是LeftJoin的變種,在滿足Join條件的前提下,左表的每一行只拼取右表符合一提交的最後一行。LastJoin的語義特性,可以保證拼表後輸出結果的行數與輸入的左表一致。在機器學習場景中就是維持了輸入的樣本表數量一致,不會因爲拼表等數據操作導致最終的樣本數量增加或者減少,這種方式對在線服務支持比較友好也更符合科學家建模需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/0c\/a6\/0cafc7978d5dbe194dyye1a0ac5741a6.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"包含LastJoin功能的OpenMLDB項目代碼以Apache 2.0協議在Github中開源("},{"type":"link","attrs":{"href":"https:\/\/github.com\/4paradigm\/OpenMLDB","title":"","type":null},"content":[{"type":"text","text":"github.com\/4paradigm\/OpenMLDB"}]},{"type":"text","text":"),所有用戶都可放心使用。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"基於Spark的LastJoin實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於LastJoin類型並非ANSI SQL中的標準,因此在SparkSQL等主流計算平臺中都沒有實現,爲了實現類似功能用戶只能通過更底層的DataFrame或RDD等算子來實現。基於Spark算子實現LastJoin的思路是首先對左表添加索引列,然後使用標準LeftOuterJoin,最後對拼接結果進行reduce和去掉索引行,雖然可以實現LastJoin語義但性能還是有很大瓶頸。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比於兼容SQL功能和語法,Spark的另一個特點是用戶可以通過map、reduce、groupby等接口和自定義UDF的方式來實現標準SQL所不支持的數值計算邏輯。但Join功能用戶卻無法通過DataFrame或者RDD API來拓展實現,因爲拼表的實現是在Spark Catalyst物理節點中實現的,涉及了shuffle後多個internal row的拼接,以及生成Java源碼字符串進行JIT的過程,而且根據不同的輸入表數據量,Spark內部會適時選擇BrocastHashJoin、SortMergeJoin或ShuffleHashJoin來實現,普通用戶無法用RDD API來拓展這些拼表實現算法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在OpenMLDB項目中可以查看完整的Spark LastJoin實現,Github代碼地址:"},{"type":"link","attrs":{"href":"https:\/\/github.com\/4paradigm\/OpenMLDB","title":"","type":null},"content":[{"type":"text","text":"github.com\/4paradigm\/OpenMLDB"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一步是對輸入的左表進行索引列擴充,擴充方式有多種實現,只要添加的索引列每一行有unique id即可,下面是第一步的實現代碼。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"\/\/ Add the index column for Spark DataFrame\n def addIndexColumn(spark: SparkSession, df: DataFrame, indexColName: String, method: String): DataFrame = {\n logger.info(\"Add the indexColName(%s) to Spark DataFrame(%s)\".format(indexColName, df.toString()))\n \n method.toLowerCase() match {\n case \"zipwithuniqueid\" | \"zip_withunique_id\" => addColumnByZipWithUniqueId(spark, df, indexColName)\n case \"zipwithindex\" | \"zip_with_index\" => addColumnByZipWithIndex(spark, df, indexColName)\n case \"monotonicallyincreasingid\" | \"monotonically_increasing_id\" =>\n addColumnByMonotonicallyIncreasingId(spark, df, indexColName)\n case _ => throw new HybridSeException(\"Unsupported add index column method: \" + method)\n }\n \n }\n \n def addColumnByZipWithUniqueId(spark: SparkSession, df: DataFrame, indexColName: String = null): DataFrame = {\n logger.info(\"Use zipWithUniqueId to generate index column\")\n val indexedRDD = df.rdd.zipWithUniqueId().map {\n case (row, id) => Row.fromSeq(row.toSeq :+ id)\n }\n spark.createDataFrame(indexedRDD, df.schema.add(indexColName, LongType))\n }\n \n def addColumnByZipWithIndex(spark: SparkSession, df: DataFrame, indexColName: String = null): DataFrame = {\n logger.info(\"Use zipWithIndex to generate index column\")\n val indexedRDD = df.rdd.zipWithIndex().map {\n case (row, id) => Row.fromSeq(row.toSeq :+ id)\n }\n spark.createDataFrame(indexedRDD, df.schema.add(indexColName, LongType))\n }\n \n def addColumnByMonotonicallyIncreasingId(spark: SparkSession,\n df: DataFrame, indexColName: String = null): DataFrame = {\n logger.info(\"Use monotonicallyIncreasingId to generate index column\")\n df.withColumn(indexColName, monotonically_increasing_id())\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二步是進行標準的LeftOuterJoin,由於OpenMLDB底層是基於C++實現,因此多個join condition的表達式都要轉成Spark表達式(封裝成Spark Column對象),然後調用Spark DataFrame的join函數即可,拼接類型使用“left”或者“left_outer\"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"val joined = leftDf.join(rightDf, joinConditions.reduce(_ && _), \"left\")\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三步是對拼接後的表進行reduce,因爲通過LeftOuterJoin有可能對輸入數據進行擴充,也就是1:N的變換,而所有新增的行都擁有第一步進行索引列拓展的unique id,因此針對unique id進行reduce即可,這裏使用Spark DataFrame的groupByKey和mapGroups接口(注意Spark 2.0以下不支持此API),同時如果有額外的排序字段還可以取得每個組的最大值或最小值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"val distinct = joined\n .groupByKey {\n row => row.getLong(indexColIdx)\n }\n .mapGroups {\n case (_, iter) =>\n val timeExtractor = SparkRowUtil.createOrderKeyExtractor(\n timeIdxInJoined, timeColType, nullable=false)\n \n if (isAsc) {\n iter.maxBy(row => {\n if (row.isNullAt(timeIdxInJoined)) {\n Long.MinValue\n } else {\n timeExtractor.apply(row)\n }\n })\n } else {\n iter.minBy(row => {\n if (row.isNullAt(timeIdxInJoined)) {\n Long.MaxValue\n } else {\n timeExtractor.apply(row)\n }\n })\n }\n }(RowEncoder(joined.schema))\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後一步只是去掉索引列即可,通過預先指定的索引列名即可實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"distinct.drop(indexName)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總結一下基於Spark算子實現的LastJoin方案,這是目前基於Spark編程接口最高效的實現了,對於Spark 1.6等低版本還需要使用mapPartition等接口來實現類似mapGroups的功能。由於是基於LeftOuterJoin實現,因此LastJoin的這種實現比LeftOuterJoin還差,實際輸出的數據量反而是更少的,對於左表與右表有大量拼接條件能滿足的情況下,整體內存消耗量還是也是非常大的。因此下面介紹基於Spark源碼修改實現的原生LastJoin,可以避免上述問題。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"拓展Spark源碼的LastJoin實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原生LastJoin實現,是指直接在Spark源碼上實現的LastJoin功能,而不是基於Spark DataFrame和LeftOuterJoin來實現,在性能和內存消耗上有巨大的優化。OpenMLDB使用了定製優化的Spark distribution,其中依賴的Spark源碼也在Github中開源 (GitHub - 4paradigm\/spark at v3.0.0-openmldb) 。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要支持原生的LastJoin,首先在JoinType上就需要加上last語法,由於Spark基於Antlr實現的SQL語法解析也會直接把SQL join類型轉成JoinType,因此只需要修改JoinType.scala文件即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"object JoinType {\n def apply(typ: String): JoinType = typ.toLowerCase(Locale.ROOT).replace(\"_\", \"\") match {\n case \"inner\" => Inner\n case \"outer\" | \"full\" | \"fullouter\" => FullOuter\n case \"leftouter\" | \"left\" => LeftOuter\n \/\/ Add by 4Paradigm\n case \"last\" => LastJoinType\n case \"rightouter\" | \"right\" => RightOuter\n case \"leftsemi\" | \"semi\" => LeftSemi\n case \"leftanti\" | \"anti\" => LeftAnti\n case \"cross\" => Cross\n case _ =>\n val supported = Seq(\n \"inner\",\n \"outer\", \"full\", \"fullouter\", \"full_outer\",\n \"last\", \"leftouter\", \"left\", \"left_outer\",\n \"rightouter\", \"right\", \"right_outer\",\n \"leftsemi\", \"left_semi\", \"semi\",\n \"leftanti\", \"left_anti\", \"anti\",\n \"cross\")\n \n throw new IllegalArgumentException(s\"Unsupported join type '$typ'. \" +\n \"Supported join types include: \" + supported.mkString(\"'\", \"', '\", \"'\") + \".\")\n }\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中LastJoinType類型的實現如下"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"\/\/ Add by 4Paradigm\ncase object LastJoinType extends JoinType {\n override def sql: String = \"LAST\"\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Spark源碼中,還有一些語法檢查類和優化器類都會檢查內部支持的join type,因此在Analyzer.scala、Optimizer.scala、basicLogicalOperators.scala、SparkStrategies.scala這幾個文件中都需要有簡單都修改,scala switch case支持都枚舉類型中增加對新join type的支持,這裏不一一贅述了,只要解析和運行時缺少對新枚舉類型支持就加上即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"\/\/ the output list looks like: join keys, columns from left, columns from right\nval projectList = joinType match {\n case LeftOuter =>\n leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))\n \/\/ Add by 4Paradigm\n case LastJoinType =>\n leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))\n case LeftExistence(_) =>\n leftKeys ++ lUniqueOutput\n case RightOuter =>\n rightKeys ++ lUniqueOutput.map(_.withNullability(true)) ++ rUniqueOutput\n case FullOuter =>\n \/\/ in full outer join, joinCols should be non-null if there is.\n val joinedCols = joinPairs.map { case (l, r) => Alias(Coalesce(Seq(l, r)), l.name)() }\n joinedCols ++\n lUniqueOutput.map(_.withNullability(true)) ++\n rUniqueOutput.map(_.withNullability(true))\n case _ : InnerLike =>\n leftKeys ++ lUniqueOutput ++ rUniqueOutput\n case _ =>\n sys.error(\"Unsupported natural join type \" + joinType)\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面語法解析和數據結構支持新的join type後,重點就是來修改三種Spark join物理算子的實現代碼了。首先是右表比較小時Spark會自動優化成BrocastHashJoin,這時右表通過broadcast拷貝到所有executor的內存裏,遍歷右表可以找到所有符合join condiction的行,如果右表沒有符合條件則保留左表internal row並且右表字段值爲null,如果有一行或多行符合條件就合併兩個internal row到輸出internal row裏,代碼實現在BroadcastHashJoinExec.scala中。因爲新增了join type枚舉類型,因此我們修改這兩個方法來表示支持這種join type,並且通過參數來區分和之前join type的實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":" override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {\n joinType match {\n case _: InnerLike => codegenInner(ctx, input)\n case LeftOuter | RightOuter => codegenOuter(ctx, input)\n \/\/ Add by 4Paradigm\n case LastJoinType => codegenOuter(ctx, input, true)\n case LeftSemi => codegenSemi(ctx, input)\n case LeftAnti => codegenAnti(ctx, input)\n case j: ExistenceJoin => codegenExistence(ctx, input)\n case x =>\n throw new IllegalArgumentException(\n s\"BroadcastHashJoin should not take $x as the JoinType\")\n }\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BrocastHashJoin的核心實現代碼也是使用JIT來實現的,因此我們需要修改codegen成Java代碼字符串的邏輯,在codegenOuter函數中,保留原來LeftOuterJoin的實現,並且使用前面的參數來區分是否使用新的join type實現。這裏修改的邏輯也非常簡單,因爲新的join type只要保證右表有一行數據拼到後就返回,因此不需要通過while來遍歷右表候選集。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":" \n \/\/ Add by 4Paradigm\n if (isLastJoin) {\n s\"\"\"\n |\/\/ generate join key for stream side\n |${keyEv.code}\n |\/\/ find matches from HashRelation\n |$iteratorCls $matches = $anyNull ? null : ($iteratorCls)$relationTerm.get(${keyEv.value});\n |boolean $found = false;\n |\/\/ the last iteration of this loop is to emit an empty row if there is no matched rows.\n |if ($matches != null && $matches.hasNext() || !$found) {\n | UnsafeRow $matched = $matches != null && $matches.hasNext() ?\n | (UnsafeRow) $matches.next() : null;\n | ${checkCondition.trim}\n | if ($conditionPassed) {\n | $found = true;\n | $numOutput.add(1);\n | ${consume(ctx, resultVars)}\n | }\n |}\n \"\"\".stripMargin\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後是修改SortMergeJoin的實現來支持新的join type,如果右表比較大不能直接broacast那麼大概率會使用SortMergeJoin實現,實現原理和前面的修改類似,不一樣的是這裏不是通過JIT實現的,因此直接修改拼表的邏輯即可,保證只要有一行符合條件即可拼接並返回。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":" private def bufferMatchingRows(): Unit = {\n assert(streamedRowKey != null)\n assert(!streamedRowKey.anyNull)\n assert(bufferedRowKey != null)\n assert(!bufferedRowKey.anyNull)\n assert(keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)\n \/\/ This join key may have been produced by a mutable projection, so we need to make a copy:\n matchJoinKey = streamedRowKey.copy()\n bufferedMatches.clear()\n \n \/\/ Add by 4Paradigm\n if (isLastJoin) {\n bufferedMatches.add(bufferedRow.asInstanceOf[UnsafeRow])\n advancedBufferedToRowWithNullFreeJoinKey()\n } else {\n do {\n bufferedMatches.add(bufferedRow.asInstanceOf[UnsafeRow])\n advancedBufferedToRowWithNullFreeJoinKey()\n } while (bufferedRow != null && keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)\n }\n \n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後是ShuffleHashJoin的實現,對應的實現在子類HashJoin.scala中,原理與前面也類似,調用outerJoin函數遍歷stream table的時候,修改核心的遍歷邏輯,保證左表在拼不到時保留並添加null,在拼到一行時立即返回即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"private def outerJoin(\n streamedIter: Iterator[InternalRow],\n hashedRelation: HashedRelation,\n isLastJoin: Boolean = false): Iterator[InternalRow] = {\n val joinedRow = new JoinedRow()\n val keyGenerator = streamSideKeyGenerator()\n val nullRow = new GenericInternalRow(buildPlan.output.length)\n \n streamedIter.flatMap { currentRow =>\n val rowKey = keyGenerator(currentRow)\n joinedRow.withLeft(currentRow)\n val buildIter = hashedRelation.get(rowKey)\n new RowIterator {\n private var found = false\n override def advanceNext(): Boolean = {\n \n \/\/ Add by 4Paradigm to support last join\n if (isLastJoin && found) {\n return false\n }\n \n \/\/ Add by 4Paradigm to support last join\n if (isLastJoin) {\n if (buildIter != null && buildIter.hasNext) {\n val nextBuildRow = buildIter.next()\n if (boundCondition(joinedRow.withRight(nextBuildRow))) {\n found = true\n return true\n }\n }\n } else {\n while (buildIter != null && buildIter.hasNext) {\n val nextBuildRow = buildIter.next()\n if (boundCondition(joinedRow.withRight(nextBuildRow))) {\n found = true\n return true\n }\n }\n }\n \n if (!found) {\n joinedRow.withRight(nullRow)\n found = true\n return true\n }\n false\n }\n override def getRow: InternalRow = joinedRow\n }.toScala\n }\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過對前面JoinType和三種Join物理節點的修改,用戶就可以像其他內置join type一樣,使用SQL或者DataFrame接口來做新的拼表邏輯了,拼表後保證輸出行數與左表一致,結果和最前面基於LeftOuterJoin + dropDuplicated的方案也是一樣的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"LastJoin實現性能對比"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼既然實現的新的Join算法,我們就對比前面兩種方案的性能吧,前面直接基於最新的Spark 3.0開源版,不修改Spark優化器的情況下對於小數據會使用broadcast join進行性能優化,後者直接使用修改Spark源碼編譯後的版本,在小數據下Spark也會優化成broadcast join實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是測試join condiction能拼接多行的情況,對於LeftOuterJoin由於能拼接多行,因此第一個階段使用LeftOuterJoin輸出的表會大很多,第二階段dropDuplication也會更耗時,而LastJoin因爲在shuffle時拼接到單行就返回了,因此不會因爲拼接多行導致性能下降。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/5a\/d0\/5a73ef762f6374cfbbb2c24291c5f9d0.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從結果上看性能差異也很明顯,由於右表數據量都比較小,因此這三組數據Spark都會優化成broadcast join的實現,由於LeftOuterJoin會拼接多行,因此性能就比新的LastJoin慢很多,當數據量增大時LeftOuterJoin拼接的結果表數據量更加爆炸,性能成指數級下降,與LastJoin有數十倍到數百倍的差異,最後還可能因爲OOM導致失敗,而LastJoin不會因爲數據量增大有明顯的性能下降。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"右表能拼接多行對LeftOuterJoin + dropDupilicated方案多少有些不公平,因此我們新增一個測試場景,拼接時保證左表只可能與右表的一行拼接成功,這樣無論是LeftOuterJoin還是LastJoin結果都是一模一樣的,這種場景下性能對比更有意義。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/8y\/20\/8yy82e3890bcb2442faa927f7e973020.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從結果上看性能差異已經沒有那麼明顯了,但LastJoin還是會比前者方案快接近一倍,前面兩組右表數據量比較小被Spark優化成broadcast join實現,最後一組沒有優化會使用sorge merge join實現。從BroadcastHashJoin和SortMergeJoin最終生成的代碼可以看到,如果右表只有一行拼接成功的話,LeftOuterJoin和LastJoin的實現邏輯基本是一模一樣的,那麼性能差異主要在於前者方案還需要進行一次dropDuplicated計算,這個stage雖然計算複雜度不高但在小數據規模下耗時佔比還是比較大,無論是哪種測試方案在這種特殊的拼表場景下修改Spark源碼還是性能最優的實現方案。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"技術總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後簡單總結下,OpenMLDB項目通過理解和修改Spark源碼,可以根據業務場景來實現新的拼表算法邏輯,從性能上看比使用原生Spark接口實現性能可以有巨大的提升。Spark源碼涉及SQL語法解析、Catalyst邏輯計劃優化、JIT代碼動態編譯等,擁有這些基礎後可以對Spark功能和性能進行更底層的拓展。"}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章