第四範式OpenMLDB: 拓展Spark源碼實現高性能Join

原創

第四范式技术团队

2021-09-18 17:23

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"背景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark是目前最流行的分佈式大數據批處理框架，使用Spark可以輕易地實現上百G甚至T級別數據的SQL運算，例如單行特徵計算或者多表的Join拼接。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第四範式OpenMLDB是針對AI場景優化的機器學習開源數據庫項目，實現了數據與計算一致性的離線MPP場景和在線OLTP場景計算引擎。其實MPP引擎可基於Spark實現，並通過拓展Spark源碼實現數倍性能提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/2e\/42\/2ec3a663baaf5109d2f103268e8d1342.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark本身實現也非常高效，基於Antlr實現的了標準ANSI SQL的詞法解析、語法分析，還有在Catalyst模塊中實現大量SQL靜態優化，然後轉成分佈式RDD計算，底層數據結構是使用了Java Unsafe API來自定義內存分佈的UnsafeRow，還依賴Janino JIT編譯器爲計算方法動態生成優化後的JVM bytecode。但在拓展性上仍有改進空間，尤其針對機器學習計算場景的需求雖能滿足但不高效，本文以LastJoin爲例介紹OpenMLDB如何通過拓展Spark源碼來實現數倍甚至數十倍性能提升。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"機器學習場景LastJoin"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"LastJoin是一種AI場景引入的特殊拼表類型，是LeftJoin的變種，在滿足Join條件的前提下，左表的每一行只拼取右表符合一提交的最後一行。LastJoin的語義特性，可以保證拼表後輸出結果的行數與輸入的左表一致。在機器學習場景中就是維持了輸入的樣本表數量一致，不會因爲拼表等數據操作導致最終的樣本數量增加或者減少，這種方式對在線服務支持比較友好也更符合科學家建模需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/0c\/a6\/0cafc7978d5dbe194dyye1a0ac5741a6.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"包含LastJoin功能的OpenMLDB項目代碼以Apache 2.0協議在Github中開源（"},{"type":"link","attrs":{"href":"https:\/\/github.com\/4paradigm\/OpenMLDB","title":"","type":null},"content":[{"type":"text","text":"github.com\/4paradigm\/OpenMLDB"}]},{"type":"text","text":"），所有用戶都可放心使用。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"基於Spark的LastJoin實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於LastJoin類型並非ANSI SQL中的標準，因此在SparkSQL等主流計算平臺中都沒有實現，爲了實現類似功能用戶只能通過更底層的DataFrame或RDD等算子來實現。基於Spark算子實現LastJoin的思路是首先對左表添加索引列，然後使用標準LeftOuterJoin，最後對拼接結果進行reduce和去掉索引行，雖然可以實現LastJoin語義但性能還是有很大瓶頸。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相比於兼容SQL功能和語法，Spark的另一個特點是用戶可以通過map、reduce、groupby等接口和自定義UDF的方式來實現標準SQL所不支持的數值計算邏輯。但Join功能用戶卻無法通過DataFrame或者RDD API來拓展實現，因爲拼表的實現是在Spark Catalyst物理節點中實現的，涉及了shuffle後多個internal row的拼接，以及生成Java源碼字符串進行JIT的過程，而且根據不同的輸入表數據量，Spark內部會適時選擇BrocastHashJoin、SortMergeJoin或ShuffleHashJoin來實現，普通用戶無法用RDD API來拓展這些拼表實現算法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在OpenMLDB項目中可以查看完整的Spark LastJoin實現，Github代碼地址："},{"type":"link","attrs":{"href":"https:\/\/github.com\/4paradigm\/OpenMLDB","title":"","type":null},"content":[{"type":"text","text":"github.com\/4paradigm\/OpenMLDB"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一步是對輸入的左表進行索引列擴充，擴充方式有多種實現，只要添加的索引列每一行有unique id即可，下面是第一步的實現代碼。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"\/\/ Add the index column for Spark DataFrame\n def addIndexColumn(spark: SparkSession, df: DataFrame, indexColName: String, method: String): DataFrame = {\n logger.info(\"Add the indexColName(%s) to Spark DataFrame(%s)\".format(indexColName, df.toString()))\n \n method.toLowerCase() match {\n case \"zipwithuniqueid\" | \"zip_withunique_id\" => addColumnByZipWithUniqueId(spark, df, indexColName)\n case \"zipwithindex\" | \"zip_with_index\" => addColumnByZipWithIndex(spark, df, indexColName)\n case \"monotonicallyincreasingid\" | \"monotonically_increasing_id\" =>\n addColumnByMonotonicallyIncreasingId(spark, df, indexColName)\n case _ => throw new HybridSeException(\"Unsupported add index column method: \" + method)\n }\n \n }\n \n def addColumnByZipWithUniqueId(spark: SparkSession, df: DataFrame, indexColName: String = null): DataFrame = {\n logger.info(\"Use zipWithUniqueId to generate index column\")\n val indexedRDD = df.rdd.zipWithUniqueId().map {\n case (row, id) => Row.fromSeq(row.toSeq :+ id)\n }\n spark.createDataFrame(indexedRDD, df.schema.add(indexColName, LongType))\n }\n \n def addColumnByZipWithIndex(spark: SparkSession, df: DataFrame, indexColName: String = null): DataFrame = {\n logger.info(\"Use zipWithIndex to generate index column\")\n val indexedRDD = df.rdd.zipWithIndex().map {\n case (row, id) => Row.fromSeq(row.toSeq :+ id)\n }\n spark.createDataFrame(indexedRDD, df.schema.add(indexColName, LongType))\n }\n \n def addColumnByMonotonicallyIncreasingId(spark: SparkSession,\n df: DataFrame, indexColName: String = null): DataFrame = {\n logger.info(\"Use monotonicallyIncreasingId to generate index column\")\n df.withColumn(indexColName, monotonically_increasing_id())\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二步是進行標準的LeftOuterJoin，由於OpenMLDB底層是基於C++實現，因此多個join condition的表達式都要轉成Spark表達式（封裝成Spark Column對象），然後調用Spark DataFrame的join函數即可，拼接類型使用“left”或者“left_outer\"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"val joined = leftDf.join(rightDf, joinConditions.reduce(_ && _), \"left\")\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三步是對拼接後的表進行reduce，因爲通過LeftOuterJoin有可能對輸入數據進行擴充，也就是1:N的變換，而所有新增的行都擁有第一步進行索引列拓展的unique id，因此針對unique id進行reduce即可，這裏使用Spark DataFrame的groupByKey和mapGroups接口（注意Spark 2.0以下不支持此API），同時如果有額外的排序字段還可以取得每個組的最大值或最小值。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"val distinct = joined\n .groupByKey {\n row => row.getLong(indexColIdx)\n }\n .mapGroups {\n case (_, iter) =>\n val timeExtractor = SparkRowUtil.createOrderKeyExtractor(\n timeIdxInJoined, timeColType, nullable=false)\n \n if (isAsc) {\n iter.maxBy(row => {\n if (row.isNullAt(timeIdxInJoined)) {\n Long.MinValue\n } else {\n timeExtractor.apply(row)\n }\n })\n } else {\n iter.minBy(row => {\n if (row.isNullAt(timeIdxInJoined)) {\n Long.MaxValue\n } else {\n timeExtractor.apply(row)\n }\n })\n }\n }(RowEncoder(joined.schema))\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後一步只是去掉索引列即可，通過預先指定的索引列名即可實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"distinct.drop(indexName)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總結一下基於Spark算子實現的LastJoin方案，這是目前基於Spark編程接口最高效的實現了，對於Spark 1.6等低版本還需要使用mapPartition等接口來實現類似mapGroups的功能。由於是基於LeftOuterJoin實現，因此LastJoin的這種實現比LeftOuterJoin還差，實際輸出的數據量反而是更少的，對於左表與右表有大量拼接條件能滿足的情況下，整體內存消耗量還是也是非常大的。因此下面介紹基於Spark源碼修改實現的原生LastJoin，可以避免上述問題。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"拓展Spark源碼的LastJoin實現"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原生LastJoin實現，是指直接在Spark源碼上實現的LastJoin功能，而不是基於Spark DataFrame和LeftOuterJoin來實現，在性能和內存消耗上有巨大的優化。OpenMLDB使用了定製優化的Spark distribution，其中依賴的Spark源碼也在Github中開源（GitHub - 4paradigm\/spark at v3.0.0-openmldb）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要支持原生的LastJoin，首先在JoinType上就需要加上last語法，由於Spark基於Antlr實現的SQL語法解析也會直接把SQL join類型轉成JoinType，因此只需要修改JoinType.scala文件即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"object JoinType {\n def apply(typ: String): JoinType = typ.toLowerCase(Locale.ROOT).replace(\"_\", \"\") match {\n case \"inner\" => Inner\n case \"outer\" | \"full\" | \"fullouter\" => FullOuter\n case \"leftouter\" | \"left\" => LeftOuter\n \/\/ Add by 4Paradigm\n case \"last\" => LastJoinType\n case \"rightouter\" | \"right\" => RightOuter\n case \"leftsemi\" | \"semi\" => LeftSemi\n case \"leftanti\" | \"anti\" => LeftAnti\n case \"cross\" => Cross\n case _ =>\n val supported = Seq(\n \"inner\",\n \"outer\", \"full\", \"fullouter\", \"full_outer\",\n \"last\", \"leftouter\", \"left\", \"left_outer\",\n \"rightouter\", \"right\", \"right_outer\",\n \"leftsemi\", \"left_semi\", \"semi\",\n \"leftanti\", \"left_anti\", \"anti\",\n \"cross\")\n \n throw new IllegalArgumentException(s\"Unsupported join type '$typ'. \" +\n \"Supported join types include: \" + supported.mkString(\"'\", \"', '\", \"'\") + \".\")\n }\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中LastJoinType類型的實現如下"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"\/\/ Add by 4Paradigm\ncase object LastJoinType extends JoinType {\n override def sql: String = \"LAST\"\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Spark源碼中，還有一些語法檢查類和優化器類都會檢查內部支持的join type，因此在Analyzer.scala、Optimizer.scala、basicLogicalOperators.scala、SparkStrategies.scala這幾個文件中都需要有簡單都修改，scala switch case支持都枚舉類型中增加對新join type的支持，這裏不一一贅述了，只要解析和運行時缺少對新枚舉類型支持就加上即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"\/\/ the output list looks like: join keys, columns from left, columns from right\nval projectList = joinType match {\n case LeftOuter =>\n leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))\n \/\/ Add by 4Paradigm\n case LastJoinType =>\n leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))\n case LeftExistence(_) =>\n leftKeys ++ lUniqueOutput\n case RightOuter =>\n rightKeys ++ lUniqueOutput.map(_.withNullability(true)) ++ rUniqueOutput\n case FullOuter =>\n \/\/ in full outer join, joinCols should be non-null if there is.\n val joinedCols = joinPairs.map { case (l, r) => Alias(Coalesce(Seq(l, r)), l.name)() }\n joinedCols ++\n lUniqueOutput.map(_.withNullability(true)) ++\n rUniqueOutput.map(_.withNullability(true))\n case _ : InnerLike =>\n leftKeys ++ lUniqueOutput ++ rUniqueOutput\n case _ =>\n sys.error(\"Unsupported natural join type \" + joinType)\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面語法解析和數據結構支持新的join type後，重點就是來修改三種Spark join物理算子的實現代碼了。首先是右表比較小時Spark會自動優化成BrocastHashJoin，這時右表通過broadcast拷貝到所有executor的內存裏，遍歷右表可以找到所有符合join condiction的行，如果右表沒有符合條件則保留左表internal row並且右表字段值爲null，如果有一行或多行符合條件就合併兩個internal row到輸出internal row裏，代碼實現在BroadcastHashJoinExec.scala中。因爲新增了join type枚舉類型，因此我們修改這兩個方法來表示支持這種join type，並且通過參數來區分和之前join type的實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":" override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {\n joinType match {\n case _: InnerLike => codegenInner(ctx, input)\n case LeftOuter | RightOuter => codegenOuter(ctx, input)\n \/\/ Add by 4Paradigm\n case LastJoinType => codegenOuter(ctx, input, true)\n case LeftSemi => codegenSemi(ctx, input)\n case LeftAnti => codegenAnti(ctx, input)\n case j: ExistenceJoin => codegenExistence(ctx, input)\n case x =>\n throw new IllegalArgumentException(\n s\"BroadcastHashJoin should not take $x as the JoinType\")\n }\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BrocastHashJoin的核心實現代碼也是使用JIT來實現的，因此我們需要修改codegen成Java代碼字符串的邏輯，在codegenOuter函數中，保留原來LeftOuterJoin的實現，並且使用前面的參數來區分是否使用新的join type實現。這裏修改的邏輯也非常簡單，因爲新的join type只要保證右表有一行數據拼到後就返回，因此不需要通過while來遍歷右表候選集。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":" \n \/\/ Add by 4Paradigm\n if (isLastJoin) {\n s\"\"\"\n |\/\/ generate join key for stream side\n |${keyEv.code}\n |\/\/ find matches from HashRelation\n |$iteratorCls $matches = $anyNull ? null : ($iteratorCls)$relationTerm.get(${keyEv.value});\n |boolean $found = false;\n |\/\/ the last iteration of this loop is to emit an empty row if there is no matched rows.\n |if ($matches != null && $matches.hasNext() || !$found) {\n | UnsafeRow $matched = $matches != null && $matches.hasNext() ?\n | (UnsafeRow) $matches.next() : null;\n | ${checkCondition.trim}\n | if ($conditionPassed) {\n | $found = true;\n | $numOutput.add(1);\n | ${consume(ctx, resultVars)}\n | }\n |}\n \"\"\".stripMargin\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後是修改SortMergeJoin的實現來支持新的join type，如果右表比較大不能直接broacast那麼大概率會使用SortMergeJoin實現，實現原理和前面的修改類似，不一樣的是這裏不是通過JIT實現的，因此直接修改拼表的邏輯即可，保證只要有一行符合條件即可拼接並返回。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":" private def bufferMatchingRows(): Unit = {\n assert(streamedRowKey != null)\n assert(!streamedRowKey.anyNull)\n assert(bufferedRowKey != null)\n assert(!bufferedRowKey.anyNull)\n assert(keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)\n \/\/ This join key may have been produced by a mutable projection, so we need to make a copy:\n matchJoinKey = streamedRowKey.copy()\n bufferedMatches.clear()\n \n \/\/ Add by 4Paradigm\n if (isLastJoin) {\n bufferedMatches.add(bufferedRow.asInstanceOf[UnsafeRow])\n advancedBufferedToRowWithNullFreeJoinKey()\n } else {\n do {\n bufferedMatches.add(bufferedRow.asInstanceOf[UnsafeRow])\n advancedBufferedToRowWithNullFreeJoinKey()\n } while (bufferedRow != null && keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)\n }\n \n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後是ShuffleHashJoin的實現，對應的實現在子類HashJoin.scala中，原理與前面也類似，調用outerJoin函數遍歷stream table的時候，修改核心的遍歷邏輯，保證左表在拼不到時保留並添加null，在拼到一行時立即返回即可。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"private def outerJoin(\n streamedIter: Iterator[InternalRow],\n hashedRelation: HashedRelation,\n isLastJoin: Boolean = false): Iterator[InternalRow] = {\n val joinedRow = new JoinedRow()\n val keyGenerator = streamSideKeyGenerator()\n val nullRow = new GenericInternalRow(buildPlan.output.length)\n \n streamedIter.flatMap { currentRow =>\n val rowKey = keyGenerator(currentRow)\n joinedRow.withLeft(currentRow)\n val buildIter = hashedRelation.get(rowKey)\n new RowIterator {\n private var found = false\n override def advanceNext(): Boolean = {\n \n \/\/ Add by 4Paradigm to support last join\n if (isLastJoin && found) {\n return false\n }\n \n \/\/ Add by 4Paradigm to support last join\n if (isLastJoin) {\n if (buildIter != null && buildIter.hasNext) {\n val nextBuildRow = buildIter.next()\n if (boundCondition(joinedRow.withRight(nextBuildRow))) {\n found = true\n return true\n }\n }\n } else {\n while (buildIter != null && buildIter.hasNext) {\n val nextBuildRow = buildIter.next()\n if (boundCondition(joinedRow.withRight(nextBuildRow))) {\n found = true\n return true\n }\n }\n }\n \n if (!found) {\n joinedRow.withRight(nullRow)\n found = true\n return true\n }\n false\n }\n override def getRow: InternalRow = joinedRow\n }.toScala\n }\n }\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過對前面JoinType和三種Join物理節點的修改，用戶就可以像其他內置join type一樣，使用SQL或者DataFrame接口來做新的拼表邏輯了，拼表後保證輸出行數與左表一致，結果和最前面基於LeftOuterJoin + dropDuplicated的方案也是一樣的。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"LastJoin實現性能對比"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼既然實現的新的Join算法，我們就對比前面兩種方案的性能吧，前面直接基於最新的Spark 3.0開源版，不修改Spark優化器的情況下對於小數據會使用broadcast join進行性能優化，後者直接使用修改Spark源碼編譯後的版本，在小數據下Spark也會優化成broadcast join實現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是測試join condiction能拼接多行的情況，對於LeftOuterJoin由於能拼接多行，因此第一個階段使用LeftOuterJoin輸出的表會大很多，第二階段dropDuplication也會更耗時，而LastJoin因爲在shuffle時拼接到單行就返回了，因此不會因爲拼接多行導致性能下降。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/5a\/d0\/5a73ef762f6374cfbbb2c24291c5f9d0.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從結果上看性能差異也很明顯，由於右表數據量都比較小，因此這三組數據Spark都會優化成broadcast join的實現，由於LeftOuterJoin會拼接多行，因此性能就比新的LastJoin慢很多，當數據量增大時LeftOuterJoin拼接的結果表數據量更加爆炸，性能成指數級下降，與LastJoin有數十倍到數百倍的差異，最後還可能因爲OOM導致失敗，而LastJoin不會因爲數據量增大有明顯的性能下降。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"右表能拼接多行對LeftOuterJoin + dropDupilicated方案多少有些不公平，因此我們新增一個測試場景，拼接時保證左表只可能與右表的一行拼接成功，這樣無論是LeftOuterJoin還是LastJoin結果都是一模一樣的，這種場景下性能對比更有意義。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/8y\/20\/8yy82e3890bcb2442faa927f7e973020.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從結果上看性能差異已經沒有那麼明顯了，但LastJoin還是會比前者方案快接近一倍，前面兩組右表數據量比較小被Spark優化成broadcast join實現，最後一組沒有優化會使用sorge merge join實現。從BroadcastHashJoin和SortMergeJoin最終生成的代碼可以看到，如果右表只有一行拼接成功的話，LeftOuterJoin和LastJoin的實現邏輯基本是一模一樣的，那麼性能差異主要在於前者方案還需要進行一次dropDuplicated計算，這個stage雖然計算複雜度不高但在小數據規模下耗時佔比還是比較大，無論是哪種測試方案在這種特殊的拼表場景下修改Spark源碼還是性能最優的實現方案。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"技術總結"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後簡單總結下，OpenMLDB項目通過理解和修改Spark源碼，可以根據業務場景來實現新的拼表算法邏輯，從性能上看比使用原生Spark接口實現性能可以有巨大的提升。Spark源碼涉及SQL語法解析、Catalyst邏輯計劃優化、JIT代碼動態編譯等，擁有這些基礎後可以對Spark功能和性能進行更底層的拓展。"}]}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

巧用 TiCDC Syncpiont 構建銀行實時交易和準實時計算一體化架構

本文闡述了某商業銀行如何利用 TiCDC Syncpoint 功能，在 TiDB 平臺上構建一個既能處理實時交易又能進行準實時計算的一體化架構，用以優化其零售資格業務系統的實踐。通過遷移到 TiDB 並巧妙應用 Syncpoint，該銀行成

2024-04-30 22:24:58

Apache DolphinScheduler支持Flink嗎？

隨着大數據技術的快速發展，很多企業開始將Flink引入到生產環境中，以滿足日益複雜的數據處理需求。而作爲一款企業級的數據調度平臺，Apache DolphinScheduler也跟上了時代步伐，推出了對Flink任務類型的支持。 Flink

2024-04-30 11:49:27

華爲云云原生FinOps解決方案，釋放雲原生最大價值

華爲云云原生FinOps通過可視化的成本洞察和成本優化，幫助用戶精細用雲以提升單位成本的資源利用率，實現降本增效目標企業上雲現狀：上雲趨勢持續加深，但云上開支存在顯著浪費根據Flexer 2024年最新的一項調查顯示，當前有超過7

2024-04-29 22:33:46

三喜臨門！信必優連收三家金融行業客戶表揚信

近日，信必優陸續收到全球知名銀行客戶、中國證券行業TOP級客戶、中國期貨行業TOP級客戶的表揚信。客戶高度讚揚我司員工在工作中表現突出，以積極主動、團結協作的工作態度和出色的技術能力，在技術團隊中做出表率，爲項目的順利交付做出重要貢獻。

2024-04-29 22:32:22

數字化轉型新篇章：企業通往智能化的新範式

早在十多年前，一些具有前瞻視野的企業以實現“數字化”爲目標啓動轉型實踐。但時至今日，可以說尚無幾家企業能夠在真正意義上實現“數字化”。在實現“數字化”的征途上，人們發現，努力愈進，彷彿終點愈遠。究其原因，還在於轉型一直落後於技術邊界的拓展

2024-04-29 21:22:20

從NoSQL到NewSQL——10年代大數據浪潮下的技術革新

引言在數字化浪潮的推動下，數據庫技術已成爲支撐數字經濟的堅實基石。騰訊雲 TVP《技術指針》聯合《明說三人行》特別策劃的直播系列——【中國數據庫前世今生】，我們將通過五期直播，帶您穿越五個十年，深入探討每個時代的數據庫演變

2024-04-28 23:12:26

“百團大戰”下，20年代的國產數據庫如何乘風破浪？

引言在數字化浪潮的推動下，數據庫技術已成爲支撐數字經濟的堅實基石。騰訊雲 TVP《技術指針》聯合《明說三人行》特別策劃的直播系列——【中國數據庫前世今生】，我們將通過五期直播，帶您穿越五個十年，深入探討每個時代的數據庫

2024-04-28 23:12:24

大數據小白的測試成長之路

引言 22年校招入職京東後，我一直在數據中臺測試部從事測試開發的工作。畢業後，寫的最多的文檔是測試計劃和測試報告，鮮有機會就自己的成長碼字進行回顧和總結。借“up技術人”欄目，也終於是在工作之餘回頭望，對自己這近兩年時光進行一個小總結

2024-04-28 11:17:19

如何從0到1設計診斷系統

引言在整車電子電氣體系中，診斷系統的設計扮演着至關重要的角色，負責支持整車的刷寫、故障排查和EOL(End of Line)等關鍵操作。這一重要性在於這些操作的實現都依賴於診斷系統的全面支持。因此，在設計診斷系統時，必須確保

2024-04-26 22:43:26

華爲雲Stack8.3面向香港正式發佈，六大亮點激發雲上躍遷

本文分享自華爲雲社區《華爲雲Stack8.3面向香港正式發佈，六大亮點激發雲上躍遷》，作者：華爲雲頭條。 2024年4月23日，在華爲雲香港峯會2024上，華爲混合雲副總裁胡玉海面向香港市場發佈華爲雲Stack8.3，提供110+本地

2024-04-26 10:33:21

對接HiveMetaStore，擁抱開源大數據

本文分享自華爲雲社區《對接HiveMetaStore，擁抱開源大數據》，作者：睡覺是大事。 1. 前言適用版本：9.1.0及以上在大數據融合分析時代，面對海量的數據以及各種複雜的查詢，性能是我們使用一款數據處理引擎最重要的考量

2024-04-24 22:33:08

重磅新品發佈！雲耀數據庫HRDS，享受輕量級的極致體驗

本文分享自華爲雲社區《重磅新品發佈！雲耀數據庫HRDS，享受輕量級的極致體驗！》，作者：GaussDB 數據庫。所謂，凡有井水處，即能歌柳詞。大數據時代，凡有數據處，必有數據庫。隨着業務需求的不斷擴大和數據量的激增，數

2024-04-23 22:32:33

沙特2030年願景和對中國IT企業的市場機會分析

沙特2030年願景和對中國IT企業的市場機會分析前言：最近“開源老DJ，帶你去沙特”欄目第一期已經播出，收到了不錯的反響。見COPU官網的回顧。（https://mp.weixin.qq.com/s/3B0jNVhybxTF1xPiy

2024-04-23 22:24:54

03-爲啥大模型LLM還沒能完全替代你？

1 不具備記憶能力的它是零狀態的，我們平常在使用一些大模型產品，尤其在使用他們的API的時候，我們會發現那你和它對話，尤其是多輪對話的時候，經過一些輪次後，這些記憶就消失了，因爲它也記不住那麼多。 2 上下文窗口的限制大模型對其inpu

2024-04-23 01:07:00

入職3年-我如何做一名AI產品經理

引言從2021年校招加入京東開始，我一直從事AI產品經理的工作，有幸見證了AI行業的熱情從一臺臺服務器燒到了全世界各個角落，也見證了京東AI中臺團隊的影響力如何一步步的擴大。從21年的迷茫到24年的堅定，很慶幸我正走在適合自己的道路上，

2024-04-22 11:16:31

24小時熱門文章

最新文章

第四範式OpenMLDB: 拓展Spark源碼實現高性能Join

最新評論文章