The end-to-end path of a metric query through the Flink REST API

Flink metrics

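I recently traced how metric information is queried, starting from the REST endpoints used by the Flink UI, and am recording the process here. The code is based on the release-1.9 branch.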

Entry point

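The entry point here is the /jobs/:jobid REST endpoint used by the Flink UI. It is handled by JobDetailsHandler; how that handler is registered and bound to its path can be seen in WebMonitorEndpoint, which is outside the scope of this post. The query logic lives in JobDetailsHandler.handleRequest.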

protected JobDetailsInfo handleRequest(
			HandlerRequest<EmptyRequestBody, JobMessageParameters> request,
			AccessExecutionGraph executionGraph) throws RestHandlerException {
		return createJobDetailsInfo(executionGraph, metricFetcher);
	}

Note the metricFetcher member variable: it is the core of the metric query.
Let's follow the fields of the returned JobDetailsInfo object to see how the metrics are queried.

First, the response of the REST endpoint:

{
    "jid": "c0f3aa66449fe0ce19b4e03323fee5ae",
    "name": "antc4blink788000260",
    "isStoppable": false,
    "state": "RUNNING",
    "start-time": 1591704258551,
    "end-time": -1,
    "duration": 68423742,
    "now": 1591772682293,
    ...
    "vertices": [
        {
            "id": "717c7b8afebbfb7137f6f0f99beb2a94",
            "topology-id": 0,
            "name": "Source: GSlsTableSource-dws_behavior_ri_source-Stream -> SourceConversion(table:[builtin, default, _DataStreamTable_0, source: [GSlsTableSource-dws_behavior_ri_source]], fields:(f0)) -> correlate: table(SlsParser_dws_behavior_ri_source0($cor0.f0)), select: user_id,event_time,item_id,biz_type -> Calc(select: (user_id AS userid, event_time AS eventtime, item_id AS itemid, biz_type, ((MD5(user_id) SUBSTR 1 SUBSTR 4) CONCAT '#' CONCAT user_id) AS rowkey_user_id, (CAST((event_time SUBSTR 1 SUBSTR 10)) FROM_UNIXTIME 'yyyyMMdd') AS day_format, ((MD5((CAST((event_time SUBSTR 1 SUBSTR 10)) FROM_UNIXTIME 'yyyyMMdd')) SUBSTR 1 SUBSTR 4) CONCAT '#' CONCAT (CAST((event_time SUBSTR 1 SUBSTR 10)) FROM_UNIXTIME 'yyyyMMdd')) AS rowkey_day_format, (CAST((event_time SUBSTR 1 SUBSTR 10)) FROM_UNIXTIME 'HH') AS hour_format, ((MD5(item_id) SUBSTR 1 SUBSTR 4) CONCAT '_' CONCAT item_id) AS rowkey_item_id)) -> AsyncJoinTable(table: (AliHBase: [myddsczssd_time_sematic]), joinType: LeftOuterJoin, join: (userid, ...",
            "parallelism": 128,
            "status": "RUNNING",
            "start-time": 1591704339321,
            "end-time": -1,
            "duration": 68342972,
            "tasks": {
                "RUNNING": 128,
                "CANCELED": 0,
                "CANCELING": 0,
                "FAILED": 0,
                "FINISHED": 0,
                "CREATED": 0,
                "RECONCILING": 0,
                "SCHEDULED": 0,
                "DEPLOYING": 0
            },
            // the metric information is here
            "metrics": {
                "read-bytes": 0,
                "read-bytes-complete": true,
                "write-bytes": 0,
                "write-bytes-complete": true,
                "read-records": 0,
                "read-records-complete": true,
                "write-records": 0,
                "write-records-complete": true,
                "buffers-in-pool-usage-max": 0.0,
                "buffers-in-pool-usage-max-complete": true,
                "buffers-out-pool-usage-max": 0.0,
                "buffers-out-pool-usage-max-complete": true,
                "tps": 0.21666666666666667,
                "tps-complete": true,
                "delay": 1383,
                "delay-complete": true
            }
        }
    ],
    ...
}

Next, find the corresponding fields in the model classes:

public class JobDetailsInfo implements ResponseBody {
	@JsonProperty(FIELD_NAME_JOB_ID)
	@JsonSerialize(using = JobIDSerializer.class)
	private final JobID jobId;

	@JsonProperty(FIELD_NAME_JOB_NAME)
	private final String name;

	@JsonProperty(FIELD_NAME_IS_STOPPABLE)
	private final boolean isStoppable;

	@JsonProperty(FIELD_NAME_JOB_STATUS)
	private final JobStatus jobStatus;

	@JsonProperty(FIELD_NAME_START_TIME)
	private final long startTime;

	@JsonProperty(FIELD_NAME_END_TIME)
	private final long endTime;

	@JsonProperty(FIELD_NAME_DURATION)
	private final long duration;

	@JsonProperty(FIELD_NAME_NOW)
	private final long now;

	@JsonProperty(FIELD_NAME_TIMESTAMPS)
	private final Map<JobStatus, Long> timestamps;

    // the metrics field lives inside the element type of this collection
	@JsonProperty(FIELD_NAME_JOB_VERTEX_INFOS)
	private final Collection<JobVertexDetailsInfo> jobVertexInfos;

	@JsonProperty(FIELD_NAME_JOB_VERTICES_PER_STATE)
	private final Map<ExecutionState, Integer> jobVerticesPerState;

	@JsonProperty(FIELD_NAME_JSON_PLAN)
	@JsonRawValue
	private final String jsonPlan;
}

public static final class JobVertexDetailsInfo {

		@JsonProperty(FIELD_NAME_JOB_VERTEX_ID)
		@JsonSerialize(using = JobVertexIDSerializer.class)
		private final JobVertexID jobVertexID;

		@JsonProperty(FIELD_NAME_TOPOLOGY_ID)
		private final int topologyID;

		@JsonProperty(FIELD_NAME_JOB_VERTEX_NAME)
		private final String name;

		@JsonProperty(FIELD_NAME_PARALLELISM)
		private final int parallelism;

		@JsonProperty(FIELD_NAME_JOB_VERTEX_STATE)
		private final ExecutionState executionState;

		@JsonProperty(FIELD_NAME_JOB_VERTEX_START_TIME)
		private final long startTime;

		@JsonProperty(FIELD_NAME_JOB_VERTEX_END_TIME)
		private final long endTime;

		@JsonProperty(FIELD_NAME_JOB_VERTEX_DURATION)
		private final long duration;

		@JsonProperty(FIELD_NAME_TASKS_PER_STATE)
		private final Map<ExecutionState, Integer> tasksPerState;

        // the metrics object is here
		@JsonProperty(FIELD_NAME_JOB_VERTEX_METRICS)
		private final IOMetricsInfo jobVertexMetrics;
}

This object is constructed in JobDetailsHandler.createJobVertexDetailsInfo:

private static JobDetailsInfo.JobVertexDetailsInfo createJobVertexDetailsInfo(
			AccessExecutionJobVertex ejv,
			long now,
			JobID jobId,
			int topologyID,
			MetricFetcher<?> metricFetcher) {

        ...

        // build the IOMetricsInfo
		for (AccessExecutionVertex vertex : ejv.getTaskVertices()) {
            // core logic
			counts.addIOMetrics(
				vertex.getCurrentExecutionAttempt(),
				metricFetcher,
				jobId.toString(),
				ejv.getJobVertexId().toString());
		}


		final IOMetricsInfo jobVertexMetrics = new IOMetricsInfo(counts);

		return new JobDetailsInfo.JobVertexDetailsInfo(
			ejv.getJobVertexId(),
			topologyID,
			ejv.getName(),
			ejv.getParallelism(),
			jobVertexState,
			startTime,
			endTime,
			duration,
			tasksPerStateMap,
			jobVertexMetrics);
	}

Now look at the addIOMetrics logic:

public void addIOMetrics(AccessExecution attempt, @Nullable MetricFetcher fetcher, String jobID, String taskID) {
    // If the AccessExecution has finished or otherwise stopped (terminal state), try to read the metrics directly from it
		if (attempt.getState().isTerminal()) {
			IOMetrics ioMetrics = attempt.getIOMetrics();
			if (ioMetrics != null) { // execAttempt is already finished, use final metrics stored in ExecutionGraph
				this.numBytesIn += ioMetrics.getNumBytesIn();
				this.numBytesOut += ioMetrics.getNumBytesOut();
				this.numRecordsIn += ioMetrics.getNumRecordsIn();
				this.numRecordsOut += ioMetrics.getNumRecordsOut();
			}
		} else { // execAttempt is still running, use MetricQueryService instead
        // If the AccessExecution is still running, get the metrics through the MetricFetcher
			if (fetcher != null) {
				fetcher.update();
				MetricStore.ComponentMetricStore metrics = fetcher.getMetricStore()
					.getSubtaskMetricStore(jobID, taskID, attempt.getParallelSubtaskIndex());
				if (metrics != null) {
					/**
					 * We want to keep track of missing metrics to be able to make a difference between 0 as a value
					 * and a missing value.
					 * In case a metric is missing for a parallel instance of a task, we set the complete flag as
					 * false.
					 */
					if (metrics.getMetric(MetricNames.IO_NUM_BYTES_IN) == null){
						this.numBytesInComplete = false;
					}
					else {
						this.numBytesIn += Long.valueOf(metrics.getMetric(MetricNames.IO_NUM_BYTES_IN));
					}

					if (metrics.getMetric(MetricNames.IO_NUM_BYTES_OUT) == null){
						this.numBytesOutComplete = false;
					}
					else {
						this.numBytesOut += Long.valueOf(metrics.getMetric(MetricNames.IO_NUM_BYTES_OUT));
					}

					if (metrics.getMetric(MetricNames.IO_NUM_RECORDS_IN) == null){
						this.numRecordsInComplete = false;
					}
					else {
						this.numRecordsIn += Long.valueOf(metrics.getMetric(MetricNames.IO_NUM_RECORDS_IN));
					}

					if (metrics.getMetric(MetricNames.IO_NUM_RECORDS_OUT) == null){
						this.numRecordsOutComplete = false;
					}
					else {
						this.numRecordsOut += Long.valueOf(metrics.getMetric(MetricNames.IO_NUM_RECORDS_OUT));
					}
				}
				else {
					this.numBytesInComplete = false;
					this.numBytesOutComplete = false;
					this.numRecordsInComplete = false;
					this.numRecordsOutComplete = false;
				}
			}
		}
	}
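
The "complete" flags are what let the UI tell a real 0 apart from "not reported yet". Below is a minimal sketch of that idea for a single counter. The class IoCounterSketch and the use of a plain Map per subtask are my own stand-ins for illustration, not Flink classes; the key "numRecordsIn" is assumed to be the metric name.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the aggregation above; not Flink's actual class. */
final class IoCounterSketch {
    long numRecordsIn = 0;
    boolean numRecordsInComplete = true;

    /** Adds one subtask's metrics; a null map means nothing is known about the subtask. */
    void add(Map<String, String> subtaskMetrics) {
        String raw = subtaskMetrics == null ? null : subtaskMetrics.get("numRecordsIn");
        if (raw == null) {
            // Missing value: leave the sum untouched, but remember it is incomplete.
            numRecordsInComplete = false;
        } else {
            numRecordsIn += Long.parseLong(raw);
        }
    }

    /** Sums the counter over all subtasks of one vertex. */
    static IoCounterSketch aggregate(List<Map<String, String>> subtasks) {
        IoCounterSketch counts = new IoCounterSketch();
        for (Map<String, String> subtask : subtasks) {
            counts.add(subtask);
        }
        return counts;
    }
}
```

With two subtasks reporting "5" and "0", the sum is 5 and the flag stays true; if one subtask has no entry at all, the sum is still 5 but the flag flips to false.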

Let's look at the MetricFetcher path first:

// refresh the metrics
fetcher.update();
// read the refreshed metrics from the store
MetricStore.ComponentMetricStore metrics = fetcher.getMetricStore()
			.getSubtaskMetricStore(jobID, taskID, attempt.getParallelSubtaskIndex());
if (metrics != null) {

The code above is the core step for obtaining the metrics.
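
The lookup is keyed by job ID, task (vertex) ID and subtask index, and the caller has to cope with a null result. A rough way to picture that is a nested map. MetricStoreSketch below is a hypothetical illustration of the lookup shape only; I am not claiming this is how Flink's MetricStore is implemented internally.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical three-level store: job -> task -> subtask -> (metric name -> value). */
final class MetricStoreSketch {
    private final Map<String, Map<String, Map<Integer, Map<String, String>>>> jobs = new HashMap<>();

    /** Records one metric value, creating the intermediate levels on demand. */
    void put(String jobId, String taskId, int subtask, String name, String value) {
        jobs.computeIfAbsent(jobId, k -> new HashMap<>())
            .computeIfAbsent(taskId, k -> new HashMap<>())
            .computeIfAbsent(subtask, k -> new HashMap<>())
            .put(name, value);
    }

    /** Returns null when any level is unknown, which the caller must check for. */
    Map<String, String> getSubtaskMetrics(String jobId, String taskId, int subtask) {
        Map<String, Map<Integer, Map<String, String>>> tasks = jobs.get(jobId);
        if (tasks == null) {
            return null;
        }
        Map<Integer, Map<String, String>> subtasks = tasks.get(taskId);
        if (subtasks == null) {
            return null;
        }
        return subtasks.get(subtask);
    }
}
```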

Start with update:

@Override
	public void update() {
		synchronized (this) {
			long currentTime = System.currentTimeMillis();
			if (currentTime - lastUpdateTime > updateInterval) {
				lastUpdateTime = currentTime;
                // core logic
				fetchMetrics();
			}
		}
	}


private void fetchMetrics() {
		LOG.debug("Start fetching metrics.");

		try {
            // cleanup step: drop the metrics of jobs that are no longer known
			Optional<T> optionalLeaderGateway = retriever.getNow();
			if (optionalLeaderGateway.isPresent()) {
				final T leaderGateway = optionalLeaderGateway.get();

				/*
				 * Remove all metrics that belong to a job that is not running and no longer archived.
				 */
				CompletableFuture<MultipleJobsDetails> jobDetailsFuture = leaderGateway.requestMultipleJobDetails(timeout);

				jobDetailsFuture.whenCompleteAsync(
					(MultipleJobsDetails jobDetails, Throwable throwable) -> {
						if (throwable != null) {
							LOG.debug("Fetching of JobDetails failed.", throwable);
						} else {
							ArrayList<String> toRetain = new ArrayList<>(jobDetails.getJobs().size());
							for (JobDetails job : jobDetails.getJobs()) {
								toRetain.add(job.getJobId().toString());
							}
							metrics.retainJobs(toRetain);
						}
					},
					executor);
                
                // get the Akka paths of the metric query services on the JobManager side (JobManagerRunner)
				CompletableFuture<Collection<String>> queryServiceAddressesFuture = leaderGateway.requestMetricQueryServiceAddresses(timeout);

				queryServiceAddressesFuture.whenCompleteAsync(
					(Collection<String> queryServiceAddresses, Throwable throwable) -> {
						if (throwable != null) {
							LOG.debug("Requesting paths for query services failed.", throwable);
						} else {
							for (String queryServiceAddress : queryServiceAddresses) {
                                // this method is what actually queries the metrics
								retrieveAndQueryMetrics(queryServiceAddress);
							}
						}
					},
					executor);
                
                // get the Akka paths of the metric query services of the TaskManagers (TaskManagerRunner)

				// TODO: Once the old code has been ditched, remove the explicit TaskManager query service discovery
				// TODO: and return it as part of requestMetricQueryServiceAddresses. Moreover, change the MetricStore such that
				// TODO: we don't have to explicitly retain the valid TaskManagers, e.g. letting it be a cache with expiry time
				CompletableFuture<Collection<Tuple2<ResourceID, String>>> taskManagerQueryServiceGatewaysFuture = leaderGateway
					.requestTaskManagerMetricQueryServiceAddresses(timeout);

				taskManagerQueryServiceGatewaysFuture.whenCompleteAsync(
					(Collection<Tuple2<ResourceID, String>> queryServiceGateways, Throwable throwable) -> {
						if (throwable != null) {
							LOG.debug("Requesting TaskManager's path for query services failed.", throwable);
						} else {
							List<String> taskManagersToRetain = queryServiceGateways
								.stream()
								.map(
									(Tuple2<ResourceID, String> tuple) -> {
										queryServiceRetriever.retrieveService(tuple.f1)
											.thenAcceptAsync(this::queryMetrics, executor);
										return tuple.f0.getResourceIdString();
									}
								).collect(Collectors.toList());

							metrics.retainTaskManagers(taskManagersToRetain);
						}
					},
					executor);
			}
		} catch (Exception e) {
			LOG.debug("Exception while fetching metrics.", e);
		}
	}
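
Two properties of update() are easy to miss: it is rate-limited by updateInterval, and fetchMetrics() only kicks off asynchronous requests, so the store read that follows update() may still see older data. The sketch below isolates just the rate-limiting gate. ThrottledFetcherSketch is a made-up class, and passing the clock in as a parameter is my own choice to make the behaviour easy to check; the real method reads System.currentTimeMillis().

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of the time gate in update(); not Flink code. */
final class ThrottledFetcherSketch {
    private final long updateIntervalMillis;
    private long lastUpdateTime = 0L;

    /** Counts how often the (stand-in) fetch was triggered. */
    final AtomicInteger fetchCount = new AtomicInteger();

    ThrottledFetcherSketch(long updateIntervalMillis) {
        this.updateIntervalMillis = updateIntervalMillis;
    }

    /** Triggers a fetch only if more than updateIntervalMillis passed since the last one. */
    synchronized void update(long nowMillis) {
        if (nowMillis - lastUpdateTime > updateIntervalMillis) {
            lastUpdateTime = nowMillis;
            fetchCount.incrementAndGet(); // stands in for fetchMetrics()
        }
    }
}
```

With a 10 second interval, calls at 20 000 ms, 25 000 ms and 30 001 ms trigger two fetches: the middle call falls inside the interval and is skipped.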

Next, look at the retrieveAndQueryMetrics method:

/**
	 * Retrieves and queries the specified QueryServiceGateway.
	 *
	 * @param queryServiceAddress specifying the QueryServiceGateway
	 */
	private void retrieveAndQueryMetrics(String queryServiceAddress) {
		LOG.debug("Retrieve metric query service gateway for {}", queryServiceAddress);

        // try to connect through the RpcService and obtain a MetricQueryServiceGateway
		final CompletableFuture<MetricQueryServiceGateway> queryServiceGatewayFuture = queryServiceRetriever.retrieveService(queryServiceAddress);

		queryServiceGatewayFuture.whenCompleteAsync(
			(MetricQueryServiceGateway queryServiceGateway, Throwable t) -> {
				if (t != null) {
					LOG.debug("Could not retrieve QueryServiceGateway.", t);
				} else {
                    // once connected to the MetricQueryServiceGateway, run the query
					queryMetrics(queryServiceGateway);
				}
			},
			executor);
	}

	/**
	 * Query the metrics from the given QueryServiceGateway.
	 *
	 * @param queryServiceGateway to query for metrics
	 */
	private void queryMetrics(final MetricQueryServiceGateway queryServiceGateway) {
		LOG.debug("Query metrics for {}.", queryServiceGateway.getAddress());

		queryServiceGateway
        // call the query interface
			.queryMetrics(timeout)
			.whenCompleteAsync(
				(MetricDumpSerialization.MetricSerializationResult result, Throwable t) -> {
					if (t != null) {
						LOG.debug("Fetching metrics failed.", t);
					} else {
						metrics.addAll(deserializer.deserialize(result));
					}
				},
				executor);
	}
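
The two methods above form a small asynchronous pipeline: resolve a gateway, query it, and merge the result into the store, with any failure only logged. Here is a self-contained sketch of that shape. The Gateway interface and AsyncQuerySketch are hypothetical, and I chain the two steps with thenComposeAsync instead of the nested whenCompleteAsync callbacks the original uses, so treat it as an equivalent-in-spirit illustration rather than the actual code.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

/** Hypothetical illustration of "retrieve gateway, then query, then store". */
final class AsyncQuerySketch {
    /** Made-up gateway: returns a dump of metric name -> value. */
    interface Gateway {
        CompletableFuture<Map<String, String>> queryMetrics();
    }

    /** Stand-in for the metric store that queries write into. */
    final Map<String, String> store = new ConcurrentHashMap<>();

    /** Resolves the gateway, queries it and merges the dump; failures leave the store as it was. */
    CompletableFuture<Void> retrieveAndQuery(CompletableFuture<Gateway> gatewayFuture, Executor executor) {
        return gatewayFuture
            .thenComposeAsync(Gateway::queryMetrics, executor)
            .handleAsync((dump, failure) -> {
                if (failure == null) {
                    store.putAll(dump);
                }
                // On failure nothing is written; the original only logs at debug level.
                return null;
            }, executor);
    }
}
```

Because errors are swallowed, an unreachable TaskManager does not fail the REST request; its metrics are simply absent or stale in the store.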

Next, look at what the query interface actually does:

    @Override
	public CompletableFuture<MetricDumpSerialization.MetricSerializationResult> queryMetrics(Time timeout) {
		return callAsync(() -> enforceSizeLimit(serializer.serialize(counters, gauges, histograms, meters)), timeout);
	}

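The enforceSizeLimit method caps the size of the serialized result; roughly speaking, whatever exceeds the limit is replaced with an empty payload. The core logic is still in serializer.serialize: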
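
To make the size cap concrete, here is a small sketch of how I read it: the serialized metric categories are checked in order against a running total, and a category that would push the total past the limit is replaced by an empty array. SizeLimitSketch is a hypothetical class, and the cumulative, per-category behaviour is my assumption about the 1.9 code, so please verify it against the source.

```java
/** Hypothetical illustration of a cumulative size cap over serialized categories. */
final class SizeLimitSketch {
    /**
     * Keeps each category while the running total stays within limitBytes;
     * a category that would exceed the limit is replaced by an empty array.
     */
    static byte[][] enforce(byte[][] categories, long limitBytes) {
        byte[][] result = new byte[categories.length][];
        long total = 0;
        for (int i = 0; i < categories.length; i++) {
            if (total + categories[i].length > limitBytes) {
                result[i] = new byte[0]; // drop this category entirely
            } else {
                result[i] = categories[i];
                total += categories[i].length;
            }
        }
        return result;
    }
}
```

The practical consequence is that an oversized dump does not fail the query; some metrics just go missing from the response. The serialization step itself follows.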

public MetricSerializationResult serialize(
			Map<Counter, Tuple2<QueryScopeInfo, String>> counters,
			Map<Gauge<?>, Tuple2<QueryScopeInfo, String>> gauges,
			Map<Histogram, Tuple2<QueryScopeInfo, String>> histograms,
			Map<Meter, Tuple2<QueryScopeInfo, String>> meters) {

			countersBuffer.clear();
			int numCounters = 0;
			for (Map.Entry<Counter, Tuple2<QueryScopeInfo, String>> entry : counters.entrySet()) {
				try {
					serializeCounter(countersBuffer, entry.getValue().f0, entry.getValue().f1, entry.getKey());
					numCounters++;
				} catch (Exception e) {
					LOG.debug("Failed to serialize counter.", e);
				}
			}

			gaugesBuffer.clear();
			int numGauges = 0;
			for (Map.Entry<Gauge<?>, Tuple2<QueryScopeInfo, String>> entry : gauges.entrySet()) {
				try {
					serializeGauge(gaugesBuffer, entry.getValue().f0, entry.getValue().f1, entry.getKey());
					numGauges++;
				} catch (Exception e) {
					LOG.debug("Failed to serialize gauge.", e);
				}
			}

			histogramsBuffer.clear();
			int numHistograms = 0;
			for (Map.Entry<Histogram, Tuple2<QueryScopeInfo, String>> entry : histograms.entrySet()) {
				try {
					serializeHistogram(histogramsBuffer, entry.getValue().f0, entry.getValue().f1, entry.getKey());
					numHistograms++;
				} catch (Exception e) {
					LOG.debug("Failed to serialize histogram.", e);
				}
			}

			metersBuffer.clear();
			int numMeters = 0;
			for (Map.Entry<Meter, Tuple2<QueryScopeInfo, String>> entry : meters.entrySet()) {
				try {
					serializeMeter(metersBuffer, entry.getValue().f0, entry.getValue().f1, entry.getKey());
					numMeters++;
				} catch (Exception e) {
					LOG.debug("Failed to serialize meter.", e);
				}
			}

			return new MetricSerializationResult(
				countersBuffer.getCopyOfBuffer(),
				gaugesBuffer.getCopyOfBuffer(),
				metersBuffer.getCopyOfBuffer(),
				histogramsBuffer.getCopyOfBuffer(),
				numCounters,
				numGauges,
				numMeters,
				numHistograms);
		}

The core logic simply calls the corresponding serialization function for each of the four metric types: gauge, counter, meter and histogram.

Take serializeGauge as an example:

private static void serializeGauge(DataOutput out, QueryScopeInfo info, String name, Gauge<?> gauge) throws IOException {
		Object value = gauge.getValue();
		if (value == null) {
			throw new NullPointerException("Value returned by gauge " + name + " was null.");
		}
		String stringValue = value.toString();
		if (stringValue == null) {
			throw new NullPointerException("toString() of the value returned by gauge " + name + " returned null.");
		}

		serializeMetricInfo(out, info);
		out.writeUTF(name);
		out.writeUTF(stringValue);
	}
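
Note that the gauge's value is turned into a string before it goes on the wire, and that the gauge is evaluated at serialization time. The sketch below shows a write/read round trip with the same writeUTF calls. GaugeWireSketch is a hypothetical helper, a Supplier stands in for Flink's Gauge, and the scope information written by serializeMetricInfo is left out on purpose.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

/** Hypothetical round trip of one gauge as two UTF strings (name, value). */
final class GaugeWireSketch {
    /** Evaluates the gauge now and writes its name and string value. */
    static byte[] write(String name, Supplier<?> gauge) {
        Object value = gauge.get();
        if (value == null) {
            throw new NullPointerException("Value returned by gauge " + name + " was null.");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            out.writeUTF(value.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back {name, value}; the value stays a String on the receiving side. */
    static String[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            String value = in.readUTF();
            return new String[] {name, value};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since the value is captured when write runs, later changes to the underlying gauge do not affect bytes that were already produced; the reader only ever sees a snapshot.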

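At its core this just calls the gauge's getValue() to obtain the metric value, and that is where the query path ends. As for how the metrics above get there in the first place: they are added through the register method of MetricRegistry.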

Summary

To sum up, the overall flow looks like this:

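REST endpoint
 -> through the dispatcher, obtain the Akka actor paths of the flink-metrics services running in each JVM process (TaskManager, JobManager)
 -> call the RPC service behind each Akka path
 -> in each JVM process, the RPC service calls the value accessor of every metric type, computes the values and serializes them into one result
 -> MetricFetcher (deserializes the result and adds it to the MetricStore)
 -> fill in the response and return it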