DataX Tutorial (08) - Monitoring and Reporting

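Table of Contents

01 Introduction
02 Monitoring
- 2.1 ErrorRecordChecker
- 2.2 ErrorRecordChecker source code
- 2.3 When ErrorRecordChecker runs
03 Reporting
- 3.1 Reporting flow overview
- 3.2 How reporting runs
  - 3.2.1 Roles involved in reporting
  - 3.2.2 The reporting flow
- 3.3 When the information is written
- 3.4 How Channel receives communication data
04 Closing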

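01 Introduction

The earlier posts in this series covered DataX in some depth:

"DataX Tutorial (01) - Getting Started"
"DataX Tutorial (02) - Running the Full DataX Flow in IDEA (with every pitfall filled in)"
"DataX Tutorial (03) - Source Code Walkthrough (very detailed edition)"
"DataX Tutorial (04) - The Configuration Explained in Full"
"DataX Tutorial (05) - DataX Web in Practice"
"DataX Tutorial (06) - Tuning DataX"
"DataX Tutorial (07) - DataX Task Assignment and Execution Flow, Illustrated"

This post covers DataX's monitoring and reporting features.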

02 Monitoring

2.1 ErrorRecordChecker

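The JobContainer class holds a reference to ErrorRecordChecker, which is set up when JobContainer is initialized. ErrorRecordChecker is a monitoring class that checks whether a job has reached its error-record limit. It supports two kinds of limit, a record count (recordLimit) and a percentage (percentageLimit):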

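errorRecord: the number of error records must not exceed this limit; once it does, the job fails. For example, an errorRecord of 0 tolerates no dirty data at all;
errorPercentage: the proportion of error records, verified when the job finishes;
errorRecord takes priority over errorPercentage.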
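To make the two limits concrete, here is a minimal sketch that works on plain numbers. It is a simplified stand-in, not the DataX class: the name ErrorLimitSketch and its methods are hypothetical, and modeling "record takes priority" by ignoring the percentage limit whenever a record limit is present is an assumption of this sketch.

```java
// Simplified stand-in for the two error-limit checks; not the DataX class itself.
final class ErrorLimitSketch {
    private final Long recordLimit;       // null means no record limit is configured
    private final Double percentageLimit; // null means no percentage limit is configured

    ErrorLimitSketch(Long recordLimit, Double percentageLimit) {
        this.recordLimit = recordLimit;
        // Assumption: a configured record limit makes the percentage limit inactive.
        this.percentageLimit = (recordLimit != null) ? null : percentageLimit;
    }

    /** True when the absolute number of error records exceeds the record limit. */
    boolean exceedsRecordLimit(long errorRecords) {
        return recordLimit != null && errorRecords > recordLimit;
    }

    /** True when errorRecords / totalRecords exceeds the configured ratio. */
    boolean exceedsPercentageLimit(long totalRecords, long errorRecords) {
        if (percentageLimit == null || totalRecords <= 0) {
            return false;
        }
        return (double) errorRecords / (double) totalRecords > percentageLimit;
    }

    public static void main(String[] args) {
        ErrorLimitSketch strict = new ErrorLimitSketch(0L, null);
        System.out.println(strict.exceedsRecordLimit(1));          // true: no dirty data allowed

        ErrorLimitSketch ratio = new ErrorLimitSketch(null, 0.02);
        System.out.println(ratio.exceedsPercentageLimit(100, 3));  // true: 0.03 > 0.02
        System.out.println(ratio.exceedsPercentageLimit(100, 2));  // false: 0.02 is not > 0.02
    }
}
```

Both comparisons are strict, so a job that lands exactly on the limit still passes.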

2.2 ErrorRecordChecker source code

Pressing Control+O in ErrorRecordChecker lists its methods. The main ones are described briefly below.

① Constructor ErrorRecordChecker(Configuration configuration): reads the error-record limit errorLimit.record and the error-percentage limit errorLimit.percentage from the job configuration file job.json:

    public ErrorRecordChecker(Configuration configuration) {
        this(configuration.getLong(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_RECORD),
                configuration.getDouble(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_PERCENT));
    }

② checkRecordLimit(Communication communication): reads the total number of error records from communication and throws an exception if it exceeds the configured limit:

    public void checkRecordLimit(Communication communication) {
        if (recordLimit == null) {
            return;
        }

        long errorNumber = CommunicationTool.getTotalErrorRecords(communication);
        if (recordLimit < errorNumber) {
            LOG.debug(
                    String.format("Error-limit set to %d, error count check.", recordLimit));
            // Message: "Dirty-record count check failed: the limit is [%d] records,
            // but [%d] were actually captured."
            throw DataXException.asDataXException(
                    FrameworkErrorCode.PLUGIN_DIRTY_DATA_LIMIT_EXCEED,
                    String.format("脏数据条数检查不通过，限制是[%d]条，但实际上捕获了[%d]条.",
                            recordLimit, errorNumber));
        }
    }

③ checkPercentageLimit(Communication communication): computes the ratio of error records to total records read, both taken from communication, and throws an exception if the ratio exceeds the configured limit:

    public void checkPercentageLimit(Communication communication) {
        if (percentageLimit == null) {
            return;
        }
        LOG.debug(String.format(
                "Error-limit set to %f, error percent check.", percentageLimit));

        long total = CommunicationTool.getTotalReadRecords(communication);
        long error = CommunicationTool.getTotalErrorRecords(communication);

        if (total > 0 && ((double) error / (double) total) > percentageLimit) {
            // Message: "Dirty-record percentage check failed: the limit is [%f],
            // but [%f] was actually captured."
            throw DataXException.asDataXException(
                    FrameworkErrorCode.PLUGIN_DIRTY_DATA_LIMIT_EXCEED,
                    String.format("脏数据百分比检查不通过，限制是[%f]，但实际上捕获到[%f].",
                            percentageLimit, ((double) error / (double) total)));
        }
    }

That covers what ErrorRecordChecker does. Note that the check methods take a Communication object. This class holds the current job's status information and is covered later in this post.

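2.3 When ErrorRecordChecker runs

Control-clicking ErrorRecordChecker shows that it is used by JobContainer (the initialization covered above) and by the schedule method of AbstractScheduler. Tracing the callers of the check methods gives two call sites:

After JobContainer's schedule method returns, to check the error-record count for the whole job;
Inside AbstractScheduler's schedule method, which runs an infinite while loop that keeps collecting the job status. The interval between iterations is set by core.container.job.sleepInterval, that is, job.sleepInterval in core.json.

Finally, here is the code in AbstractScheduler's schedule method that does the real-time collection: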

    while (true) {
        /**
         * step 1: collect job stat
         * step 2: getReport info, then report it
         * step 3: errorLimit do check
         * step 4: dealSucceedStat();
         * step 5: dealKillingStat();
         * step 6: dealFailedStat();
         * step 7: refresh last job stat, and then sleep for next while
         *
         * above steps, some ones should report info to DS
         *
         */
        Communication nowJobContainerCommunication = this.containerCommunicator.collect();
        nowJobContainerCommunication.setTimestamp(System.currentTimeMillis());
        LOG.debug(nowJobContainerCommunication.toString());

        // reporting interval
        long now = System.currentTimeMillis();
        if (now - lastReportTimeStamp > jobReportIntervalInMillSec) {
            Communication reportCommunication = CommunicationTool
                    .getReportCommunication(nowJobContainerCommunication,
                            lastJobContainerCommunication, totalTasks);

            this.containerCommunicator.report(reportCommunication);
            lastReportTimeStamp = now;
            lastJobContainerCommunication = nowJobContainerCommunication;
        }

        errorLimit.checkRecordLimit(nowJobContainerCommunication);

        if (nowJobContainerCommunication.getState() == State.SUCCEEDED) {
            LOG.info("Scheduler accomplished all tasks.");
            break;
        }

        if (isJobKilling(this.getJobId())) {
            dealKillingStat(this.containerCommunicator, totalTasks);
        } else if (nowJobContainerCommunication.getState() == State.FAILED) {
            dealFailedStat(this.containerCommunicator,
                    nowJobContainerCommunication.getThrowable());
        }

        Thread.sleep(jobSleepIntervalInMillSec);
    }
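The shape of this loop (collect, check the limit, stop on success, otherwise sleep) can be tried out without DataX using the sketch below. Everything in it is hypothetical: Snapshot stands in for Communication, the Iterator stands in for containerCommunicator.collect(), and the periodic report step and the killing/failed branches are left out to keep it short.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class SchedulerLoopSketch {
    /** Hypothetical snapshot of job state, standing in for Communication. */
    static final class Snapshot {
        final long errorRecords;
        final boolean succeeded;

        Snapshot(long errorRecords, boolean succeeded) {
            this.errorRecords = errorRecords;
            this.succeeded = succeeded;
        }
    }

    /**
     * Polls until a snapshot reports success and returns the number of polls.
     * Throws IllegalStateException as soon as the error count exceeds recordLimit.
     */
    static int poll(Iterator<Snapshot> collector, long recordLimit, long sleepMillis) {
        int polls = 0;
        while (true) {
            Snapshot now = collector.next();          // collect the current state
            polls++;
            if (now.errorRecords > recordLimit) {     // error-limit check on every pass
                throw new IllegalStateException("error record limit exceeded: "
                        + now.errorRecords + " > " + recordLimit);
            }
            if (now.succeeded) {                      // all tasks finished
                return polls;
            }
            try {
                Thread.sleep(sleepMillis);            // wait before the next pass
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        List<Snapshot> run = Arrays.asList(
                new Snapshot(0, false), new Snapshot(0, false), new Snapshot(0, true));
        System.out.println(poll(run.iterator(), 0, 1)); // 3
    }
}
```

Because the limit is checked on every pass, a job that goes over its record limit fails during the run rather than only at the end.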

03 Reporting

3.1 Reporting flow overview

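Tip: the image may be large, so downloading it and opening it in an image viewer is recommended.

To start, here is a diagram of the call relationships among the Scheduler, the ErrorRecordChecker (error checker), and the Communicator; read it from top to bottom:

[Figure: call relationships among the Scheduler, ErrorRecordChecker, and Communicator]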

3.2 How reporting runs

3.2.1 Roles involved in reporting

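Reporting involves several important roles:

AbstractCommunicator, the abstract communicator class: coordinates the communication;
Communication, the information carrier: stores the information produced during communication, and is a singleton;
LocalTGCommunicationManager, the factory for information carriers: returns the singleton carrier for a given task ID;
CommunicationTool, the carrier utility class: the business-level logic of communication, mainly used to gather the current information and write it into the Communication carrier;
AbstractReporter, the information reporter: reports the communication information.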

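3.2.2 The reporting flow

A brief description of the flow:

First, a communicator object is created according to the configuration. There are two kinds, StandAloneJobContainerCommunicator and StandAloneTGContainerCommunicator. Once created, it is injected into the Scheduler, which now has a Communicator to work with;
The Communicator's collect method produces the communication carrier, Communication, which stores the job's information. ErrorRecordChecker reads the current job's information from this Communication;
Inside the Scheduler class, the Communicator's collect method is used to obtain the singleton Communication carrier (the lookup lives in LocalTGCommunicationManager, which defines a Map whose key is the task ID and whose value is the Communication carrier);
Once the Scheduler has the Communication carrier, it uses the CommunicationTool utility class to write the current job's status information into it;
Finally, the reporter reports the Communication information.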
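The map-based lookup described above can be sketched as follows. This is a much-reduced illustration under stated assumptions, not the DataX implementation: CommunicationRegistrySketch, Carrier, and the counter name used in main are all hypothetical, and the carrier holds nothing but named counters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class CommunicationRegistrySketch {
    /** Hypothetical carrier: a set of named counters and nothing else. */
    static final class Carrier {
        private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

        void increase(String key, long delta) {
            counters.computeIfAbsent(key, k -> new AtomicLong()).addAndGet(delta);
        }

        long get(String key) {
            AtomicLong value = counters.get(key);
            return value == null ? 0L : value.get();
        }
    }

    // key: ID, value: the single shared carrier for that ID
    private final Map<Integer, Carrier> byId = new ConcurrentHashMap<>();

    /** Returns the one shared carrier for this ID, creating it on first use. */
    Carrier get(int id) {
        return byId.computeIfAbsent(id, k -> new Carrier());
    }

    /** Sums one counter across every registered carrier, as a collect step might. */
    long total(String key) {
        long sum = 0;
        for (Carrier carrier : byId.values()) {
            sum += carrier.get(key);
        }
        return sum;
    }

    public static void main(String[] args) {
        CommunicationRegistrySketch registry = new CommunicationRegistrySketch();
        registry.get(1).increase("readSucceedRecords", 5);
        registry.get(2).increase("readSucceedRecords", 7);
        System.out.println(registry.get(1) == registry.get(1));   // true: same instance
        System.out.println(registry.total("readSucceedRecords")); // 12
    }
}
```

Writers and the collector never hand carriers to each other directly; they only share an ID, which is what lets the scheduler read state that other threads are still updating.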

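3.3 When the information is written

Sections 3.1 and 3.2 covered only the initialization of the Communicator class and the Communication carrier, plus the reporting flow. They did not show where content is actually written into Communication. The core of that lives in TaskGroupContainer: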

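① The Communication is looked up by task ID in the constructor of the inner class TaskExecutor;
② The Communication is injected into the Channel class, which does the actual recording (statistics and rate limiting both live here);
③ The Channel is in turn injected into BufferedRecordExchanger or BufferedRecordTransformerExchanger. These two Exchangers record what passes through RecordSender, RecordReceiver, and TransformerExchanger, that is, the three ETL modules.

This flow shows that Channel is what collects the ETL statistics, so the next section looks at some of its core methods.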
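The injection chain in steps ① to ③ can be pictured with a small sketch: a statistics holder is handed to a channel, the channel is handed to an exchanger, and every record that passes through updates the statistics. All names here are hypothetical (a plain Map stands in for Communication, String stands in for Record, and the string length stands in for the record's byte size), and the sketch is single-threaded.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class ChannelWiringSketch {
    /** Hypothetical channel: buffers records and counts them in the injected stats map. */
    static final class Channel {
        private final Queue<String> buffer = new ArrayDeque<>();
        private final Map<String, Long> stats; // stands in for Communication

        Channel(Map<String, Long> stats) {
            this.stats = stats;
        }

        void push(String record) {
            if (record == null) {
                throw new IllegalArgumentException("record must not be null");
            }
            buffer.add(record);
            stats.merge("readSucceedRecords", 1L, Long::sum);
            stats.merge("readSucceedBytes", (long) record.length(), Long::sum);
        }
    }

    /** Hypothetical exchanger: plugins talk to this, and it delegates to the channel. */
    static final class Exchanger {
        private final Channel channel;

        Exchanger(Channel channel) {
            this.channel = channel;
        }

        void sendToWriter(String record) {
            channel.push(record);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> stats = new HashMap<>();           // step 1: obtain the carrier
        Channel channel = new Channel(stats);                // step 2: inject it into the channel
        Exchanger exchanger = new Exchanger(channel);        // step 3: inject the channel
        exchanger.sendToWriter("abc");
        exchanger.sendToWriter("de");
        System.out.println(stats.get("readSucceedRecords")); // 2
        System.out.println(stats.get("readSucceedBytes"));   // 5
    }
}
```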

3.4 How Channel receives communication data

Channel has many methods (Control+O lists them). As an example, take push(final Record r):

    public void push(final Record r) {
        // Message: "record must not be null."
        Validate.notNull(r, "record不能为空.");
        this.doPush(r);
        this.statPush(1L, r.getByteSize());
    }

Stepping into statPush:

    private void statPush(long recordSize, long byteSize) {
        currentCommunication.increaseCounter(CommunicationTool.READ_SUCCEED_RECORDS,
                recordSize);
        currentCommunication.increaseCounter(CommunicationTool.READ_SUCCEED_BYTES,
                byteSize);
        // It is enough to record the wait counters on the read side: the write side (pull)
        // may be blocked at this point, but the read side can already see that counter.
        currentCommunication.setLongCounter(CommunicationTool.WAIT_READER_TIME, waitReaderTime);
        currentCommunication.setLongCounter(CommunicationTool.WAIT_WRITER_TIME, waitWriterTime);

        boolean isChannelByteSpeedLimit = (this.byteSpeed > 0);
        boolean isChannelRecordSpeedLimit = (this.recordSpeed > 0);
        if (!isChannelByteSpeedLimit && !isChannelRecordSpeedLimit) {
            return;
        }

        long lastTimestamp = lastCommunication.getTimestamp();
        long nowTimestamp = System.currentTimeMillis();
        long interval = nowTimestamp - lastTimestamp;
        if (interval - this.flowControlInterval >= 0) {
            long byteLimitSleepTime = 0;
            long recordLimitSleepTime = 0;
            if (isChannelByteSpeedLimit) {
                long currentByteSpeed = (CommunicationTool.getTotalReadBytes(currentCommunication) -
                        CommunicationTool.getTotalReadBytes(lastCommunication)) * 1000 / interval;
                if (currentByteSpeed > this.byteSpeed) {
                    // sleep time derived from the byte limit
                    byteLimitSleepTime = currentByteSpeed * interval / this.byteSpeed
                            - interval;
                }
            }

            if (isChannelRecordSpeedLimit) {
                long currentRecordSpeed = (CommunicationTool.getTotalReadRecords(currentCommunication) -
                        CommunicationTool.getTotalReadRecords(lastCommunication)) * 1000 / interval;
                if (currentRecordSpeed > this.recordSpeed) {
                    // sleep time derived from the record limit
                    recordLimitSleepTime = currentRecordSpeed * interval / this.recordSpeed
                            - interval;
                }
            }

            // sleep for the larger of the two
            long sleepTime = byteLimitSleepTime < recordLimitSleepTime ?
                    recordLimitSleepTime : byteLimitSleepTime;
            if (sleepTime > 0) {
                try {
                    Thread.sleep(sleepTime);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }

            lastCommunication.setLongCounter(CommunicationTool.READ_SUCCEED_BYTES,
                    currentCommunication.getLongCounter(CommunicationTool.READ_SUCCEED_BYTES));
            lastCommunication.setLongCounter(CommunicationTool.READ_FAILED_BYTES,
                    currentCommunication.getLongCounter(CommunicationTool.READ_FAILED_BYTES));
            lastCommunication.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS,
                    currentCommunication.getLongCounter(CommunicationTool.READ_SUCCEED_RECORDS));
            lastCommunication.setLongCounter(CommunicationTool.READ_FAILED_RECORDS,
                    currentCommunication.getLongCounter(CommunicationTool.READ_FAILED_RECORDS));
            lastCommunication.setTimestamp(nowTimestamp);
        }
    }

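Everything ends up in the Communication carrier. There are other methods as well, such as pushAll. Control-clicking through them traces the whole call chain: each plugin call triggers an Exchanger method, and the Exchanger calls into Channel, which records the data in the Communication carrier.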
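The rate-limiting arithmetic inside statPush is easier to follow with numbers plugged in. The sketch below isolates the two formulas from the code above into pure functions; the class and method names are hypothetical, and the input values in main are made up for illustration.

```java
final class FlowControlSketch {
    /** Observed speed in units per second, given the growth over an interval in milliseconds. */
    static long speedPerSecond(long delta, long intervalMillis) {
        return delta * 1000 / intervalMillis;
    }

    /**
     * Extra time to sleep so that the average speed over (interval + sleep) falls to the limit.
     * Returns 0 when no limit is set or the speed is already within the limit.
     */
    static long sleepMillis(long currentSpeed, long limit, long intervalMillis) {
        if (limit <= 0 || currentSpeed <= limit) {
            return 0;
        }
        return currentSpeed * intervalMillis / limit - intervalMillis;
    }

    public static void main(String[] args) {
        long intervalMillis = 1000;

        // 2,000,000 bytes in 1 s against a 1,000,000 B/s limit
        long byteSpeed = speedPerSecond(2_000_000, intervalMillis);
        long byteSleep = sleepMillis(byteSpeed, 1_000_000, intervalMillis);

        // 1,500 records in 1 s against a 1,000 records/s limit
        long recordSpeed = speedPerSecond(1_500, intervalMillis);
        long recordSleep = sleepMillis(recordSpeed, 1_000, intervalMillis);

        System.out.println(byteSleep);                        // 1000
        System.out.println(recordSleep);                      // 500
        System.out.println(Math.max(byteSleep, recordSleep)); // 1000: the larger value wins
    }
}
```

Running twice as fast as the byte limit for one second costs one extra second of sleep, which brings the average over the two seconds back to the limit.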

04 Closing

That wraps up DataX's monitoring and reporting features. Questions are welcome in the comments. Thanks for reading!
