[spark] Checkpoint Source Code Analysis

Preface

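In Spark applications we often end up with an RDD that is expensive to compute and is only obtained after a long series of complex transformations, that is, an RDD with a long lineage chain or with wide dependencies. In that situation it is worth considering persisting the RDD.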

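cache can also persist data to disk, but it simply writes each partition's output to disk as it is computed. checkpoint works differently: once the logical job has finished, if any RDD needs to be checkpointed, a separate job is launched to write the checkpoint. The RDD is therefore computed twice, so it is advisable to cache the RDD in memory before checkpointing it; the checkpoint job can then write the cached data straight to disk.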

How checkpoint is implemented

Before checkpoint can be used, a directory for storing the checkpoint data must be set through SparkContext's setCheckpointDir method. Let's look at that method:


def setCheckpointDir(directory: String) {
    if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
      logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
        s"must not be on the local filesystem. Directory '$directory' " +
        "appears to be on the local filesystem.")
    }
    checkpointDir = Option(directory).map { dir =>
      val path = new Path(dir, UUID.randomUUID().toString)
      val fs = path.getFileSystem(hadoopConfiguration)
      fs.mkdirs(path)
      fs.getFileStatus(path).getPath.toString
    }
  }

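In non-local mode the directory should be on a non-local filesystem such as HDFS; if it appears to be local, the code only logs a warning. A subdirectory with a unique, UUID-based name is then created under that directory.
Calling rdd.checkpoint() marks the RDD for checkpointing: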


def checkpoint(): Unit = RDDCheckpointData.synchronized { 
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }

The method first checks whether checkpointDir has been set, and then whether checkpointData.isEmpty holds. checkpointData is defined as follows:


private[spark] var checkpointData: Option[RDDCheckpointData[T]] = None

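There is exactly one RDDCheckpointData per RDD, and it holds the checkpoint-related information for that RDD. Here checkpointData is instantiated with new ReliableRDDCheckpointData(this); ReliableRDDCheckpointData is a subclass of RDDCheckpointData. At this point the RDD has only been marked for checkpointing; no checkpoint has actually been performed.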
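The mark-then-materialize behaviour can be observed from user code. The following is a minimal sketch, assuming spark-core is on the classpath and a local master is acceptable; the object name CheckpointUsage, the application name, and the temporary directory are illustrative choices, not part of the Spark source.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; only the SparkContext and RDD calls are real Spark API.
object CheckpointUsage {
  // Returns rdd.isCheckpointed as seen before and after the first action.
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-usage")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a local temp directory is fine because the master is local.
      sc.setCheckpointDir(Files.createTempDirectory("ckpt").toString)

      // Cache first so the separate checkpoint job can reuse the cached partitions.
      val rdd = sc.parallelize(1 to 100).map(_ * 2).cache()

      rdd.checkpoint()               // only marks the RDD
      val before = rdd.isCheckpointed

      rdd.count()                    // the action runs the job, then the checkpoint job
      val after = rdd.isCheckpointed

      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Calling checkpoint() before any action has run on the RDD matters here, because the checkpoint is only written at the end of a job that uses the marked RDD.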

When the checkpoint is written

When an action is invoked, it triggers a call to SparkContext's runJob:


def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

We can see that rdd.doCheckpoint() is executed once the job has finished. This is where the RDDs marked earlier are actually checkpointed. Let's continue with that method:


private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          if (checkpointAllMarkedAncestors) {
            dependencies.foreach(_.rdd.doCheckpoint())
          }
          checkpointData.get.checkpoint()
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }

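The method first checks whether this RDD has already been through checkpoint processing; only if it has not does it proceed, setting doCheckpointCalled to true. Because checkpointData was initialized earlier, checkpointData.isDefined also holds. If the marked parents of an RDD that has checkpointData should be checkpointed as well (checkpointAllMarkedAncestors), the parents must be checkpointed first, because once an RDD has checkpointed itself, its parents are cut out of its lineage. If the RDD itself is not marked, the call is simply passed on to its parents. Next we follow checkpointData.get.checkpoint():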
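The traversal order can be illustrated with a small stand-alone model. This is only a sketch of the control flow in doCheckpoint, with no Spark dependency; ToyNode, ToyTraversal, and the log buffer are hypothetical names introduced for illustration, and the "checkpoint" is reduced to recording the node's name.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an RDD: a name, its parents, and whether it was marked.
final class ToyNode(val name: String, val parents: Seq[ToyNode], val marked: Boolean) {
  private var doCheckpointCalled = false

  // Mirrors the branching of RDD.doCheckpoint; appends to `log` instead of writing files.
  def doCheckpoint(checkpointAllMarkedAncestors: Boolean, log: ArrayBuffer[String]): Unit = {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (marked) {
        if (checkpointAllMarkedAncestors) {
          // Parents go first: afterwards this node's lineage would be truncated.
          parents.foreach(_.doCheckpoint(checkpointAllMarkedAncestors, log))
        }
        log += name
      } else {
        // Not marked: just pass the call up the lineage.
        parents.foreach(_.doCheckpoint(checkpointAllMarkedAncestors, log))
      }
    }
  }
}

object ToyTraversal {
  // Lineage a -> b -> c, where a and c are marked and b is not.
  def order(checkpointAllMarkedAncestors: Boolean): List[String] = {
    val a = new ToyNode("a", Nil, marked = true)
    val b = new ToyNode("b", Seq(a), marked = false)
    val c = new ToyNode("c", Seq(b), marked = true)
    val log = ArrayBuffer.empty[String]
    c.doCheckpoint(checkpointAllMarkedAncestors, log)
    log.toList
  }
}
```

In this model ToyTraversal.order(false) yields List("c"), because the traversal stops at the first marked node, while ToyTraversal.order(true) yields List("a", "c"), with the ancestor handled before the descendant.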


final def checkpoint(): Unit = {
    // Guard against multiple threads checkpointing the same RDD by
    // atomically flipping the state of this RDDCheckpointData
    RDDCheckpointData.synchronized {
      if (cpState == Initialized) {
        cpState = CheckpointingInProgress
      } else {
        return
      }
    }

    val newRDD = doCheckpoint()

    // Update our state and truncate the RDD lineage
    RDDCheckpointData.synchronized {
      cpRDD = Some(newRDD)
      cpState = Checkpointed
      rdd.markCheckpointed()
    }
  }

The checkpoint state is first changed to CheckpointingInProgress, then doCheckpoint is executed and returns a newRDD. Let's see what doCheckpoint does:


protected override def doCheckpoint(): CheckpointRDD[T] = {
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)
    if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }
    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }

ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir) writes an RDD to multiple checkpoint files and returns a ReliableCheckpointRDD that represents it:


def writeRDDToCheckpointDirectory[T: ClassTag](
      originalRDD: RDD[T],
      checkpointDir: String,
      blockSize: Int = -1): ReliableCheckpointRDD[T] = {
    val sc = originalRDD.sparkContext
    // Create the output path for the checkpoint
    val checkpointDirPath = new Path(checkpointDir)
    val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
    if (!fs.mkdirs(checkpointDirPath)) {
      throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
    }
    // Save to file, and reload it as an RDD
    val broadcastedConf = sc.broadcast(
      new SerializableConfiguration(sc.hadoopConfiguration))
    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    sc.runJob(originalRDD,
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)
    if (originalRDD.partitioner.nonEmpty) {
      writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
    }
    val newRDD = new ReliableCheckpointRDD[T](
      sc, checkpointDirPath.toString, originalRDD.partitioner)
    if (newRDD.partitions.length != originalRDD.partitions.length) {
      throw new SparkException(
        s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
          s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
    }
    newRDD
  }

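After some setup (creating the checkpoint directory and broadcasting the Hadoop configuration), a job is launched to write the checkpoint files. The write itself is implemented by ReliableCheckpointRDD.writePartitionToCheckpointFile. Once the checkpoint has been written, a new ReliableCheckpointRDD instance is created and returned. Here is the implementation of writePartitionToCheckpointFile: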


def writePartitionToCheckpointFile[T: ClassTag](
      path: String,
      broadcastedConf: Broadcast[SerializableConfiguration],
      blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
    val env = SparkEnv.get
    val outputDir = new Path(path)
    val fs = outputDir.getFileSystem(broadcastedConf.value.value)

    val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
    val finalOutputPath = new Path(outputDir, finalOutputName)
    val tempOutputPath =
      new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

    if (fs.exists(tempOutputPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
    }
    val bufferSize = env.conf.getInt("spark.buffer.size", 65536)

    val fileOutputStream = if (blockSize < 0) {
      fs.create(tempOutputPath, false, bufferSize)
    } else {
      // This is mainly for testing purpose
      fs.create(tempOutputPath, false, bufferSize,
        fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
    }
    val serializer = env.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeAll(iterator)
    } {
      serializeStream.close()
    }

    if (!fs.rename(tempOutputPath, finalOutputPath)) {
      if (!fs.exists(finalOutputPath)) {
        logInfo(s"Deleting tempOutputPath $tempOutputPath")
        fs.delete(tempOutputPath, false)
        throw new IOException("Checkpoint failed: failed to save output of task: " +
          s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
      } else {
        // Some other copy of this task must've finished before us and renamed it
        logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
        if (!fs.delete(tempOutputPath, false)) {
          logWarning(s"Error deleting ${tempOutputPath}")
        }
      }
    }
  }

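This code is an ordinary file write against HDFS: the data of one RDD partition is serialized into a temporary file, which is then renamed to its final name under the checkpoint directory. If the rename fails because another attempt of the same task has already produced the final file, the temporary file is simply deleted.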
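The write-to-temporary-file-then-rename pattern can be sketched without Hadoop. The example below is an illustration only: it uses java.nio.file on the local filesystem as a stand-in for the Hadoop FileSystem API, writes plain text lines instead of using Spark's serializer, and ToyAtomicWrite and writePartition are hypothetical names.

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Path}

// Hypothetical local-filesystem analogue of writePartitionToCheckpointFile.
object ToyAtomicWrite {
  def writePartition(dir: Path, partitionId: Int, attempt: Int, records: Seq[String]): Path = {
    val finalName = f"part-$partitionId%05d"
    val finalPath = dir.resolve(finalName)
    // Each attempt writes to its own hidden temporary file.
    val tempPath = dir.resolve(s".$finalName-attempt-$attempt")

    if (Files.exists(tempPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempPath already exists")
    }
    Files.write(tempPath, records.mkString("\n").getBytes(StandardCharsets.UTF_8))

    try {
      // Without REPLACE_EXISTING the move fails if the final file is already there.
      Files.move(tempPath, finalPath)
    } catch {
      case _: FileAlreadyExistsException =>
        // Another attempt finished first: keep its output, drop ours.
        Files.deleteIfExists(tempPath)
    }
    finalPath
  }
}
```

Readers of the directory never observe a half-written part file under its final name, which is the point of the pattern; a second attempt for the same partition leaves the first attempt's output untouched.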

doCheckpoint() is now complete. It returned a new RDD, a ReliableCheckpointRDD, whose reference is assigned to cpRDD; the checkpoint state is then set to Checkpointed. What does rdd.markCheckpointed() do?


private[spark] def markCheckpointed(): Unit = {
    clearDependencies()
    partitions_ = null
    deps = null    // Forget the constructor argument for dependencies too
  }

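Finally, it clears all of the RDD's dependencies, along with its cached partitions, which is what truncates the lineage.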

Summary of writing a checkpoint

Initialized
marked for checkpointing
checkpointing in progress
checkpointed
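The state transitions can be modelled as a small guarded state machine. The sketch below is a simplified stand-alone analogue of RDDCheckpointData.checkpoint(), not the Spark class itself: ToyState and ToyCheckpointData are hypothetical names, the `write` function stands in for the checkpoint job, and the "marked" step is represented simply by constructing the object in the Initialized state.

```scala
// Hypothetical state enumeration, modelled on the states used in checkpoint().
object ToyState extends Enumeration {
  val Initialized, CheckpointingInProgress, Checkpointed = Value
}

// `write` performs the (expensive) save and returns where the data was stored.
final class ToyCheckpointData(write: () => String) {
  private var state: ToyState.Value = ToyState.Initialized
  private var location: Option[String] = None

  def checkpoint(): Unit = {
    // Flip the state atomically so that only one caller proceeds.
    val proceed = synchronized {
      if (state == ToyState.Initialized) {
        state = ToyState.CheckpointingInProgress
        true
      } else {
        false
      }
    }
    if (proceed) {
      // The slow work happens outside the lock.
      val savedAt = write()
      synchronized {
        location = Some(savedAt)
        state = ToyState.Checkpointed
      }
    }
  }

  def isCheckpointed: Boolean = synchronized(state == ToyState.Checkpointed)

  def checkpointLocation: Option[String] = synchronized(location)
}
```

Calling checkpoint() a second time is a no-op in this model, since the state is no longer Initialized, so the write runs at most once.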

When the checkpoint is read

When the data of a partition is needed, it is obtained through rdd.iterator(), which computes that partition of the RDD. Let's look at RDD's iterator() implementation:


final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

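If the RDD has no storage level, or the data is not found in the cache, Spark goes on to check whether the RDD has been checkpointed. isCheckpointedAndMaterialized is the flag that indicates a successful checkpoint: cpState = Checkpointed.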


private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
  {
    if (isCheckpointedAndMaterialized) {
      firstParent[T].iterator(split, context)
    } else {
      compute(split, context)
    }
  }

If the RDD has been checkpointed successfully, the iterator() of its parent RDD is used directly, which is CheckpointRDD.iterator(); otherwise the RDD's own compute method is called.


final def dependencies: Seq[Dependency[_]] = {
    checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
      if (dependencies_ == null) {
        dependencies_ = getDependencies
      }
      dependencies_
    }
  }

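When an RDD's dependencies are requested, Spark first tries to obtain them from checkpointRDD. If that succeeds, it returns the ReliableCheckpointRDD wrapped in a OneToOneDependency; otherwise it returns the real dependencies.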
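The read path and the lineage truncation can be put together in a toy model. This is a sketch under simplifying assumptions, not Spark code: there are no partitions or tasks, data is a plain Seq[Int], the checkpoint is kept in memory rather than in files, and every class name below (ToyLineageRDD, ToySavedRDD, ToySourceRDD, ToyMappedRDD) is hypothetical.

```scala
// Hypothetical single-partition RDD analogue.
abstract class ToyLineageRDD {
  protected def getDependencies: Seq[ToyLineageRDD]
  protected def compute(): Seq[Int]

  private var checkpointRDD: Option[ToyLineageRDD] = None

  // After a checkpoint, the only dependency is the saved copy.
  final def dependencies: Seq[ToyLineageRDD] =
    checkpointRDD.map(r => Seq(r)).getOrElse(getDependencies)

  // Analogue of computeOrReadCheckpoint: read the saved copy if there is one.
  final def iterator(): Seq[Int] =
    if (checkpointRDD.isDefined) dependencies.head.iterator() else compute()

  // Computes the data once more in order to save it, then swaps the lineage.
  final def checkpoint(): Unit = {
    val saved = iterator()
    checkpointRDD = Some(new ToySavedRDD(saved))
  }
}

// Stands in for the RDD that reads the checkpoint files back.
final class ToySavedRDD(data: Seq[Int]) extends ToyLineageRDD {
  protected def getDependencies: Seq[ToyLineageRDD] = Nil
  protected def compute(): Seq[Int] = data
}

// A source that counts how often it is recomputed.
final class ToySourceRDD(data: Seq[Int]) extends ToyLineageRDD {
  var computeCount = 0
  protected def getDependencies: Seq[ToyLineageRDD] = Nil
  protected def compute(): Seq[Int] = { computeCount += 1; data }
}

final class ToyMappedRDD(parent: ToyLineageRDD, f: Int => Int) extends ToyLineageRDD {
  protected def getDependencies: Seq[ToyLineageRDD] = Seq(parent)
  protected def compute(): Seq[Int] = parent.iterator().map(f)
}
```

In this model, checkpointing a ToyMappedRDD recomputes its source once more (the toy has no cache), and every later iterator() call is served from the saved copy without touching the original parents.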
