Scala operations on HDFS

1. Listing an HDFS directory: hadoop dfs -ls path is effectively a command-line shorthand for FileSystem.listStatus.
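For example, a minimal sketch of the listStatus call behind hadoop dfs -ls (the directory path here is purely illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Print one line per entry, much like `hadoop dfs -ls /some/dir`.
fs.listStatus(new Path("/some/dir")).foreach(status => println(status.getPath))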

The helper below uses listStatus to locate the most recent complete checkpoint under a Spark checkpoint directory.

//The checkpoint directory is: /user/dmspark/accumulate/checkpoint
//e.g. /user/dmspark/accumulate/checkpoint/0519936a-5bff-4ecf-a6f0-3854e5952ec9/rdd-689/part-00099

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Shared FileSystem handle, kept at object scope so both helpers can use it.
private val fs = FileSystem.get(new Configuration())

private def getLatestCheckpoint(checkpointDir: String): Option[String] = {
  val latestCheckpointDir = findLatestSubDir(new Path(checkpointDir))
  var latestCheckpointRDDDir = findLatestSubDir(latestCheckpointDir)
  // If the latest checkpoint is incomplete, fall back to the previous
  // checkpoint written 5 minutes earlier. NumPartitionsOfReducedRDD is
  // defined elsewhere; the +1 presumably covers the extra metadata file
  // (e.g. _partitioner) that Spark writes next to the part-files.
  if (latestCheckpointRDDDir != null &&
      fs.listStatus(latestCheckpointRDDDir).length != NumPartitionsOfReducedRDD + 1)
    latestCheckpointRDDDir = findLatestSubDir(latestCheckpointDir, 1)
  if (latestCheckpointRDDDir != null) Some(latestCheckpointRDDDir.toString) else None
}

// Returns the (index+1)-th most recently modified entry under path,
// or null when path is missing or has too few entries.
def findLatestSubDir(path: Path, index: Int = 0): Path = {
  if (path == null || !fs.exists(path))
    return null
  val fileStatus = fs.listStatus(path).sortBy(_.getModificationTime).reverse
  if (fileStatus.length <= index) null else fileStatus(index).getPath
}
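A hypothetical call site, assuming the two helpers and NumPartitionsOfReducedRDD live in the same object:

getLatestCheckpoint("/user/dmspark/accumulate/checkpoint") match {
  case Some(dir) => println(s"restoring from $dir")       // a usable checkpoint exists
  case None      => println("no usable checkpoint found") // start from scratch
}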

2. Reading an HDFS file into an in-memory List

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}
  import scala.collection.mutable.ListBuffer

  def readToList(file: Path): List[String] = {
    val uRI = "hdfs://localhost:8021"
    val configuration = new Configuration()
    val hdfs: FileSystem = FileSystem.get(URI.create(uRI), configuration)
    val in: FSDataInputStream = hdfs.open(file)
    val reader: BufferedReader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
    val list = new ListBuffer[String]

    // In Scala an assignment evaluates to Unit, so the Java idiom
    // `(line = reader.readLine) != null` is always true; read and test
    // the value explicitly instead, and close the reader when done.
    try {
      var line = reader.readLine()
      while (line != null) {
        list += line
        line = reader.readLine()
      }
    } finally {
      reader.close()
    }
    list.toList
  }
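As a side note, scala.io.Source can replace the manual BufferedReader loop entirely, since getLines stops at end-of-file on its own. A minimal sketch under the same NameNode URI (readToListViaSource is a hypothetical name, not part of the code above):

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.io.Source

  def readToListViaSource(file: Path): List[String] = {
    val hdfs = FileSystem.get(URI.create("hdfs://localhost:8021"), new Configuration())
    val in = hdfs.open(file)
    // toList materializes all lines before the stream is closed.
    try Source.fromInputStream(in, "UTF-8").getLines().toList
    finally in.close()
  }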
 

  def main(args: Array[String]): Unit = {
    val path = new Path("/user/dmspark/product3_source_conf/catecode_info.txt")
    println(readToList(path).size)
  }

 
