Common Spark Examples

Filtering and cleaning data

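    // textFile() loads the input file as an RDD of lines
    val data = sc.textFile("/spark/seven.txt")

    // Drop empty lines, then keep only lines that contain GET or POST
    // together with the HTTP/1.0 marker that the substring calls below rely on
    val filtered = data.filter(_.nonEmpty).filter(line =>
      (line.contains("GET") || line.contains("POST")) && line.contains("HTTP/1.0"))

    // Map each line to a (request, 1) key-value pair
    val res = filtered.map(line => {
      if (line.contains("GET")) { // take the text from GET up to the protocol marker
        (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
      } else { // take the text from POST up to the protocol marker
        (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
      }
    }).reduceByKey(_ + _) // sum the counts for each request

    // collect() is the action that triggers execution
    res.collect()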
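
The extraction logic can be tried without a cluster. The following is a minimal sketch using plain Scala collections. The log lines and the helper name `extractRequest` are assumptions made for illustration, since the post does not show the contents of `/spark/seven.txt`.

```scala
// Hypothetical access-log lines; the post does not show its input file.
val sampleLines = Seq(
  """10.0.0.1 - - [01/Jan/2020:10:00:00 +0800] "GET /a.htm HTTP/1.0" 200 512""",
  """10.0.0.2 - - [01/Jan/2020:10:00:01 +0800] "GET /a.htm HTTP/1.0" 200 512""",
  """10.0.0.3 - - [01/Jan/2020:10:00:02 +0800] "POST /b.htm HTTP/1.0" 302 0""",
  """10.0.0.4 - - [01/Jan/2020:10:00:03 +0800] "GET /c.htm HTTP/1.1" 200 64""",
  ""
)

// Hypothetical helper: returns the text from the method name up to the
// "HTTP/1.0" marker, or None when either part is missing.
def extractRequest(line: String): Option[String] = {
  val method = if (line.contains("GET")) "GET" else "POST"
  val start = line.indexOf(method)
  val end = line.indexOf("HTTP/1.0")
  if (start >= 0 && end > start) Some(line.substring(start, end).trim) else None
}

// Local stand-in for map(... -> 1).reduceByKey(_ + _): count hits per request.
val counts: Map[String, Int] =
  sampleLines
    .flatMap(line => extractRequest(line))
    .groupBy(identity)
    .map { case (request, hits) => request -> hits.size }

counts.foreach { case (request, n) => println(request + " : " + n) }
```

Returning an `Option` keeps malformed lines (empty lines, or a different protocol version) out of the result instead of letting `substring` throw.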

Finding the maximum temperature for each year

Raw data

0067011990999991950051507004888888889999999N9+00001+9999999999999999999999

0067011990999991950051512004888888889999999N9+00221+9999999999999999999999

0067011990999991950051518004888888889999999N9-00111+9999999999999999999999

0067011990999991949032412004888888889999999N9+01111+9999999999999999999999

0067011990999991950032418004888888880500001N9+00001+9999999999999999999999

0067011990999991950051507004888888880500001N9+00781+9999999999999999999999

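Data format (positions are 0-based string indices, matching the substring calls in the code):

Indices 15 to 18 hold the year.

Indices 45 to 49 hold the temperature: a sign (+ for above zero, - for below zero) followed by four digits. A value of 9999 marks an invalid reading and must be discarded.

Index 50 holds the quality code, which must be one of 0, 1, 4, 5, or 9.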
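
To make the layout concrete, here is a small sketch that pulls the three fields out of a single record with the same offsets the post's `substring` calls use. The `Reading` case class and the `parseReading` and `isValid` helpers are hypothetical names introduced only for this example.

```scala
import scala.util.Try

// Hypothetical container for the three fields described above.
final case class Reading(year: String, temperature: Int, quality: String)

// Returns None for lines that are too short or whose temperature is not numeric.
def parseReading(line: String): Option[Reading] =
  if (line.length <= 50) None
  else {
    val rawTemp = line.substring(45, 50) // sign plus four digits
    Try(rawTemp.stripPrefix("+").toInt).toOption.map { temperature =>
      Reading(line.substring(15, 19), temperature, line.substring(50, 51))
    }
  }

// A reading counts only if it is not the 9999 sentinel and has an accepted quality code.
def isValid(r: Reading): Boolean =
  r.temperature != 9999 && r.quality.matches("[01459]")

// Two of the sample records shown above.
val positive = "0067011990999991950051512004888888889999999N9+00221+9999999999999999999999"
val negative = "0067011990999991950051518004888888889999999N9-00111+9999999999999999999999"

Seq(positive, negative).flatMap(l => parseReading(l)).filter(isValid).foreach(println)
```

Stripping the leading `+` before calling `toInt` handles both signs in one expression, because a leading `-` is already valid input for `toInt`.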

val one = sc.textFile("/tmp/hadoop/one")

// Read the signed temperature at indices 45 to 49, skipping a leading '+'
val parseTemp: String => Int = line => {
  if (line.charAt(45) == '+') line.substring(46, 50).toInt
  else line.substring(45, 50).toInt
}

val yearAndTemp = one.filter(line => {
      // Skip lines too short to hold every field (for example blank lines),
      // then keep only valid readings with an accepted quality code
      line.length > 50 && {
        val quality = line.substring(50, 51)
        parseTemp(line) != 9999 && quality.matches("[01459]")
      }
    }).map(line => (line.substring(15, 19), parseTemp(line)))

    // For each year, keep the larger of the two temperatures
    val res = yearAndTemp.reduceByKey(
      (x, y) => if (x > y) x else y
    )
    res.collect.foreach(x => println("year : " + x._1 + ", max : " + x._2))
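
The reduce step can be illustrated locally as well. The sketch below mimics what `reduceByKey` does on a single machine by folding pairs into a `Map`. The helper `reduceByKeyLocal` is a hypothetical name, not part of Spark, and the input pairs are what the six sample records would yield under the offsets described earlier.

```scala
// Hypothetical local stand-in for RDD.reduceByKey: merges values that share a key
// with the supplied function. Spark does the same per partition before shuffling.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
  }

// (year, temperature) pairs assumed to come from the six sample records.
val samplePairs = Seq(
  ("1950", 0), ("1950", 22), ("1950", -11),
  ("1949", 111), ("1950", 0), ("1950", 78)
)

// Same merge function as in the Spark job: keep the larger value.
val maxByYear = reduceByKeyLocal(samplePairs)((x, y) => if (x > y) x else y)

maxByYear.toSeq.sortBy(_._1).foreach { case (year, max) =>
  println("year : " + year + ", max : " + max)
}
// prints:
// year : 1949, max : 111
// year : 1950, max : 78
```

Because taking the maximum is associative and commutative, the order in which pairs are merged does not affect the result, which is the property `reduceByKey` relies on when it combines values across partitions.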
Original article: https://www.cnblogs.com/xiatian21/p/14941370.html