Spark Study Notes

   Some basic problems I ran into while learning Spark, and how to approach them. (Thanks to everyone who shared their experience.)

  1. Read a CSV file and use its first line as the header
    rating = spark.read.option('header','true').csv('file:///home/twain/sparkTest/ml-latest-small/ratings.csv')
  2. A simple Spark create-and-run flow: counting words
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    # Create (or reuse) the session and log only warnings and errors
    spark = SparkSession.builder.appName('StructString').getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

    # Read lines of text from a socket (e.g. one opened with `nc -lk 9999`)
    lines = spark.readStream.format("socket") \
            .option("host", "localhost") \
            .option("port", 9999) \
            .load()

    # Split each line on spaces and emit one row per word
    words = lines.select(
            explode(
                split(lines.value, " ")
            ).alias("word")
    )

    # Running count per word
    wordCounts = words.groupBy("word").count()

    # Print the full result table to the console every 8 seconds
    query = wordCounts \
            .writeStream \
            .outputMode("complete") \
            .format("console") \
            .trigger(processingTime="8 seconds") \
            .start()

    # Block until the streaming query is stopped
    query.awaitTermination()
  3. Assign column names when reading a CSV
    rating = spark.read.csv('file:///home/twain/sparkTest/computerdata/data/ratings_Computers.csv').toDF('userId','itemId','score','create_time')
  4. Change the type of a column in a Spark DataFrame
    http://mini.eastday.com/mobile/191108004955918.html
  5. How ALS works in Spark machine learning (part 1)
    https://www.cnblogs.com/xiguage119/p/10813393.html
    https://blog.csdn.net/qq_37181642/article/details/102739855
  6. Spark ML: examples of saving and loading models
    https://blog.csdn.net/wangwei_5201314/article/details/89641800
  7. Summary of evaluation metrics in the Spark machine learning library
    https://blog.csdn.net/u011707542/article/details/77838588
  8. Data type conversion in Spark SQL
    https://blog.csdn.net/an1090239782/article/details/102541024
  9. A basic movie recommendation workflow with PySpark and the ALS algorithm
    https://blog.csdn.net/pysense/article/details/103880967
  10. How to set up a PyCharm + PySpark remote debugging environment
    http://www.manongjc.com/article/20160.html
  11. Getting started with the PySpark ML library
    https://www.jianshu.com/p/20456b512fa7
  12. Configure a static IP for Ubuntu in a VMware virtual machine (NAT mode)
    https://www.cnblogs.com/liermao12/p/6079471.html
  13. Use explode in Spark SQL to flatten nested JSON data
    https://blog.csdn.net/strongyoung88/article/details/52227568
  14. "ImportError: No module named ..." when PySpark calls a third-party Python library
    https://blog.csdn.net/lhx_xhl/article/details/85225968?depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-1&utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-1

Original post: https://www.cnblogs.com/emmm/p/13063911.html