Setting up a PySpark environment on Linux

Scala (Linux build): https://downloads.lightbend.com/scala/2.11.0/scala-2.11.0.tgz
Spark (works on both Linux and Windows): https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
Hadoop (works on both Linux and Windows): https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz

Install Spark:
tar -zxvf ./spark-2.4.5-bin-hadoop2.7.tgz -C ./spark
export SPARK_HOME=/home/service/spark-2.4.5-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

Install Hadoop:
tar -zxvf ./hadoop-2.7.7.tar.gz -C ./hadoop
export HADOOP_HOME=/home/service/hadoop-2.7.7
export PATH=$HADOOP_HOME/bin:$PATH

Install Scala:
tar -zxvf ./scala-2.11.0.tgz -C ./scala
export SCALA_HOME=/home/service/scala-2.11.0
export PATH=$SCALA_HOME/bin:$PATH
source ~/.bashrc
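
A quick way to confirm the three variables were actually exported in the current shell is a standard-library check (a minimal sketch; the variable names match the exports above):

import os

# Each line should print the install directory exported in ~/.bashrc;
# "<not set>" means the export is missing or the shell was not re-sourced.
for var in ("SPARK_HOME", "HADOOP_HOME", "SCALA_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))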

Install PySpark:
pip install pyspark
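
To verify the installation, starting a local SparkSession and running a trivial job is enough (a minimal sketch; the app name and the numbers are arbitrary, and a working JDK is assumed):

from pyspark.sql import SparkSession

# Start a local SparkSession; pip-installed pyspark ships the Spark jars it needs.
spark = SparkSession.builder.master("local[*]").appName("install-check").getOrCreate()

# Trivial job: sum the integers 0..99 on the local executor threads.
print(spark.sparkContext.parallelize(range(100)).sum())  # expected: 4950

spark.stop()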

References:

https://www.cnblogs.com/traditional/p/11297049.html

https://juejin.im/post/5cd16c00e51d453a51433062

Problems encountered with spark-submit on Linux:

In --master local mode, conf.get returns None for the driver- and executor-related Spark properties (a sketch follows below).
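
For illustration, a minimal sketch of that behaviour (the keys are the standard spark.driver.* / spark.executor.* property names; they typically stay unset in local mode unless configured in spark-defaults.conf, on the spark-submit command line, or via the builder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("conf-check").getOrCreate()
conf = spark.sparkContext.getConf()

# Neither property was set anywhere, so SparkConf.get falls back to None.
print(conf.get("spark.executor.memory"))  # None
print(conf.get("spark.driver.memory"))    # None

spark.stop()
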
Errors when submitting a spark-submit job in --master yarn mode:
Problem 1: Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
Fix on Linux (the spark-env.sh script only takes effect on Linux; it does nothing on Windows): edit the $SPARK_HOME/conf/spark-env.sh file and add
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
Details: https://blog.csdn.net/woai8339/article/details/80596441

Problem 2: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Fix: add the following line to $SPARK_HOME/conf/spark-env.sh:
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native
Details: https://www.bbsmax.com/A/Vx5M9WlL5N/

Problem 3: WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Likely cause: the sc instance is not created in the main program; importing submodules that each create their own sc leaves multiple SparkContext instances in the process, which triggers this warning.
Fix: create the sc instance in the main program file and pass it into the submodules (see the sketch after the link below).
Details: https://blog.csdn.net/weixin_41629917/article/details/83190258
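
A minimal sketch of that fix (the helper name word_lengths and the layout are made up for illustration): the function that would normally live in an imported submodule takes sc as a parameter, and only the entry point ever constructs a SparkContext:

from pyspark import SparkContext

# This helper would live in the imported submodule: it receives sc
# from the caller instead of creating its own SparkContext.
def word_lengths(sc, words):
    return sc.parallelize(words).map(len).collect()

# Entry point: the only place a SparkContext is created.
sc = SparkContext(master="local[*]", appName="single-sc-demo")
print(word_lengths(sc, ["spark", "yarn", "pyspark"]))  # [5, 4, 7]
sc.stop()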

Original post: https://www.cnblogs.com/luckyboylch/p/12567710.html