Installing and Testing R, RHive, rJava, and RHadoop on CDH 5.5.6

Deployment machines
NameNode1
NameNode2
DataNode1
DataNode2
DataNode3

R installation directory
/usr/local/lib64/R
RStudio Server installation directory
/usr/lib/rstudio-server


R installation steps
1. Before compiling, make sure the following packages are installed; run this on every machine
yum install gcc-gfortran gcc gcc-c++ libXt-devel openssl-devel readline-devel glibc-headers
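
Before running yum on each machine, it can help to see which prerequisites are still missing. The following is a minimal sketch, assuming an RPM-based system where rpm -q is available; missing_pkgs is a hypothetical helper name introduced here for illustration.

```shell
# Sketch: print every package from the argument list that rpm does not
# report as installed. missing_pkgs is a hypothetical helper name.
missing_pkgs() {
    for pkg in "$@"; do
        # rpm -q exits non-zero when the package is not installed
        rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Package list taken from the yum command above.
missing_pkgs gcc-gfortran gcc gcc-c++ libXt-devel openssl-devel readline-devel glibc-headers
```

An empty result suggests every listed package is already installed on that machine.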

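2. Install R (on every node)
Unpack
tar -zxvf R-3.2.0.tar.gz
Compile
cd R-3.2.0
./configure --prefix=/usr/local --disable-nls --enable-R-shlib  # --disable-nls and --enable-R-shlib prepare for the RHive installation; they can be omitted if RHive will not be installed.
make
make install
readline-devel and libXt-devel are needed when compiling R. --enable-R-shlib builds R's shared library, which is also required for RStudio, so keep that option if RStudio will be installed.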

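3. Check the Java environment variables
RHadoop depends on the rJava package. Before installing rJava, confirm that the Java environment variables are configured, then let R set up its connection to the JVM.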
R CMD javareconf
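
The check that the Java environment is configured can be scripted. This is a sketch only, assuming that JAVA_HOME is the variable in use and that a usable JDK has an executable bin/java underneath it; check_java_home is a hypothetical helper name.

```shell
# Sketch: verify that a candidate JAVA_HOME directory looks usable.
# check_java_home is a hypothetical helper; it only inspects the directory
# passed as its first argument.
check_java_home() {
    if [ -z "$1" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$1/bin/java" ]; then
        echo "no executable java found under $1/bin"
        return 1
    fi
    echo "ok: $1"
}

# Intended to be run before R CMD javareconf.
check_java_home "${JAVA_HOME:-}" || echo "fix JAVA_HOME before running R CMD javareconf"
```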

4. Install rJava, RHive, and the other modules
R CMD INSTALL rJava_0.9-6.tar.gz
R CMD INSTALL Rserve_1.8-3.tar.gz
R CMD INSTALL RHive_2.0-0.10.tar.gz

5. Configure RHive
Create the RHive data directory (on the local file system, not HDFS)
Here it is /www/store/rhive/data
mkdir -p /www/store/rhive/data

Create an Rserv.conf file containing the line "remote enable" and save it in a directory of your choice
Here it is stored at /www/cloud/R/Rserv.conf
mkdir -p /www/cloud/R
vi /www/cloud/R/Rserv.conf
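
Since the same file is needed on every node, writing it non-interactively may be easier than editing it with vi each time. A minimal sketch, where write_rserv_conf is a hypothetical helper name and the only content written is the "remote enable" line described above:

```shell
# Sketch: create the parent directory and write the one-line Rserv.conf.
# write_rserv_conf is a hypothetical helper name.
write_rserv_conf() {
    mkdir -p "$(dirname "$1")" && printf 'remote enable\n' > "$1"
}

# Example call with the path used in these notes:
# write_rserv_conf /www/cloud/R/Rserv.conf
```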

Edit /etc/profile on every node, including the master, and add the environment variable
export RHIVE_DATA=/www/store/rhive/data

Upload all files from the lib directory under the R installation to /rhive/lib in HDFS (create the directory manually if it does not exist)
cd /usr/local/lib64/R/lib
hadoop fs -put ./* /rhive/lib
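
The "create the directory if it does not exist" step can be folded into the upload. The sketch below assumes the hadoop command is on the PATH; upload_r_libs is a hypothetical helper name. Note that a plain -put may refuse to overwrite files that already exist in the target directory.

```shell
# Sketch: make sure the HDFS target directory exists, then upload the libs.
# upload_r_libs is a hypothetical helper name wrapping two hadoop fs calls.
upload_r_libs() {
    local src="$1"
    local dest="$2"
    # -p creates missing parent directories and does not fail if dest exists
    hadoop fs -mkdir -p "$dest" || return 1
    hadoop fs -put "$src"/* "$dest"
}

# Example call with the paths used in these notes:
# upload_r_libs /usr/local/lib64/R/lib /rhive/lib
```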

6. Start the services
Run on all nodes and on the master
R CMD Rserve --RS-conf /www/cloud/R/Rserv.conf
telnet NameNode1 6311
telnet NameNode2 6311
telnet DataNode1 6311
telnet DataNode2 6311
telnet DataNode3 6311

If telnet is not available, install it with the following commands
yum install telnet-server   # install the telnet service
yum install telnet.*   # install the telnet client

Then telnet from the master node to every slave node; a response of Rsrv0103QAP1 means the connection succeeded
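
The banner check can also be done without telnet. This is a rough sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available; is_rserve_banner and probe_rserve are hypothetical helper names, and the 3-second timeout is an arbitrary choice.

```shell
# Sketch: recognise the Rserve greeting, which starts with "Rsrv"
# (for example Rsrv0103QAP1). Both function names are hypothetical.
is_rserve_banner() {
    case "$1" in
        Rsrv*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Read the first 12 bytes sent by host:port via bash's /dev/tcp.
# Inside bash -c, $0 is the host and $1 is the port.
probe_rserve() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1" && head -c 12 <&3' "$1" "$2" 2>/dev/null
}

# Example:
# is_rserve_banner "$(probe_rserve DataNode1 6311)" && echo "DataNode1 ok"
```

The example line can be repeated for each of the five hosts listed at the top of these notes.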

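Start the Hive remote service: RHive connects to the Hive server through Thrift, so the Thrift service must be running in the background. Start it from a Hive client as shown below; skip this step if it is already running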
nohup hive --service hiveserver &

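7. RHive test
library(RHive)
rhive.init
This was recorded as an unresolved initialization error, but the output below is not an error: typing rhive.init without parentheses only prints the function's source. Call rhive.init() to actually run the initialization.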
function (hiveHome = NULL, hiveLib = NULL, hadoopHome = NULL,
hadoopConf = NULL, hadoopLib = NULL, verbose = FALSE)
{
tryCatch({
.rhive.init(hiveHome = hiveHome, hiveLib = hiveLib, hadoopHome = hadoopHome,
hadoopConf = hadoopConf, hadoopLib = hadoopLib, verbose = verbose)
}, error = function(e) {
.handleErr(e)
})
}
<environment: namespace:RHive>

rhive.connect(host ="172.16.9.32")
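The connection prints the following message (recorded as unresolved). It is a warning rather than an error: the hiveServer2 argument was not supplied, so RHive falls back to hiveServer2=TRUE.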
Warning:
+----------------------------------------------------------+
+ / hiveServer2 argument has not been provided correctly. +
+ / RHive will use a default value: hiveServer2=TRUE. +
+----------------------------------------------------------+

Reading data nevertheless succeeded
d <- rhive.query('select * from src.v_mzdm limit 1000')

In RStudio Server the environment variables have to be set explicitly
Sys.setenv("HIVE_HOME"="/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hive")
Sys.setenv("HADOOP_HOME"="/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hadoop")


8. Install the RHadoop dependencies; run the commands in this order because the packages depend on one another
R CMD INSTALL Rcpp_0.12.17.tar.gz
R CMD INSTALL plyr_1.8.3.tar.gz
R CMD INSTALL stringi_1.2.3.tar.gz
R CMD INSTALL glue_1.2.0.tar.gz
R CMD INSTALL magrittr_1.5.tar.gz
R CMD INSTALL stringr_1.3.0.tar.gz
R CMD INSTALL reshape2_1.4.2.tar.gz
R CMD INSTALL iterators_1.0.9.tar.gz
R CMD INSTALL itertools_0.1-1.tar.gz
R CMD INSTALL digest_0.6.14.tar.gz
R CMD INSTALL RJSONIO_1.2-0.2.tar.gz
R CMD INSTALL functional_0.4.tar.gz
R CMD INSTALL bitops_1.0-5.tar.gz
R CMD INSTALL caTools_1.17.tar.gz
R CMD INSTALL Cairo_1.5-10.tar.gz   # first run: yum -y install cairo* libXt*

The dependency packages can be downloaded from https://cran.r-project.org/src/contrib/Archive/
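
Because the order matters, it may be worth stopping at the first failed install instead of letting later packages fail for a missing dependency. A minimal sketch, assuming the tarballs are in the current directory; install_in_order is a hypothetical helper name.

```shell
# Sketch: install tarballs in the order given and stop at the first failure.
# install_in_order is a hypothetical helper name.
install_in_order() {
    for tarball in "$@"; do
        echo "installing $tarball"
        R CMD INSTALL "$tarball" || { echo "failed: $tarball"; return 1; }
    done
}

# Example (first three packages of the list above, same order):
# install_in_order Rcpp_0.12.17.tar.gz plyr_1.8.3.tar.gz stringi_1.2.3.tar.gz
```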

9. Install the RHadoop packages

First add the following variables to the environment:

vi /etc/profile
export HADOOP_CMD=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/bin/hadoop
export HADOOP_STREAMING=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/jars/hadoop-streaming-2.6.0-cdh5.5.6.jar
export JAVA_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hadoop/lib/native
source /etc/profile   # apply the changes
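
A typo in one of these long parcel paths is easy to miss, so the variables can be checked before installing rhdfs and rmr2. This is a sketch under the assumption that each variable should name an existing file or directory; check_path_var is a hypothetical helper name.

```shell
# Sketch: given a variable NAME, confirm it is set and names an existing path.
# check_path_var is a hypothetical helper name.
check_path_var() {
    local value
    # indirect lookup: read the variable whose name is in $1
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "$1 is not set"
        return 1
    fi
    if [ ! -e "$value" ]; then
        echo "$1 points to a missing path: $value"
        return 1
    fi
    echo "$1 ok"
}

for name in HADOOP_CMD HADOOP_STREAMING JAVA_LIBRARY_PATH; do
    check_path_var "$name" || true
done
```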
Install
R CMD INSTALL rhdfs_1.0.8.tar.gz
R CMD INSTALL rmr2_3.3.1.tar.gz    # install on every node
Error (unresolved): according to posts online this is a build problem in rmr2_3.3.1.tar.gz
Copying libs into local build directory
find: `/usr/lib/hadoop': No such file or directory
ls: cannot access /opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hadoop/hadoop-*-core.jar: No such file or directory
ls: cannot access /opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hadoop/hadoop-core-*.jar: No such file or directory
Cannot find hadoop-core jar file in hadoop home
cp: cannot stat `build/dist/*': No such file or directory
can't build hbase IO classes, skipping
installing to /usr/local/lib64/R/library/rmr2/libs
** R
** byte-compile and prepare package for lazy loading
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘quickcheck’
Note: no visible binding for '<<-' assignment to '.Last'
Note: no visible binding for '<<-' assignment to '.Last'
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Warning: S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
* DONE (rmr2)

A fix suggested online
http://www.dataguru.cn/thread-135199-1-1.html


Then copy libhadoop.so and libhadoop.so.1.0.0 from the native directory to /usr/lib64:
cd /opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hadoop/lib/native
cp libhadoop.so /usr/lib64/
cp libhadoop.so.1.0.0 /usr/lib64/

Verify that rhdfs and rmr2 work

Test HDFS
library(rhdfs)
hdfs.init()
hdfs.ls("/")

rmr2 does not work properly; the installation error above was never resolved


#R
export R_HOME=/usr/local/lib64/R
export HADOOP_CMD=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/bin/hadoop
export HADOOP_STREAMING=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/jars/hadoop-streaming-2.6.0-cdh5.5.6.jar
export JAVA_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hadoop/lib/native
export RHIVE_DATA=/www/store/rhive/data
export HIVE_HOME=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hive
export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/lib/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin:$R_HOME/bin


RStudio Server installation steps
yum install --nogpgcheck rstudio-server-rhel-1.1.456-x86_64.rpm
cd /usr/lib/rstudio-server/bin
./rstudio-server start
Then open http://<ip>:8787 in a browser

System settings
There are two main configuration files; neither exists by default
/etc/rstudio/rserver.conf
/etc/rstudio/rsession.conf

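Set the port and the listening address:
vi /etc/rstudio/rserver.conf
www-port=8080 # port to listen on
www-address=127.0.0.1 # address the server listens on; the default is 0.0.0.0
Restart the server for the change to take effect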
rstudio-server restart
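
Since the configuration files do not exist by default, the settings can also be written from a script rather than with vi. The sketch below is one possible approach, not part of RStudio Server itself: set_conf_key is a hypothetical helper that replaces an existing key=value line or appends a new one, so running it twice does not duplicate the key.

```shell
# Sketch: set key=value in a simple config file, replacing any existing
# line for the same key. set_conf_key is a hypothetical helper name.
set_conf_key() {
    local file="$1"
    local key="$2"
    local value="$3"
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # keep every line except an existing assignment of this key
    grep -v "^${key}=" "$file" > "$file.tmp" || true
    echo "${key}=${value}" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example calls with the file and keys used in these notes:
# set_conf_key /etc/rstudio/rserver.conf www-port 8080
# rstudio-server restart
```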

Session configuration
vi /etc/rstudio/rsession.conf
session-timeout-minutes=30 # session timeout in minutes
r-cran-repos=http://ftp.ctex.org/mirrors/CRAN # CRAN repository

System management

rstudio-server start # start
rstudio-server stop # stop
rstudio-server restart # restart

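List the running R sessions
rstudio-server active-sessions
Suspend a running R session by PID
rstudio-server suspend-session <pid>
Suspend all running R sessions
rstudio-server suspend-all
Force-suspend running R sessions; this has the highest priority and takes effect immediately
rstudio-server force-suspend-session <pid>
rstudio-server force-suspend-all
Take RStudio Server offline temporarily: web access is blocked and users see a friendly notice
rstudio-server offline
Bring RStudio Server back online
rstudio-server online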

Only regular (non-root) users can log in
Create a user and password
useradd -d /home/r -m r
passwd r

Test
x <- c(1,2,5,7,9)
y <- c(2,4,7,8,10)
library(Cairo)
CairoPNG(file="pic_plot.png", width=640, height=480)
plot(x,y)
dev.off()  # close the device so that pic_plot.png is written

RStudio Server cannot read the environment variables, so set them yourself
Sys.setenv("HADOOP_CMD"="/opt/cloudera/parcels/CDH-5.5.6-1.cdh5.5.6.p0.2/bin/hadoop")

Read data from HDFS
library(rJava)
library(rhdfs)
hdfs.init()
hdfs.ls("/")
hdfs.cat("/user/kjxydata/src/V_MZDM/v_mzdm.txt")


rmr2 test
1. A MapReduce program in R:
small.ints = to.dfs(1:10)

mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
Error (possibly because rmr2 was not installed correctly)

from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")

Because MapReduce can only access the HDFS file system, the data must first be stored in HDFS with to.dfs. The MapReduce result is then retrieved from HDFS with the from.dfs function.


2. The rmr example is wordcount, which counts the words in a file
input<- '/user/kjxydata/src/V_MZDM/v_mzdm.txt'

wordcount = function(input, output = NULL, pattern = " "){
  # map: split each line on the pattern and emit (word, 1)
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  # reduce: sum the counts for each word
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = TRUE)
}

wordcount(input)
Error (possibly because rmr2 was not installed correctly)

from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")

Installation references
https://www.cnblogs.com/end/archive/2013/02/18/2916105.html
https://www.cnblogs.com/hunttown/p/5470652.html
https://www.cnblogs.com/hunttown/p/5470805.html
https://blog.csdn.net/youngqj/article/details/46819625

Original article: https://www.cnblogs.com/liquan-anran/p/9429376.html