Setting up Hadoop and Hive and using them to process a CSV file with tens of millions of rows

Setup

1. Environment prerequisites

1. 64-bit Linux
2. JDK 1.7+

2. Network configuration

    1. Edit the network interface configuration

        vi /etc/sysconfig/network-scripts/ifcfg-eth0

    2. Restart the network service

        service network restart
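As a rough illustration of what the interface file might contain for a static address, the sketch below prints a candidate ifcfg-eth0 body. Only the address 192.168.64.10 appears in this article (in the hosts mapping); the helper name print_ifcfg and the default NETMASK and GATEWAY values are assumptions and need to match your own VMware network.

```shell
# Sketch: print a static-IP ifcfg-eth0 body.
# NETMASK and GATEWAY defaults are assumed values, not taken from the article.
print_ifcfg() {
  ip="$1"; netmask="${2:-255.255.255.0}"; gateway="${3:-192.168.64.2}"
  cat <<EOF
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=$ip
NETMASK=$netmask
GATEWAY=$gateway
EOF
}

# Usage (review the output before copying it into the real file):
#   print_ifcfg 192.168.64.10
```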

3. Disable the firewall

# Stop the firewall for the current session
service iptables stop
# Keep the firewall disabled across reboots
chkconfig iptables off

4. Change the hostname

vi /etc/sysconfig/network
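On CentOS 6 style systems this file holds a HOSTNAME= line. The edit can also be scripted; the following is a minimal sketch, where set_hostname is a hypothetical helper and the name hadoop is taken from the hosts mapping used in this article.

```shell
# Sketch: set HOSTNAME in a CentOS 6 style network file.
# $1 = path to the file (normally /etc/sysconfig/network), $2 = new host name.
set_hostname() {
  file="$1"; name="$2"
  if grep -q '^HOSTNAME=' "$file"; then
    # Replace the existing entry in place (GNU sed).
    sed -i "s/^HOSTNAME=.*/HOSTNAME=$name/" "$file"
  else
    # No entry yet: append one.
    echo "HOSTNAME=$name" >> "$file"
  fi
}

# Usage: set_hostname /etc/sysconfig/network hadoop
```

The file is read at boot, so the new name takes effect after a restart; running hostname hadoop changes it for the current session only.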

5. Hostname mapping

Windows
C:\Windows\System32\drivers\etc\hosts
192.168.64.10 hadoop
Linux
vi /etc/hosts
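If you prefer not to edit the Linux hosts file by hand, a small helper can append the mapping only when it is missing. This is a sketch; add_host is a hypothetical function name, and the IP and host name are the ones shown above.

```shell
# Sketch: add "IP NAME" to a hosts file only if NAME is not already mapped.
# $1 = hosts file, $2 = IP address, $3 = host name.
add_host() {
  file="$1"; ip="$2"; name="$3"
  # A name counts as mapped when it appears as a whole whitespace-delimited field.
  if grep -Eq "[[:space:]]$name([[:space:]]|\$)" "$file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Usage: add_host /etc/hosts 192.168.64.10 hadoop
```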

6. Disable SELinux

SELinux is the security module that ships with Red Hat systems (CentOS behaves the same as RedHat here)
vi /etc/selinux/config
SELINUX=disabled
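The same change can be made without opening an editor. The sketch below assumes GNU sed; disable_selinux is a hypothetical helper name.

```shell
# Sketch: switch the SELINUX= line to "disabled" in the given config file.
# The pattern requires "=" right after SELINUX, so SELINUXTYPE= is left alone.
disable_selinux() {
  file="${1:-/etc/selinux/config}"
  sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$file"
}

# Usage: disable_selinux /etc/selinux/config
```

The config file is only read at boot, so a restart is needed for the setting to apply; setenforce 0 switches SELinux to permissive mode until then.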

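7. Standardize the installation directories

# Original installation packages
/opt/models
# Installation directories
1. If the software has a default installation directory, install it there (/usr)
2. If an installation directory must be specified, put everything under /opt/install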

8. Install the JDK (omitted).

9. Install Hadoop

# Hadoop single-node environment
1. Upload the hadoop 2.5.2 archive to /opt/models
2. Extract it into /opt/install
3. Edit the configuration files under etc/hadoop
    1. hadoop-env.sh
	export JAVA_HOME=/usr/java/jdk1.7.0_71
    2. core-site.xml
	<property>
		<name>fs.default.name</name>
		<value>hdfs://hadoop1.baizhiedu.com:8020</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/opt/install/hadoop-2.5.2/data/tmp</value>
	</property>
    3. hdfs-site.xml
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
    4. yarn-site.xml
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
    5. mapred-site.xml
	<!-- rename mapred-site.xml.template to mapred-site.xml -->
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
4. Format the NameNode
The paths below are relative to the Hadoop installation directory /opt/install/hadoop-2.5.2
bin/hdfs namenode -format
5. Start the Hadoop daemons
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
6. Verify the result
ps -ef | grep java
jps # a tool shipped with the JDK, like javac, java, and javadoc
http://hadoop:50070
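The property fragments in step 3 are shown on their own, but in each *-site.xml file they have to sit inside a single <configuration> root element. As a sketch of the complete file layout, the hypothetical helper write_site_xml below generates such a file from name/value pairs. It does no XML escaping, which is fine for the plain values used in this article.

```shell
# Sketch: wrap name/value pairs in the <configuration> root element
# that Hadoop's *-site.xml files require.
# Usage: write_site_xml FILE NAME VALUE [NAME VALUE ...]
write_site_xml() {
  out="$1"; shift
  {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    # Emit one <property> element per NAME VALUE pair.
    while [ "$#" -ge 2 ]; do
      printf '\t<property>\n\t\t<name>%s</name>\n\t\t<value>%s</value>\n\t</property>\n' "$1" "$2"
      shift 2
    done
    echo '</configuration>'
  } > "$out"
}

# Example, run from the Hadoop installation directory:
#   write_site_xml etc/hadoop/hdfs-site.xml dfs.replication 1
```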

Installing Hive

1. Set up Hadoop
2. Install Hive: extract the archive
3. Configuration
hive-env.sh
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/opt/install/hadoop-2.5.2
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/opt/install/apache-hive-0.13.1-bin/conf
4. Create /tmp and /user/hive/warehouse in HDFS
5. Start Hive
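Step 4 can be carried out with the HDFS shell. The sketch below creates both directories and makes them group-writable, which is what the Hive getting-started guide suggests. HDFS_CMD and create_hive_dirs are illustrative names, not Hadoop settings; the default assumes you run it from the Hadoop installation directory.

```shell
# Sketch: create the HDFS directories Hive expects and make them group-writable.
# HDFS_CMD is an illustrative variable; it defaults to bin/hdfs relative to
# the Hadoop installation directory.
HDFS_CMD="${HDFS_CMD:-bin/hdfs}"

create_hive_dirs() {
  $HDFS_CMD dfs -mkdir -p /tmp /user/hive/warehouse &&
  $HDFS_CMD dfs -chmod g+w /tmp /user/hive/warehouse
}

# Usage (with the NameNode and DataNode running):
#   cd /opt/install/hadoop-2.5.2 && create_hive_dirs
```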

Basic Hive usage

1. Hive databases
show databases;
create database if not exists lhc;
use lhc;
2. Table operations
show tables;
create table if not exists t_user(
id int,
name string
)row format delimited fields terminated by '\t';
3. Insert data: load a file from the local operating system into a Hive table
load data local inpath '/root/data3' into table t_user;
4. SQL statements
select * from t_user;
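The file loaded in step 3 has to match the table definition: one row per line, with id and name separated by a tab. A minimal sketch for producing such a test file follows; make_sample is a hypothetical helper and the rows are made-up sample values.

```shell
# Sketch: write a small tab-separated file matching t_user(id int, name string).
# printf reuses its format for each id/name pair, so this writes three rows.
make_sample() {
  printf '%s\t%s\n' 1 alice 2 bob 3 carol > "$1"
}

# Usage: make_sample /root/data3
```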

Importing a CSV file into Hive for statistical analysis

1. A CSV file is comma-separated; first create the Hive table
create table if not exists t_db2(
eventname string,
visitdate string,
role string,
grade string,
school string,
area string,
user_id string,
sessionid string,
eqid string,
visittime string,
ipaddr string,
ipfirst string,
ipsecond string,
dbkcnj string,
subject string,
kcname string,
xh string
)row format delimited fields terminated by ',';

2. Load the CSV file into the Hive table
LOAD DATA LOCAL INPATH '/opt/install/db.csv' OVERWRITE INTO TABLE t_db2;
3. Write an HQL query to compute the statistics
select dbkcnj,subject,area,role,count(*) as pv from t_db2 group by dbkcnj,subject,area,role;
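Here db.csv is video-on-demand data produced by our company's system: 40 million raw records, about 10 GB in size. Importing it into MySQL and querying it there was very slow, whereas after loading it into Hive the query took just over 100 seconds. (I ran this on a local VMware virtual machine with 4 GB of RAM on a 9th-generation i5 CPU.)
For data at the scale of tens of millions of rows, Hive processes this kind of query more efficiently than MySQL. Someone has tested this; for details see: https://www.cnblogs.com/hsia2017/p/10765272.html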
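One caveat about the table definition above: fields terminated by ',' splits on every comma and does not understand quoted fields, so a row with an embedded comma would shift its columns. Whether db.csv contains such rows is not stated here, so a quick check before the LOAD DATA step may be worthwhile. The sketch below counts rows whose field count differs from the 17 columns of t_db2; count_bad_rows is a hypothetical helper.

```shell
# Sketch: count the rows of a comma-separated file whose number of fields
# differs from the expected column count (17 for t_db2 by default).
# $1 = CSV file, $2 = expected number of columns (optional).
count_bad_rows() {
  awk -F',' -v n="${2:-17}" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Usage: count_bad_rows /opt/install/db.csv
```

A result of 0 means every line splits into exactly 17 fields; a header line with 17 names would still pass this check and end up as a data row in the table.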