Sqoop: Import/Export Between MySQL, Hive, and HDFS

Preface: environment setup. This walkthrough uses the CDH builds of Hadoop, Hive, and Sqoop together with MySQL.

Download
   hadoop-2.5.0-cdh5.3.6.tar.gz
   hive-0.13.1-cdh5.3.6.tar.gz
   sqoop-1.4.5-cdh5.3.6.tar.gz

Configure Hadoop
   *-env.sh (3 files: hadoop-env.sh, mapred-env.sh, yarn-env.sh) -- set JAVA_HOME
   core-site.xml
    fs.defaultFS
    hadoop.tmp.dir
   hdfs-site.xml
    dfs.replication
   mapred-site.xml
    mapreduce.framework.name -- yarn
    mapreduce.jobhistory.address          # port 10020
    mapreduce.jobhistory.webapp.address   # port 19888
   yarn-site.xml
    yarn.resourcemanager.hostname
    yarn.nodemanager.aux-services -- mapreduce_shuffle
    yarn.log-aggregation-enable -- true
    yarn.log-aggregation.retain-seconds -- 108600
   slaves
    host addresses of the worker nodes
   PS: format the NameNode, then start HDFS and YARN
   $ bin/hdfs dfs -mkdir /tmp
   $ bin/hdfs dfs -mkdir -p /user/hive/warehouse
   $ bin/hdfs dfs -chmod g+w /tmp
   $ bin/hdfs dfs -chmod g+w /user/hive/warehouse

Configure Hive
   hive-env.sh
    HADOOP_HOME=/opt/cdh-5.6.3/hadoop-2.5.0-cdh5.3.6
    export HIVE_CONF_DIR=/opt/cdh-5.6.3/hive-0.13.1-cdh5.3.6/conf
   hive-log4j.properties
    hive.log.threshold=ALL
    hive.root.logger=INFO,DRFA
    hive.log.dir=/opt/cdh-5.6.3/hive-0.13.1-cdh5.3.6/logs
    hive.log.file=hive.log
   hive-site.xml   # deploy MySQL beforehand
    javax.jdo.option.ConnectionURL -- jdbc:mysql://hadoop09-linux-01.ibeifeng.com:3306/chd_metastore?createDatabaseIfNotExist=true
    javax.jdo.option.ConnectionDriverName -- com.mysql.jdbc.Driver
    javax.jdo.option.ConnectionUserName
    javax.jdo.option.ConnectionPassword
    hive.cli.print.header -- true
    hive.cli.print.current.db -- true
    hive.fetch.task.conversion -- more
   PS: in the hive directory
   $ mkdir logs
   Copy the prepared MySQL JDBC driver jar into lib
   Start with $ bin/hive

Configure Sqoop
   sqoop-env.sh
    export HADOOP_COMMON_HOME=/opt/cdh-5.6.3/hadoop-2.5.0-cdh5.3.6
    export HADOOP_MAPRED_HOME=/opt/cdh-5.6.3/hadoop-2.5.0-cdh5.3.6
    export HIVE_HOME=/opt/cdh-5.6.3/hive-0.13.1-cdh5.3.6
   Copy the prepared MySQL JDBC driver jar into lib
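Before loading any data, it is worth confirming that Sqoop can actually reach MySQL through the JDBC driver. A minimal check, assuming the same 10.0.0.108 / root / root credentials used in the examples below:

   $ bin/sqoop version
   $ bin/sqoop list-databases \
    --connect jdbc:mysql://10.0.0.108:3306 \
    --username root \
    --password root
   # should list the databases on the MySQL server (student appears once created below)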

I. Prepare the data

# Create a database and table in my MySQL instance, then insert a few rows
  mysql> create database if not exists student default character set utf8 collate utf8_general_ci;
  mysql> use student;
  mysql> create table if not exists stu_info( id int(10) primary key not null auto_increment, name varchar(20) not null) default character set utf8 collate utf8_general_ci;
  mysql> insert into stu_info(name) values("李建");
  mysql> insert into stu_info(name) values("张明");
  mysql> insert into stu_info(name) values("赵兴");
  mysql> insert into stu_info(name) values("陈琦");
  mysql> insert into stu_info(name) values("刘铭");
  mysql> select id,name from stu_info;
  +----+--------+
  | id | name   |
  +----+--------+
  |  1 | 李建   |
  |  2 | 张明   |
  |  3 | 赵兴   |
  |  4 | 陈琦   |
  |  5 | 刘铭   |
  +----+--------+
  5 rows in set (0.00 sec)

II. Import the MySQL table into HDFS with Sqoop

bin/sqoop import \
--connect \
jdbc:mysql://10.0.0.108:3306/student \
--username root \
--password root \
--table stu_info \
--target-dir /student \
--num-mappers 1 \
--fields-terminated-by '\t'
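If the import succeeds, each mapper writes a part file under --target-dir (a single part-m-00000 here, since --num-mappers is 1). A quick sanity check:

   $ bin/hdfs dfs -ls /student
   $ bin/hdfs dfs -cat /student/part-m-00000
   # expected: one tab-separated line per row, e.g. 1<TAB>李建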

III. Import the MySQL table into Hive with Sqoop

Method 1:
1. Create the database and table in Hive
    create database if not exists student;
    create table if not exists stu_info(id int,name string) row format delimited fields terminated by '\t';
2. bin/sqoop import \
    --connect jdbc:mysql://hadoop09-linux-01.ibeifeng.com:3306/student \
    --username root --password root \
    --table stu_info \
    --delete-target-dir \
    --target-dir /user/hive/warehouse/student.db/stu_info \
    --hive-import \
    --hive-database student \
    --hive-table stu_info \
    --hive-overwrite \
    --num-mappers 1 \
    --fields-terminated-by '\t'
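After the job finishes, the rows should be queryable from the Hive CLI:

    hive (student)> select id,name from stu_info;
    # should return the same 5 rows inserted into MySQL above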
Method 2:
1. Use sqoop create-hive-table. The target Hive database must be created by hand first; otherwise the table lands under the default database path.
2. bin/sqoop create-hive-table \
    --connect jdbc:mysql://10.0.0.108:3306/student \
    --username root --password root \
    --table stu_info \
    --hive-table student.stu_info
3. bin/sqoop import --connect jdbc:mysql://10.0.0.108:3306/student \
    --username root --password root \
    --table stu_info \
    --hive-import \
    --hive-database student \
    --hive-table stu_info \
    --hive-overwrite \
    --num-mappers 1 \
    --fields-terminated-by '\t' \
    --delete-target-dir \
    --target-dir /user/hive/warehouse/student.db/stu_info
4. Querying the table in Hive now returns all NULLs,
    yet the files on HDFS look fine, so Hive simply cannot parse the rows: it is a delimiter problem.
    (create-hive-table built the table with Hive's default delimiter, while the import above wrote tab-separated text.)
    Use --fields-terminated-by '\001' instead.  # \001 is Ctrl+A, Hive's default delimiter; the MySQL-side default is ','
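Putting that together, a corrected Method 2 import is the step-3 command with only the delimiter changed:

    bin/sqoop import --connect jdbc:mysql://10.0.0.108:3306/student \
    --username root --password root \
    --table stu_info \
    --hive-import \
    --hive-database student \
    --hive-table stu_info \
    --hive-overwrite \
    --num-mappers 1 \
    --fields-terminated-by '\001' \
    --delete-target-dir \
    --target-dir /user/hive/warehouse/student.db/stu_info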

IV. Export data from HDFS or Hive to a MySQL table

1. Prepare the database and table in MySQL
2. I reuse the student database here:
    create table if not exists stu_info_export like stu_info;
3. The delimiter must match the one used by the data on HDFS / in the Hive table:
    bin/sqoop export \
    --connect jdbc:mysql://10.0.0.108/student \
    --username root --password root \
    --table stu_info_export \
    --export-dir /user/hive/warehouse/student.db/stu_info \
    --num-mappers 1 \
    --input-fields-terminated-by '\001'
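If the export succeeds, the rows show up in MySQL:

    mysql> select id,name from stu_info_export;
    # should match the 5 rows of stu_info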

V. sqoop --options-file

In production, incremental data migration typically combines --options-file with a shell script.
The two commands below are equivalent:
   $ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
   $ sqoop --options-file /users/homer/work/import.txt --table TEST
Note: the options file starts with the tool name (import/export), followed by one option or one value per line, e.g.:
   import
   --connect
   jdbc:mysql://localhost/db
   --username
   foo
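As a concrete sketch, the HDFS import from section II could be rewritten this way (the file name stu_import.txt is arbitrary; remaining options may still be given on the command line):

   # stu_import.txt -- one option or value per line; '#' lines are comments
   import
   --connect
   jdbc:mysql://10.0.0.108:3306/student
   --username
   root
   --password
   root

   $ bin/sqoop --options-file stu_import.txt --table stu_info --target-dir /student --num-mappers 1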

VI. Using sqoop job

$ bin/sqoop job --delete <job-name>
$ bin/sqoop job --list
$ bin/sqoop job --show <job-name>
$ bin/sqoop job --exec <job-name>
$ bin/sqoop job --create <job-name> -- <tool and its options>
$ bin/sqoop job --create stu_info -- \
 import \
--connect \
jdbc:mysql://hadoop09-linux-01.ibeifeng.com:3306/sqoop \
--username root \
--password root \
--table tohdfs \
--target-dir /sqoop \
--num-mappers 1 \
--fields-terminated-by '\t' \
--check-column id \
--incremental append \
--last-value 11
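The saved job is then run by name; by default --exec prompts for the database password at run time, since it is not stored in the job definition:

   $ bin/sqoop job --exec stu_info
   # after each run Sqoop records the new --last-value in the saved job,
   # so the next --exec appends only rows whose id exceeds that value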
PS: incremental import (conflicts with --delete-target-dir)
    --check-column id
    --incremental append/lastmodified   # lastmodified keys off a timestamp column
    --last-value 11
Also:
    --columns field1,field2,field3
    --query <sql>   # the SQL must contain $CONDITIONS and cannot be combined with --table
    --where <condition>   # no $CONDITIONS needed
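For reference, a minimal --query sketch (the free-form SQL and /student_query path are illustrative; single quotes keep the shell from expanding $CONDITIONS, and --split-by becomes mandatory once --num-mappers exceeds 1):

   bin/sqoop import \
   --connect jdbc:mysql://10.0.0.108:3306/student \
   --username root --password root \
   --query 'select id,name from stu_info where id > 2 and $CONDITIONS' \
   --split-by id \
   --target-dir /student_query \
   --num-mappers 1 \
   --fields-terminated-by '\t'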
Original post: https://www.cnblogs.com/eRrsr/p/6050902.html