1-sqoop

————————————————————————————————————————————————————————————————————
**********sqoop1.4.7**********************************************
————————————————————————————————————————————————————————————————————    
    
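1. Under the hood, Sqoop runs as MapReduce jobs, but it uses only the map phase; there is no reduce phase.

    Question 1: Why is Sqoop built on MapReduce?

        Answer: MapReduce is a distributed computing framework, so large volumes of data can be transferred in parallel across the cluster

    Question 2: Why does Sqoop use only the map phase and not the reduce phase?

        Answer: Sqoop only moves data and does not perform any computation on it, so the reduce phase is not needed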

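    Question 3: How do you handle duplicate inserts caused by Sqoop's concurrent tasks?

        Answer:
            ① A Sqoop job is essentially a set of map tasks, so set the number of mappers to 1 (-m 1)
            ② Export in update-or-insert mode (--update-key together with --update-mode allowinsert)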

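    Question 4: How do you do an incremental import?

        ① Use a SQL query that selects only yesterday's data and import that, which gives a daily increment
        ② Use these three parameters

            --check-column   the column to check, usually the primary key or a timestamp column
            --incremental    the incremental import mode: append or lastmodified
            --last-value     the maximum value of the check column from the previous import

            eg:

                    --check-column id                     // column to check
                    --incremental append                 // append mode
                    --last-value 8                    // last imported value of the check column

                    Only rows with id greater than 8 in table to_hdf of the MySQL database test are appended to the files under /to_hdfs2

            What if the requirement is to pick up updated rows rather than newly added rows?

                    Use --incremental lastmodified, set --check-column to the update-timestamp column, and set --last-value to a timestamp; every row whose timestamp is later than that value is imported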
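
A minimal sketch of the lastmodified case described above. The table person, the columns update_time and id, and the target directory are placeholders chosen for illustration, not values from these notes. It assumes the target directory may already exist, in which case Sqoop expects either --merge-key or --append alongside --incremental lastmodified.

```shell
# Hypothetical wrapper around an incremental (lastmodified) import.
# Point SQOOP at a stub to inspect the arguments without running Sqoop.
SQOOP="${SQOOP:-sqoop}"

import_updated_rows() {
  # $1 = timestamp of the previous import, e.g. "2024-01-01 00:00:00"
  "$SQOOP" import \
    --connect jdbc:mysql://localhost:3306/test \
    --username root \
    -P \
    --table person \
    --target-dir /test/person-mysql \
    --check-column update_time \
    --incremental lastmodified \
    --last-value "$1" \
    --merge-key id \
    -m 1
}
```

Calling import_updated_rows "2024-01-01 00:00:00" would import the rows whose update_time is later than that timestamp and merge them into the existing files by id. -P prompts for the password instead of putting it on the command line.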

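2. Sqoop

    Purpose:
        Efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases

    Architecture:

        Sqoop imports and exports data with MapReduce, which provides parallel operation as well as fault tolerance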
————————————————————————————————————————————————————————————————————————————————
*****************Four common tools***********************************************
————————————————————————————————————————————————————————————————————————————————
1. Help tool
———————————————————————————————————————————————————————————————————————————-—————
    sqoop help
    sqoop import --help
————————————————————————————————————————————————————————————————————————————————
2. Import tool
———————————————————————————————————————————————————————————————————————————-————    
    ① Common parameters
    ———————————————————————————————————————————————————————————————————————————
            Number of tasks:
            ———————————————————————————————————————————————————————————————————            
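                -m 4            // when -m is greater than 1, Sqoop splits on the primary key by default; --split-by must be given if the table has no primary key or if --query is used
                --split-by <column>         // an integer column is the usual choice; even then, if its values are not evenly distributed between min and max, the mappers receive unbalanced amounts of data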
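
To see why gaps in the split column cause skew, here is a rough sketch of cutting the range between MIN and MAX of the column into equal-width pieces, one per mapper. It is an illustration only, not Sqoop's actual splitter code; the function name split_ranges and the column name id are made up for the example.

```shell
# Illustration only: divide [min, max] into n equal-width ranges.
split_ranges() {
  min=$1; max=$2; n=$3
  size=$(( (max - min + n) / n ))   # range width, rounded up
  lo=$min
  while [ "$lo" -le "$max" ]; do
    hi=$(( lo + size - 1 ))
    if [ "$hi" -gt "$max" ]; then hi=$max; fi
    echo "id >= $lo AND id <= $hi"
    lo=$(( hi + 1 ))
  done
}

split_ranges 1 100 4
```

With ids 1 to 100 and 4 mappers, each range is 25 wide. If the table only contains ids 1 to 10 and 91 to 100, the two middle ranges match no rows, so two mappers do all the work.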
            ———————————————————————————————————————————————————————————————————
            Filter:
            ———————————————————————————————————————————————————————————————————                       

                --where "gender=0"
            ———————————————————————————————————————————————————————————————————
            Free-form SQL query:
            ———————————————————————————————————————————————————————————————————                        
                sqoop import \
                --connect jdbc:mysql://localhost:3306/test \
                --username root \
                --password 123456 \
                --delete-target-dir \
                --target-dir /test/person-mysql \
                -m 1 \
                --query "select * from person where name='003' and gender=0 and \$CONDITIONS"
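
Sqoop substitutes its own split predicate for the literal token $CONDITIONS, so the token has to reach Sqoop without being expanded by the shell first. The short sketch below shows the two quoting styles that keep it intact; the variable names are only for the demonstration.

```shell
# Double quotes: escape the dollar sign so the shell leaves it alone.
q_double="select * from person where gender=0 and \$CONDITIONS"

# Single quotes: nothing is expanded, so no escaping is needed.
q_single='select * from person where gender=0 and $CONDITIONS'

# Both variables now hold the same text, ending in the literal token.
printf '%s\n' "$q_double" "$q_single"
```

Writing $CONDITIONS unescaped inside double quotes lets the shell replace it with the value of a (usually unset) variable, so Sqoop no longer sees the token in the query.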
            ———————————————————————————————————————————————————————————————————
            Compression:
            ———————————————————————————————————————————————————————————————————    
                -z,--compress
                --compression-codec org.apache.hadoop.io.compress.SnappyCodec        // the default codec is gzip
            ———————————————————————————————————————————————————————————————————
            Storage format:
            ———————————————————————————————————————————————————————————————————    
                --as-avrodatafile
                --as-sequencefile    
                --as-textfile    
                --as-parquetfile
            ———————————————————————————————————————————————————————————————————
            Delimiters:
            ———————————————————————————————————————————————————————————————————    
                --fields-terminated-by '\t'        // field delimiter
                --lines-terminated-by '\n'        // line delimiter
            ———————————————————————————————————————————————————————————————————
            Incremental transfer:
            ———————————————————————————————————————————————————————————————————    
                --check-column "id" 
                --incremental append 
                --last-value 5                    
            ———————————————————————————————————————————————————————————————————
            Null handling:
            ———————————————————————————————————————————————————————————————————    
                --null-string ""             // for string-type columns in the database, the string written in place of a NULL value
                --null-non-string "false"        // for non-string columns in the database, the string written in place of a NULL value
            ———————————————————————————————————————————————————————————————————
            Fetch size:
            ———————————————————————————————————————————————————————————————————    
                --fetch-size <n>            // number of rows to read from the database at a time
    —————————————————————————————————————————————————————————————————————————————
    ② Import a single table from an RDBMS into HDFS (specific parameters)
    —————————————————————————————————————————————————————————————————————————————            
                
            --delete-target-dir                // if the directory already exists, delete it: /test/person-mysql
            --append                    // if the directory already exists, append to it: /test/person-mysql

                eg:
                    sqoop import \
                    --connect jdbc:mysql://localhost:3306/test \
                    --username root \
                    --password 123456 \
                    --table person \
                    --append \
                    --target-dir /test/person-mysql
    ———————————————————————————————————————————————————————————————————————
    ③ Import a single table into Hive (specific parameters)
    ———————————————————————————————————————————————————————————————————————            

            --hive-overwrite     Overwrite existing data in the Hive table.
            --create-hive-table  Create the table; the job fails if the table already exists.
            --hive-table <table-name>    Set the table name to use when importing into Hive.

                eg (no partitioning):
                    sqoop import \
                    --connect jdbc:mysql://192.168.56.121:3306/metastore \
                    --username hiveuser \
                    --password redhat \
                    --table TBLS \
                    --fields-terminated-by "\t" \
                    --lines-terminated-by "\n" \
                    --hive-import \
                    --hive-overwrite \
                    --create-hive-table \
                    --hive-table dw_srclog.TBLS \
                    --delete-target-dir

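            --hive-drop-import-delims    When importing into Hive, drop \n, \r, and \01 from string fields.
            --hive-delims-replacement    When importing into Hive, replace \n, \r, and \01 in string fields with a user-defined string.

            --hive-partition-key        Name of the Hive field on which the imported data is partitioned.
            --hive-partition-value <v>    String value used as the partition key for the data this job imports into Hive.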
            
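                eg (static partition):

                    sqoop import \
                    --connect jdbc:oracle:thin:@127.0.0.1:1521:orcl \
                    --username test \
                    --password 123456 \
                    --columns "viewTime,userid,page_url,referrer_url,ip" \
                    --hive-partition-key "dt" \
                    --hive-partition-value "2018" \
                    --query "SELECT viewTime,userid,page_url,referrer_url,ip from page_view  WHERE 1=1 and \$CONDITIONS" \
                    --hive-table test.page_view \
                    --hive-drop-import-delims \
                    --target-dir "/data/test/page_view" \
                    --hive-overwrite \
                    --null-string '\\N' \
                    --null-non-string '\\N' \
                    --hive-import \
                    -m 1

                    Note: the partition has to be created in advance: ALTER TABLE page_view ADD PARTITION (dt='2018')
                    Note: a --query import running with more than one mapper requires --split-by, so this example uses -m 1

            Note: Sqoop's Hive import does not support dynamic partitions or multi-level partitions (for example by day and by hour); load the data into HDFS or a temporary Hive table first, then populate the partitions with Hive SQL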
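
A rough sketch of the staging-table workaround mentioned in the note above. It assumes the data has already been imported into an unpartitioned staging table named test.page_view_tmp (a hypothetical name), and that viewTime is a string beginning with yyyy-MM-dd so the partition value can be derived from it; both are assumptions made for illustration.

```shell
# Hypothetical wrapper: fill dynamic partitions from a staging table with Hive SQL.
# Point HIVE at a stub to inspect the statement without running Hive.
HIVE="${HIVE:-hive}"

load_dynamic_partitions() {
  "$HIVE" -e "
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT OVERWRITE TABLE test.page_view PARTITION (dt)
    SELECT viewTime, userid, page_url, referrer_url, ip,
           substr(viewTime, 1, 10) AS dt   -- assumed: viewTime starts with yyyy-MM-dd
    FROM test.page_view_tmp;
  "
}
```

The dynamic partition column has to come last in the SELECT list, which is why dt is computed at the end.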
————————————————————————————————————————————————————————————————————————————————
3. Export tool
———————————————————————————————————————————————————————————————————————————-————

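    sqoop-export (an export reads from a directory, so the syntax is the same for Hive and HDFS data)

        ① Export modes:

            Insert:
                Each input record is converted into an INSERT statement. If the database table has constraints (for example a primary key column whose values must be unique) and already contains data that an inserted row conflicts with, the export fails
                Because plain inserts fail easily, this mode is not recommended

            Update:
                --update-mode
                        // updateonly (default): update only, never insert
                        // allowinsert: update existing rows and insert new ones
                --update-key        // column(s) used to match the rows to update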


        ② Export Hive/HDFS data to a relational database:

            --input-null-string '\\N'        // for string columns, the input string that is interpreted as NULL
            --input-null-non-string '\\N'        // for non-string columns, the input string that is interpreted as NULL
            
            --fields-terminated-by '\t'        // field delimiter
            --lines-terminated-by '\n'        // line delimiter

            -m 4                // the export is parallelized over the files in the directory; no integer split column is needed
            --export-dir <dir>        // HDFS directory to export
        
            
                eg (the table must already exist in the database, with a matching structure):

                    sqoop export \
                    --connect jdbc:mysql://127.0.0.1:3306/market \
                    --username admin \
                    --password 123456 \
                    --table MySQL_Test_Table \
                    --export-dir /user/hive/pms/yhd_categ_prior \
                    --update-mode allowinsert \
                    --update-key category_id \
                    --fields-terminated-by '\001' \
                    --lines-terminated-by '\n' \
                    --input-null-string '\\N' \
                    --input-null-non-string '\\N'
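
For plain insert-mode exports, where a failed job can leave partially written rows behind, one option is to go through a staging table. The sketch below assumes a staging table named MySQL_Test_Table_stage (a hypothetical name) that already exists with the same structure as the target. Staging cannot be combined with --update-key, so this variant applies only to insert mode.

```shell
# Hypothetical wrapper: insert-mode export through a staging table.
# Point SQOOP at a stub to inspect the arguments without running Sqoop.
SQOOP="${SQOOP:-sqoop}"

export_via_staging() {
  "$SQOOP" export \
    --connect jdbc:mysql://127.0.0.1:3306/market \
    --username admin \
    -P \
    --table MySQL_Test_Table \
    --staging-table MySQL_Test_Table_stage \
    --clear-staging-table \
    --export-dir /user/hive/pms/yhd_categ_prior \
    --fields-terminated-by '\001' \
    --lines-terminated-by '\n' \
    --input-null-string '\\N' \
    --input-null-non-string '\\N'
}
```

Rows are first written to the staging table and moved to the target table in a single transaction once all tasks succeed; --clear-staging-table empties the staging table before the job starts.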
————————————————————————————————————————————————————————————————————————————————
4. Job tool
———————————————————————————————————————————————————————————————————————————-————
    Saves a command together with its parameter settings so it can be invoked again easily

    Create a job
        sqoop job \
        --create sqoopimport1 \
        -- import \
        --connect jdbc:mysql://localhost:3306/sqooptest \
        --username root \
        --password 123qwe \
        --table sqoop_job
    List jobs
        sqoop job --list   //  show one job's definition: sqoop job --show jobname
    Run a job
        sqoop job --exec sqoopimport1
    Delete a job
        sqoop job --delete your_sqoop_job_name
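
Saved jobs pair naturally with incremental imports: when an incremental import is stored as a job, Sqoop records the new last value after each run, so --last-value does not have to be tracked by hand. A minimal sketch, assuming a check column named id and a target directory /test/sqoop_job (both placeholders):

```shell
# Hypothetical wrapper: create a saved job for an incremental append import.
# Point SQOOP at a stub to inspect the arguments without running Sqoop.
SQOOP="${SQOOP:-sqoop}"

create_incremental_job() {
  # $1 = name of the saved job
  "$SQOOP" job --create "$1" \
    -- import \
    --connect jdbc:mysql://localhost:3306/sqooptest \
    --username root \
    -P \
    --table sqoop_job \
    --target-dir /test/sqoop_job \
    --check-column id \
    --incremental append \
    --last-value 0 \
    -m 1
}
```

After create_incremental_job daily_sqoop_job, each sqoop job --exec daily_sqoop_job would import only the rows whose id is greater than the value stored by the previous run.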