Example: importing data from Oracle into Hive with Sqoop

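How Sqoop imports data into Hive, step by step:
1. It first writes the data into the HDFS directory given by --target-dir, stored as files (such as _SUCCESS and part-m-00000).
2. It creates the table in Hive.
3. It calls Hive's LOAD DATA INPATH to move the data from --target-dir into Hive.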
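
To make step 3 concrete, here is a minimal sketch that only prints the kind of LOAD DATA INPATH statement involved; it does not call Hive or Sqoop. The function name hive_load_statement and the demo table, directory and partition are placeholders made up for illustration, and the statement Sqoop actually generates may differ in detail.

```shell
# Build (and print) a Hive LOAD DATA statement for a staging directory.
# Nothing is executed against Hive; this only shows the statement's shape.
hive_load_statement() {
  # $1 = HDFS staging directory, $2 = Hive table,
  # $3 = partition spec (optional), $4 = "overwrite" to replace existing data
  stmt="LOAD DATA INPATH '$1'"
  if [ "$4" = "overwrite" ]; then
    stmt="$stmt OVERWRITE"
  fi
  stmt="$stmt INTO TABLE $2"
  if [ -n "$3" ]; then
    stmt="$stmt PARTITION ($3)"
  fi
  printf '%s;\n' "$stmt"
}

# Hypothetical names, for illustration only.
hive_load_statement /tmp/staging/demo_table demo_db.demo_table "dt='2019-07-02'" overwrite
```

Because LOAD DATA INPATH moves the files rather than copying them, the staging directory ends up empty once the load succeeds.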

The command below imports data from an Oracle database into Hive; the database username and password are replaced with xxx:

sqoop import -D mapred.job.queue.name=hdpuser007_queue02 -D mapred.job.name=daily_registereduser_record_SQOOP \
--connect jdbc:oracle:thin:@localhost:1521:orcl \
--username xxx \
--password xxx \
--query "SELECT * FROM USERDATA.daily_registereduser_record WHERE ${updated} \$CONDITIONS" \
-m 1 --hive-table user_bhvr.orcl_USERDATA_daily_registereduser_record_delta \
--hive-drop-import-delims \
--null-non-string '\\N' \
--null-string '\\N' \
--target-dir /apps-data/hdpuser007/user_bhvr/orcl_USERDATA_daily_registereduser_record_delta \
--hive-partition-key y,m,d \
--hive-partition-value 2019,07,02 \
--hive-import \
--hive-overwrite \
--delete-target-dir

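To avoid ambiguity, check any syntax question against the official Apache documentation first. Running "sqoop version" shows that I am on 1.4.5-cdh5.4.2; the Sqoop User Guide for this version is here: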

http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html

First, the options that belong to the Hive arguments:

--hive-import: use this whenever you want to import data into Hive; it needs no further explanation. The official docs say: Import tables into Hive (Uses Hive's default delimiters if none are set.)
--hive-overwrite: without it, rerunning the same Sqoop command adds more files under the same (specified) directory, such as part-m-00000, part-m-00001 and so on. Official definition: Overwrite existing data in the Hive table.
--hive-drop-import-delims: official definition: Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-partition-key: official definition: Name of a hive field to partition are sharded on.
--hive-partition-value <v>: clearly meant to be paired with the key above, and the value must be a string. Official definition: String-value that serves as partition key for this imported into hive in this job. Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string.
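
For a job that runs every day, the partition value usually has to be derived from the run date instead of being typed by hand. The sketch below assumes the run date arrives as a YYYY-MM-DD string and turns it into the comma-separated form used in the example command; the function name partition_value_for is hypothetical and not part of Sqoop.

```shell
# Split a run date (assumed format: YYYY-MM-DD) into the comma-separated
# string passed to --hive-partition-value in the example command.
partition_value_for() {
  run_date=$1
  year=${run_date%%-*}    # text before the first "-"
  rest=${run_date#*-}     # text after the first "-"
  month=${rest%%-*}
  day=${rest#*-}
  printf '%s,%s,%s\n' "$year" "$month" "$day"
}

partition_value_for 2019-07-02
```

The result stays a plain string, leading zeros included, which matches the requirement that the partition value be a string.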

Next, the options that belong to the Import control arguments:

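--warehouse-dir <dir>: this option is used together with --table, so it is not part of our example, but it is still worth covering. Without it, Sqoop puts the files under the current user's home directory in HDFS (By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files)). With it, Sqoop automatically creates a directory under <dir> named after the table given to --table and stores the data files there; several tables can share the same parent directory, which then holds one subdirectory per table. Official definition: HDFS parent for table destination.
--target-dir <dir>: the directory where Sqoop writes the data files of the imported table. In a Hive import it acts as a staging path and is emptied once the load into Hive completes. It is required when importing an arbitrary SQL query, also called a free-form query, that is, with --query (Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument. When importing a free-form query, you must specify a destination directory with --target-dir). An arbitrary query has no name, so Sqoop does not know what to call the directory in HDFS; you have to set --target-dir in the command yourself. Note that --target-dir and --warehouse-dir cannot be used together. Official definition: HDFS destination dir.
--delete-target-dir: adding this option is the safer choice. If an import fails partway, so that the files already exist in HDFS but Hive has no data, the import has to be rerun. On the rerun, Sqoop finds that the --target-dir directory already exists in HDFS and fails with an error (ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://.../apps-data/hdpuser007/user_bhvr/orcl_USERDATA_daily_registereduser_record_delta already exists). With --delete-target-dir, Sqoop deletes the directory automatically and the import then runs through smoothly.
--null-string <null-string>: the string written in place of null values in string columns. Official definition: The string to be written for a null value for string columns.
--null-non-string <null-string>: the string written in place of null values in non-string columns. Official definition: The string to be written for a null value for non-string columns.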
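
The quoting of '\\N' in the example command is easy to get wrong, so here is a small shell-only sketch of what each quoting style actually hands to Sqoop. It does not run Sqoop; show_arg is a hypothetical helper that just echoes its argument back. As far as I understand, Sqoop itself then reduces \\N to \N, which is Hive's default text representation of NULL, so treat that last part as an assumption to verify on your own cluster.

```shell
# Print an argument exactly as a program would receive it.
show_arg() { printf '%s\n' "$1"; }

single_quoted=$(show_arg '\\N')   # single quotes: both backslashes are kept
double_quoted=$(show_arg "\\N")   # double quotes: the shell collapses \\ to \

printf 'single-quoted: %s\n' "$single_quoted"
printf 'double-quoted: %s\n' "$double_quoted"
```

With single quotes Sqoop receives two backslashes followed by N, the form used in the example command; with double quotes the shell has already removed one backslash before Sqoop sees the value.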

For more on --warehouse-dir and --target-dir, this article explains the difference clearly: http://f.dataguru.cn/hadoop-914126-1-1.html
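
The directory rules for --warehouse-dir and --target-dir can be summarised in a few lines of shell. This is only a sketch of the rules quoted from the user guide, not Sqoop's own code; resolve_import_dir is a hypothetical helper, and the user, table and paths are the guide's sample names.

```shell
# Work out where the imported files would land, following the rules above.
resolve_import_dir() {
  # $1 = HDFS user, $2 = table name,
  # $3 = --warehouse-dir value (or empty), $4 = --target-dir value (or empty)
  if [ -n "$3" ] && [ -n "$4" ]; then
    echo "error: --warehouse-dir and --target-dir cannot be combined" >&2
    return 1
  fi
  if [ -n "$4" ]; then
    printf '%s\n' "$4"                  # --target-dir: used exactly as given
  elif [ -n "$3" ]; then
    printf '%s/%s\n' "${3%/}" "$2"      # --warehouse-dir: parent + table name
  else
    printf '/user/%s/%s\n' "$1" "$2"    # default: the user's HDFS home directory
  fi
}

resolve_import_dir someuser foo "" ""        # neither option given
resolve_import_dir someuser foo /shared ""   # --warehouse-dir /shared
resolve_import_dir someuser foo "" /dest     # --target-dir /dest
```

The three calls print /user/someuser/foo, /shared/foo and /dest. A free-form --query import has no table name to fall back on, which is why only the --target-dir branch applies to it.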

Other options:

-m 1: if we want a sequential import, a single map task is enough. In the official docs' words: the query can be executed once and imported serially, by specifying a single map task with -m 1. The opposite is a parallel import, which must be combined with --split-by (import the results of a query in parallel ... You must also select a splitting column with --split-by). Either way, serial or parallel, the query has to include $CONDITIONS (Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression). Note that in our example the query is wrapped in double quotes, since the WHERE clause may itself contain single quotes, so $CONDITIONS needs a backslash in front of it: "...... \$CONDITIONS". The official docs say the same. Note: If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like:

　　"SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

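The effect of the backslash can be checked in the shell alone, without Sqoop. The sketch below assumes CONDITIONS is not set as a shell variable, which is the usual situation, and compares the three ways of quoting the query; the sample queries are the user guide's toy table x, not the real one.

```shell
# CONDITIONS is normally not set in the shell; make that explicit here.
unset CONDITIONS

# Double quotes plus backslash: the shell leaves the literal token in place.
escaped="SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"
# Double quotes, no backslash: the shell expands the (empty) variable.
unescaped="SELECT * FROM x WHERE a='foo' AND $CONDITIONS"
# Single quotes: nothing is expanded, so no backslash is needed.
single_quoted='SELECT * FROM x WHERE a = 1 AND $CONDITIONS'

printf 'escaped:       %s\n' "$escaped"
printf 'unescaped:     %s\n' "$unescaped"
printf 'single-quoted: %s\n' "$single_quoted"
```

Only the unescaped variant loses the token: the query then ends in a dangling AND, and Sqoop has nothing left to replace with its own condition.
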
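Other cases:

If the table orcl_USERDATA_daily_registereduser_record_delta does not exist in Hive yet, the Sqoop command creates it automatically.

If the table already exists in the Hive database, the same Sqoop command still imports the data, as long as the table structure and partitions are correct.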

References:

Apache Sqoop User Guide: http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
Difference between --target-dir and --warehouse-dir: http://f.dataguru.cn/hadoop-914126-1-1.html
Importing data into Hive with Sqoop: https://www.cnblogs.com/dongdone/p/5696233.html
Common Sqoop commands, an easy-to-follow summary in Chinese: http://www.aboutyun.com/thread-9983-1-1.html
Difference between ${} and #{}: https://www.cnblogs.com/eastwjn/p/9699966.html
Another good summary of common commands: https://blog.csdn.net/jerrydzan/article/details/88527619