Using third-party dependency jar files in a Hadoop Job

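After implementing a Hadoop MapReduce Job that depends on many external jar files, you may see class-not-found exceptions when running it on a Hadoop cluster. This almost always means that, while the Job was executing, the required jar file could not be found in the execution context (strictly speaking, the unjarred directory that holds the corresponding class files). The natural fix is to configure the classpath correctly so that the MapReduce Job can find these classes at runtime.
There are two good ways to do this. The first is to set HADOOP_CLASSPATH and add the jar files the Job depends on to it; when the variable is set on the command line that launches the Job, it applies only to that Job and is gone once the Job finishes. The second is to package the dependency jars together with the Job code into a single jar file at build time. The resulting jar can be fairly large, but with a build tool such as Maven each Job keeps its own dependency configuration, which is easy to manage. Both approaches are described below.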
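When diagnosing a class-not-found failure, it can help to check which of the expected classes are visible on the current classpath before submitting the Job. The following is a minimal sketch using only the JDK; the class name ClasspathCheck and the two class names probed in main are illustrative assumptions, not part of the original Job.

```java
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError also covers NoClassDefFoundError for missing transitive classes
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed examples: classes an HBase-based Job would typically need
        String[] required = {
            "org.apache.hadoop.hbase.client.Put",
            "org.apache.zookeeper.ZooKeeper"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Running this with the same classpath as the Job client gives a quick hint about which jar is missing.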

Setting HADOOP_CLASSPATH

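For example, suppose we have an application that uses HBase. Working with tables in an HBase database also requires ZooKeeper, so the locations of the corresponding jar files must be set correctly for the Job to find and load them at runtime.
Hadoop provides a helper class, org.apache.hadoop.util.GenericOptionsParser, which handles generic command-line options such as -libjars and helps add the specified files to the classpath, which makes this fairly easy.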
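To make the idea concrete, here is a deliberately simplified, JDK-only sketch of what separating a generic option from the application's own arguments looks like. It is not Hadoop's implementation: the class LibJarsArgs is hypothetical, it recognizes only -libjars, and the real GenericOptionsParser supports further options and also registers the jars with the Job configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LibJarsArgs {
    final List<String> libJars = new ArrayList<>();   // values given to -libjars
    final List<String> remaining = new ArrayList<>(); // everything else, in order

    /** Toy parser: pulls "-libjars a.jar,b.jar" out of args and keeps the rest. */
    static LibJarsArgs parse(String[] args) {
        LibJarsArgs result = new LibJarsArgs();
        for (int i = 0; i < args.length; i++) {
            if ("-libjars".equals(args[i]) && i + 1 < args.length) {
                // The option value is a comma-separated list of jar paths
                result.libJars.addAll(Arrays.asList(args[++i].split(",")));
            } else {
                result.remaining.add(args[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LibJarsArgs parsed = parse(new String[] {
            "-libjars", "hbase-0.94.1.jar,zookeeper-3.4.3.jar",
            "t_sub_domains", "/user/xiaoxiang/datasets/domains/"
        });
        System.out.println("libjars   = " + parsed.libJars);
        System.out.println("remaining = " + parsed.remaining);
    }
}
```

The remaining arguments correspond to what getRemainingArgs() hands back to the application, here a table name and an input path.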
Below is an example we implemented. The entry-point class of the program looks like this:

package org.shirdrn.kodz.inaction.hbase.job.importing;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Table DDL: create 't_sub_domains', 'cf_basic', 'cf_status'
 * <pre>
 * cf_basic:domain cf_basic:len
 * cf_status:status cf_status:live
 * </pre>
 *
 * @author shirdrn
 */
public class DataImporter {

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {

        Configuration conf = HBaseConfiguration.create();
        // Consume generic options such as -libjars; what remains are our own arguments
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        // Exactly two application arguments are expected: <tableName> <input>
        if (otherArgs.length != 2) {
            System.err.println("Usage: " +
                    " DataImporter -libjars <jar1>[,<jar2>...[,<jarN>]] <tableName> <input>");
            System.exit(1);
        }
        String tableName = otherArgs[0].trim();
        String input = otherArgs[1].trim();

        // set table columns
        conf.set("table.cf.family", "cf_basic");
        conf.set("table.cf.qualifier.fqdn", "domain");
        conf.set("table.cf.qualifier.timestamp", "create_at");

        Job job = new Job(conf, "Import into HBase table");
        job.setJarByClass(DataImporter.class);
        // ImportFileLinesMapper lives in the same package (not shown here)
        job.setMapperClass(ImportFileLinesMapper.class);
        job.setOutputFormatClass(TableOutputFormat.class);

        // Write the mapper output (Put objects) directly into the target table
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);

        // Map-only job: no reduce phase
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(input));

        int exitCode = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitCode);
    }

}

As shown, the -libjars option lets us specify the third-party jar files the Job depends on at runtime. It is used as follows:

  • Step 1: Set environment variables

Edit the .bashrc file and add the following:

export HADOOP_HOME=/opt/stone/cloud/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
export HBASE_HOME=/opt/stone/cloud/hbase-0.94.1
export PATH=$PATH:$HBASE_HOME/bin
export ZK_HOME=/opt/stone/cloud/zookeeper-3.4.3

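Remember to make the new configuration take effect in the current shell (the two commands are equivalent; either one is enough):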

. .bashrc
source .bashrc

Now the external jar files can be referenced conveniently.

  • Step 2: Determine the list of jar files the Job depends on

As mentioned above, using HBase requires the HBase and ZooKeeper jar files. The files used are as follows:

HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.1.jar:$ZK_HOME/zookeeper-3.4.3.jar ./bin/hadoop jar import-into-hbase.jar

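Here the HADOOP_CLASSPATH variable is set as a prefix of the command that runs the Job, so it applies only to that Job; there is no need to configure it in .bashrc.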
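The same per-command scoping can be reproduced when a Job is launched from a Java program instead of a shell. The sketch below is an assumption-laden illustration using only the JDK: the class ScopedClasspath and its method name are hypothetical, and the process is prepared but not started.

```java
import java.util.Arrays;
import java.util.List;

final class ScopedClasspath {

    /** Prepares a child process whose environment (and only its environment) has HADOOP_CLASSPATH set. */
    static ProcessBuilder withHadoopClasspath(List<String> jars, List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Only the child's environment map is modified; the parent JVM's environment is untouched
        pb.environment().put("HADOOP_CLASSPATH", String.join(":", jars));
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = withHadoopClasspath(
            Arrays.asList("/opt/stone/cloud/hbase-0.94.1/hbase-0.94.1.jar",
                          "/opt/stone/cloud/zookeeper-3.4.3/zookeeper-3.4.3.jar"),
            Arrays.asList("hadoop", "jar", "import-into-hbase.jar"));
        System.out.println("child HADOOP_CLASSPATH = " + pb.environment().get("HADOOP_CLASSPATH"));
        System.out.println("command = " + pb.command());
        // Calling pb.inheritIO().start() would actually launch the command; omitted in this sketch
    }
}
```

As with the shell prefix form, nothing leaks into later commands because the variable exists only in the environment of that one child process.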

  • Step 3: Run the Job

To run the Job, set the HADOOP_CLASSPATH variable on the command line and use the -libjars option to specify the third-party jar files the Job depends on. The full command is:

xiaoxiang@ubuntu3:~/hadoop$ HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.1.jar:$ZK_HOME/zookeeper-3.4.3.jar ./bin/hadoop jar import-into-hbase.jar org.shirdrn.kodz.inaction.hbase.job.importing.DataImporter -libjars $HBASE_HOME/hbase-0.94.1.jar,$HBASE_HOME/lib/protobuf-java-2.4.0a.jar,$ZK_HOME/zookeeper-3.4.3.jar t_sub_domains /user/xiaoxiang/datasets/domains/

Note that the entries in the environment variable are separated by colons, while the entries of the -libjars option are separated by commas.
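Because the two lists usually name the same jars but use different separators, one way to avoid mistakes is to keep a single list of jar paths and derive both strings from it. A minimal sketch follows; the class and method names are hypothetical, and the colon is hard-coded because the article targets Linux.

```java
import java.util.Arrays;
import java.util.List;

final class JarListFormats {

    /** Colon-separated form, as expected in the HADOOP_CLASSPATH environment variable on Linux. */
    static String forHadoopClasspath(List<String> jars) {
        return String.join(":", jars);
    }

    /** Comma-separated form, as expected by the -libjars option. */
    static String forLibJars(List<String> jars) {
        return String.join(",", jars);
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
            "/opt/stone/cloud/hbase-0.94.1/hbase-0.94.1.jar",
            "/opt/stone/cloud/zookeeper-3.4.3/zookeeper-3.4.3.jar");
        System.out.println("HADOOP_CLASSPATH=" + forHadoopClasspath(jars));
        System.out.println("-libjars " + forLibJars(jars));
    }
}
```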

With this, the Job runs correctly.
Here is the output of running our Job:

13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.3-1240972, built on 02/06/2012 10:48 GMT
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:host.name=ubuntu3
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_30
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_30/jre
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/stone/cloud/hadoop-1.0.3/libexec/../conf:/usr/java/jdk1.6.0_30/lib/tools.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/..:/opt/stone/cloud/hadoop-1.0.3/libexec/../hadoop-core-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/asm-3.2.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/aspectjrt-1.6.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/aspectjtools-1.6.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-cli-1.2.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-codec-1.4.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-collections-3.2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-configuration-1.6.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-daemon-1.0.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-digester-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-el-1.0.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-io-2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-lang-2.4.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-logging-1.1.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-math-2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/commons-net-1.4.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/core-3.1.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-capacity-scheduler-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-datajoin-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-fairscheduler-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hadoop-thriftfs-1.0.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/hsqldb-1.8.0.10.jar:/
opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jdeb-0.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jersey-core-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jersey-json-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jersey-server-1.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jets3t-0.6.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jetty-6.1.26.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jetty-util-6.1.26.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jsch-0.1.42.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/junit-4.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/kfs-0.2.2.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/log4j-1.2.15.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/mockito-all-1.8.5.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/oro-2.0.8.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/protobuf-java-2.4.0a.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/slf4j-api-1.4.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/xmlenc-0.52.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/jsp-2.1/jsp-api-2.1.jar:/opt/stone/cloud/hbase-0.94.1/hbase-0.94.1.jar:/opt/stone/cloud/zookeeper-3.4.3/zookeeper-3.4.3.jar
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/stone/cloud/hadoop-1.0.3/libexec/../lib/native/Linux-amd64-64
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:os.version=3.0.0-12-server
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:user.name=xiaoxiang
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/xiaoxiang
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/stone/cloud/hadoop-1.0.3
13/04/10 22:03:32 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ubuntu3:2222 sessionTimeout=180000 watcher=hconnection
13/04/10 22:03:32 INFO zookeeper.ClientCnxn: Opening socket connection to server /172.0.8.252:2222
13/04/10 22:03:32 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 17561@ubuntu3
13/04/10 22:03:32 WARN client.ZooKeeperSaslClient: SecurityException: java.lang.SecurityException: Unable to locate a login configuration occurred when trying to find JAAS configuration.
13/04/10 22:03:32 INFO client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration.
13/04/10 22:03:32 INFO zookeeper.ClientCnxn: Socket connection established to ubuntu3/172.0.8.252:2222, initiating session
13/04/10 22:03:32 INFO zookeeper.ClientCnxn: Session establishment complete on server ubuntu3/172.0.8.252:2222, sessionid = 0x13decd0f3960042, negotiated timeout = 180000
13/04/10 22:03:32 INFO mapreduce.TableOutputFormat: Created table instance for t_sub_domains
13/04/10 22:03:32 INFO input.FileInputFormat: Total input paths to process : 1
13/04/10 22:03:32 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/10 22:03:32 WARN snappy.LoadSnappy: Snappy native library not loaded
13/04/10 22:03:32 INFO mapred.JobClient: Running job: job_201303302227_0034
13/04/10 22:03:33 INFO mapred.JobClient:  map 0% reduce 0%
13/04/10 22:03:50 INFO mapred.JobClient:  map 2% reduce 0%
13/04/10 22:03:53 INFO mapred.JobClient:  map 3% reduce 0%
13/04/10 22:03:56 INFO mapred.JobClient:  map 4% reduce 0%
13/04/10 22:03:59 INFO mapred.JobClient:  map 6% reduce 0%
13/04/10 22:04:03 INFO mapred.JobClient:  map 7% reduce 0%
13/04/10 22:04:06 INFO mapred.JobClient:  map 8% reduce 0%
13/04/10 22:04:09 INFO mapred.JobClient:  map 10% reduce 0%
13/04/10 22:04:15 INFO mapred.JobClient:  map 12% reduce 0%
13/04/10 22:04:18 INFO mapred.JobClient:  map 13% reduce 0%
13/04/10 22:04:21 INFO mapred.JobClient:  map 14% reduce 0%
13/04/10 22:04:24 INFO mapred.JobClient:  map 15% reduce 0%
13/04/10 22:04:27 INFO mapred.JobClient:  map 17% reduce 0%
13/04/10 22:04:33 INFO mapred.JobClient:  map 18% reduce 0%
13/04/10 22:04:36 INFO mapred.JobClient:  map 19% reduce 0%
13/04/10 22:04:39 INFO mapred.JobClient:  map 20% reduce 0%
13/04/10 22:04:42 INFO mapred.JobClient:  map 21% reduce 0%
13/04/10 22:04:45 INFO mapred.JobClient:  map 23% reduce 0%
13/04/10 22:04:48 INFO mapred.JobClient:  map 24% reduce 0%
13/04/10 22:04:51 INFO mapred.JobClient:  map 25% reduce 0%
13/04/10 22:04:54 INFO mapred.JobClient:  map 27% reduce 0%
13/04/10 22:04:57 INFO mapred.JobClient:  map 28% reduce 0%
13/04/10 22:05:00 INFO mapred.JobClient:  map 29% reduce 0%
13/04/10 22:05:03 INFO mapred.JobClient:  map 31% reduce 0%
13/04/10 22:05:06 INFO mapred.JobClient:  map 32% reduce 0%
13/04/10 22:05:09 INFO mapred.JobClient:  map 33% reduce 0%
13/04/10 22:05:12 INFO mapred.JobClient:  map 34% reduce 0%
13/04/10 22:05:15 INFO mapred.JobClient:  map 35% reduce 0%
13/04/10 22:05:18 INFO mapred.JobClient:  map 37% reduce 0%
13/04/10 22:05:21 INFO mapred.JobClient:  map 38% reduce 0%
13/04/10 22:05:24 INFO mapred.JobClient:  map 39% reduce 0%
13/04/10 22:05:27 INFO mapred.JobClient:  map 41% reduce 0%
13/04/10 22:05:30 INFO mapred.JobClient:  map 42% reduce 0%
13/04/10 22:05:33 INFO mapred.JobClient:  map 43% reduce 0%
13/04/10 22:05:36 INFO mapred.JobClient:  map 44% reduce 0%
13/04/10 22:05:39 INFO mapred.JobClient:  map 46% reduce 0%
13/04/10 22:05:42 INFO mapred.JobClient:  map 47% reduce 0%
13/04/10 22:05:45 INFO mapred.JobClient:  map 48% reduce 0%
13/04/10 22:05:48 INFO mapred.JobClient:  map 50% reduce 0%
13/04/10 22:05:54 INFO mapred.JobClient:  map 52% reduce 0%
13/04/10 22:05:57 INFO mapred.JobClient:  map 53% reduce 0%
13/04/10 22:06:00 INFO mapred.JobClient:  map 54% reduce 0%
13/04/10 22:06:03 INFO mapred.JobClient:  map 55% reduce 0%
13/04/10 22:06:06 INFO mapred.JobClient:  map 57% reduce 0%
13/04/10 22:06:12 INFO mapred.JobClient:  map 59% reduce 0%
13/04/10 22:06:15 INFO mapred.JobClient:  map 60% reduce 0%
13/04/10 22:06:18 INFO mapred.JobClient:  map 61% reduce 0%
13/04/10 22:06:21 INFO mapred.JobClient:  map 62% reduce 0%
13/04/10 22:06:24 INFO mapred.JobClient:  map 63% reduce 0%
13/04/10 22:06:27 INFO mapred.JobClient:  map 64% reduce 0%
13/04/10 22:06:30 INFO mapred.JobClient:  map 66% reduce 0%
13/04/10 22:06:33 INFO mapred.JobClient:  map 67% reduce 0%
13/04/10 22:06:36 INFO mapred.JobClient:  map 68% reduce 0%
13/04/10 22:06:42 INFO mapred.JobClient:  map 69% reduce 0%
13/04/10 22:06:45 INFO mapred.JobClient:  map 70% reduce 0%
13/04/10 22:06:48 INFO mapred.JobClient:  map 71% reduce 0%
13/04/10 22:06:51 INFO mapred.JobClient:  map 73% reduce 0%
13/04/10 22:06:54 INFO mapred.JobClient:  map 74% reduce 0%
13/04/10 22:06:57 INFO mapred.JobClient:  map 75% reduce 0%
13/04/10 22:07:00 INFO mapred.JobClient:  map 77% reduce 0%
13/04/10 22:07:03 INFO mapred.JobClient:  map 78% reduce 0%
13/04/10 22:07:12 INFO mapred.JobClient:  map 79% reduce 0%
13/04/10 22:07:18 INFO mapred.JobClient:  map 80% reduce 0%
13/04/10 22:07:24 INFO mapred.JobClient:  map 81% reduce 0%
13/04/10 22:07:30 INFO mapred.JobClient:  map 82% reduce 0%
13/04/10 22:07:36 INFO mapred.JobClient:  map 83% reduce 0%
13/04/10 22:07:48 INFO mapred.JobClient:  map 84% reduce 0%
13/04/10 22:07:51 INFO mapred.JobClient:  map 85% reduce 0%
13/04/10 22:07:59 INFO mapred.JobClient:  map 86% reduce 0%
13/04/10 22:08:05 INFO mapred.JobClient:  map 87% reduce 0%
13/04/10 22:08:11 INFO mapred.JobClient:  map 88% reduce 0%
13/04/10 22:08:17 INFO mapred.JobClient:  map 89% reduce 0%
13/04/10 22:08:23 INFO mapred.JobClient:  map 90% reduce 0%
13/04/10 22:08:29 INFO mapred.JobClient:  map 91% reduce 0%
13/04/10 22:08:35 INFO mapred.JobClient:  map 92% reduce 0%
13/04/10 22:08:41 INFO mapred.JobClient:  map 93% reduce 0%
13/04/10 22:08:47 INFO mapred.JobClient:  map 94% reduce 0%
13/04/10 22:08:53 INFO mapred.JobClient:  map 95% reduce 0%
13/04/10 22:08:59 INFO mapred.JobClient:  map 96% reduce 0%
13/04/10 22:09:05 INFO mapred.JobClient:  map 97% reduce 0%
13/04/10 22:09:11 INFO mapred.JobClient:  map 98% reduce 0%
13/04/10 22:09:17 INFO mapred.JobClient:  map 99% reduce 0%
13/04/10 22:09:23 INFO mapred.JobClient:  map 100% reduce 0%
13/04/10 22:09:31 INFO mapred.JobClient: Job complete: job_201303302227_0034
13/04/10 22:09:31 INFO mapred.JobClient: Counters: 18
13/04/10 22:09:31 INFO mapred.JobClient:   Job Counters
13/04/10 22:09:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=550605
13/04/10 22:09:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/10 22:09:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/10 22:09:31 INFO mapred.JobClient:     Launched map tasks=2
13/04/10 22:09:31 INFO mapred.JobClient:     Data-local map tasks=2
13/04/10 22:09:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/10 22:09:31 INFO mapred.JobClient:   File Output Format Counters
13/04/10 22:09:31 INFO mapred.JobClient:     Bytes Written=0
13/04/10 22:09:31 INFO mapred.JobClient:   FileSystemCounters
13/04/10 22:09:31 INFO mapred.JobClient:     HDFS_BYTES_READ=104394990
13/04/10 22:09:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=64078
13/04/10 22:09:31 INFO mapred.JobClient:   File Input Format Counters
13/04/10 22:09:31 INFO mapred.JobClient:     Bytes Read=104394710
13/04/10 22:09:31 INFO mapred.JobClient:   Map-Reduce Framework
13/04/10 22:09:31 INFO mapred.JobClient:     Map input records=4995670
13/04/10 22:09:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=279134208
13/04/10 22:09:31 INFO mapred.JobClient:     Spilled Records=0
13/04/10 22:09:31 INFO mapred.JobClient:     CPU time spent (ms)=129130
13/04/10 22:09:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=202833920
13/04/10 22:09:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1170251776
13/04/10 22:09:31 INFO mapred.JobClient:     Map output records=4995670
13/04/10 22:09:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=280

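As the java.class.path entry in the log shows, besides the jar files under the HADOOP_HOME path and its lib* directories, the client classpath also contains the HBase and ZooKeeper jar files we added through HADOOP_CLASSPATH. The third-party jar files specified with the -libjars option are shipped with the Job for use at runtime.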
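The java.class.path line in such a log is long and hard to scan by eye. As a rough aid, the following JDK-only sketch splits a classpath string and keeps the entries that contain a given fragment; the class ClasspathEntries and the fragment "hbase" are illustrative assumptions.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ClasspathEntries {

    /** Splits a classpath string on the separator and returns the entries containing the fragment. */
    static List<String> matching(String classpath, String separator, String fragment) {
        List<String> hits = new ArrayList<>();
        // Pattern.quote keeps the separator from being interpreted as a regular expression
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (entry.contains(fragment)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Inspect the classpath of the JVM this program runs in
        String classpath = System.getProperty("java.class.path");
        for (String entry : matching(classpath, File.pathSeparator, "hbase")) {
            System.out.println(entry);
        }
    }
}
```

Pasting the classpath value from the log into matching(...) with ":" as the separator works the same way.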

Packaging the Job code with its dependency jar files

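I prefer this approach. First, it takes advantage of what Maven does well, such as dependency management and automated builds. Second, other developers or operators who want to use the Job do not need to care about extra configuration: following Maven's build conventions is enough to produce the final deployable file, which reduces the common problems that show up when running a Job (such as an incorrectly set CLASSPATH).
With the following Maven build plugin configuration, running mvn package does all of this: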

<build>
     <plugins>
          <plugin>
               <artifactId>maven-assembly-plugin</artifactId>
               <configuration>
                    <archive>
                         <manifest>
                              <mainClass>org.shirdrn.solr.cloud.index.hadoop.SolrCloudIndexer</mainClass>
                         </manifest>
                    </archive>
                    <descriptorRefs>
                         <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
               </configuration>
               <executions>
                    <execution>
                         <id>make-assembly</id>
                         <phase>package</phase>
                         <goals>
                              <goal>single</goal>
                         </goals>
                    </execution>
               </executions>
          </plugin>
     </plugins>
</build>

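The resulting jar file is placed in the target directory, with a name such as solr-platform-2.0-jar-with-dependencies.jar. You can copy this file to the desired directory and submit it to the Hadoop cluster to run.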
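Before submitting the combined jar, it can be worth confirming that a dependency's classes really ended up inside it. The jar-with-dependencies descriptor unpacks dependencies into the final jar, so a class should appear as an ordinary entry. This is a minimal JDK-only sketch; the class name FatJarCheck is hypothetical.

```java
import java.io.IOException;
import java.util.jar.JarFile;

final class FatJarCheck {

    /** Returns true if the jar at jarPath contains a .class entry for the given class name. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        // com.example.Foo -> com/example/Foo.class
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: FatJarCheck <jar file> <fully qualified class name>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]) ? "bundled" : "NOT bundled");
    }
}
```

For instance, passing the jar under target and a class name from one of the Job's dependencies shows whether that dependency was bundled.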

Original article: https://www.cnblogs.com/zzwx/p/8820206.html