Focusing on cloud search technologies: elasticsearch, nutch, hadoop, nosql, mongodb, hbase, cassandra, and Hadoop tuning

http://www.searchtech.pro/

Hadoop parameters to add or tune:

I. hadoop-env.sh
1. Setting Hadoop's heap size; the default is 1000 MB.

# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000

2. Changing the PID file path. PID files are kept under /tmp by default, but /tmp is periodically cleaned out by the system.

# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids

 

II. core-site.xml
1. hadoop.tmp.dir is the base directory that the Hadoop filesystem depends on (it defaults under /tmp); set it explicitly whenever possible.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

2. The buffer size available to SequenceFile reads and writes; a larger buffer reduces the number of I/O operations. The default is 4096 bytes; 65536 to 131072 is a recommended range.

<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>The size of buffer for use in sequence files.
    The size of this buffer should probably be a multiple of hardware
    page size (4096 on Intel x86), and it determines how much data is
    buffered during read and write operations.</description>
</property>
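The description's page-size rule can be sanity-checked in a few lines (a sketch; the 4096-byte page is the x86 assumption quoted in the property description, and the candidate sizes are the ones suggested above):

```python
# Suggested io.file.buffer.size values should be whole multiples of the
# hardware page size (4096 bytes on Intel x86, per the description above).
PAGE_SIZE = 4096

def pages(buffer_size):
    """Return how many whole pages a buffer covers (0 if misaligned)."""
    return buffer_size // PAGE_SIZE if buffer_size % PAGE_SIZE == 0 else 0

print(pages(4096))    # default: 1 page
print(pages(65536))   # 16 pages
print(pages(131072))  # 32 pages
```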

 

III. hdfs-site.xml
1. By default each HDFS block is 67108864 bytes (64 MB). If the files being stored and read are known to be large, this can be raised to 134217728 (128 MB).

<property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <description>The default block size for new files.</description>
</property>
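To see why larger blocks help for big files, here is a rough count of blocks (and hence namenode metadata entries) per file, using a hypothetical 1 GB file for illustration:

```python
import math

def num_blocks(file_size, block_size):
    """Number of HDFS blocks needed to store a file of file_size bytes."""
    return math.ceil(file_size / block_size)

one_gb = 1024 ** 3
print(num_blocks(one_gb, 67108864))   # 64 MB blocks -> 16 blocks
print(num_blocks(one_gb, 134217728))  # 128 MB blocks -> 8 blocks
```

Halving the block count halves both the namenode's metadata for that file and the number of block lookups a sequential reader performs.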

2. Hadoop enters safe mode on startup, during which data cannot be written. It leaves safe mode only once the configured fraction of blocks (default 0.999f) satisfies the minimum replication set by dfs.replication.min.

When dfs.replication.min is set high or there are many data nodes, this wait can be long.

<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999f</value>
  <description>
    Specifies the percentage of blocks that should satisfy
    the minimal replication requirement defined by dfs.replication.min.
    Values less than or equal to 0 mean not to start in safe mode.
    Values greater than 1 will make safe mode permanent.
  </description>
</property>
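The exit condition can be sketched as a simple predicate (the block counts below are hypothetical; the 0.999 threshold and the "values ≤ 0 mean no safe mode" rule come from the description above):

```python
def leaves_safe_mode(blocks_meeting_min_replication, total_blocks, threshold=0.999):
    """True once the fraction of blocks satisfying dfs.replication.min
    reaches dfs.safemode.threshold.pct."""
    if threshold <= 0:  # per the description: do not start in safe mode
        return True
    return blocks_meeting_min_replication / total_blocks >= threshold

print(leaves_safe_mode(9990, 10000))  # exactly at the threshold -> True
print(leaves_safe_mode(9989, 10000))  # just below -> False
```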

3. The storage paths for the namenode and datanode; by default both live under ${hadoop.tmp.dir}/dfs/.

<property>
    <name>dfs.name.dir</name>
    <value>${hadoop.tmp.dir}/dfs/name</value>
    <description>Directory in NameNode's local filesystem to store HDFS's metadata.</description>
</property>

<property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
    <description>Directory in a DataNode's local filesystem to store HDFS's file blocks.</description>
</property>

 

IV. mapred-site.xml
1. The buffer size for caching intermediate map output (default 100 MB).

Intermediate results produced by a map task are not written straight to disk. Hadoop buffers partial results in memory and pre-sorts them there to improve overall map performance.
As the map runs it keeps writing results into this buffer, but the buffer cannot always hold the full map output. Once the output passes a threshold, the map must flush the buffered data to disk; in MapReduce this is called a spill. Raising io.sort.mb lowers the spill count, so the map task touches the disk less often and finishes faster.

<property>
    <name>io.sort.mb</name>
    <value>100</value>
    <description>
       The total amount of buffer memory to use while sorting
    files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
    </description>
</property>    
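A back-of-the-envelope spill count shows the effect of raising io.sort.mb. This is a sketch: the map output size is hypothetical, the 0.80 spill fraction assumes the companion io.sort.spill.percent setting's default in this Hadoop generation, and the real buffer also tracks per-record metadata:

```python
import math

def spill_count(map_output_mb, io_sort_mb, spill_percent=0.80):
    """Rough number of spills: the buffer flushes each time the map has
    produced io_sort_mb * spill_percent megabytes of output."""
    spill_threshold_mb = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / spill_threshold_mb))

print(spill_count(400, 100))  # default 100 MB buffer -> 5 spills
print(spill_count(400, 200))  # doubled buffer -> 3 spills
```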

2. The maximum number of files merged at once when sorting files (default 10).

When the map task's processing completes, one or more spill files exist; before exiting normally, the map must merge them into a single file.
io.sort.factor governs this merge: it is the maximum number of streams that can feed the merged file in parallel.
Raising io.sort.factor reduces the number of merge passes, so the map reads and writes the disk less often, improving map performance.

<property>
    <name>io.sort.factor</name>
    <value>10</value>
    <description>The number of streams to merge at once while sorting files. This determines the number of open file handles.</description>
</property>
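The merge-pass savings can be sketched with a simplified model where each pass merges up to io.sort.factor files into one (Hadoop's actual merge planner is more sophisticated, but the trend is the same):

```python
def merge_passes(num_spills, io_sort_factor=10):
    """Passes needed if each pass merges up to io_sort_factor files into one."""
    passes = 0
    files = num_spills
    while files > 1:
        files = -(-files // io_sort_factor)  # ceiling division
        passes += 1
    return passes

print(merge_passes(100, 10))   # 100 -> 10 -> 1: 2 passes over the data
print(merge_passes(100, 100))  # a larger factor finishes in 1 pass
```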

3. The number of JobTracker handler threads (default 10).

<property>
    <name>mapred.job.tracker.handler.count</name>
    <value>10</value>
    <description>
       The number of server threads for the JobTracker. This should be roughly
    4% of the number of tasktracker nodes.
    </description>
</property>    
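The "roughly 4% of tasktracker nodes" guidance from the description, floored at the default, can be sketched as (node counts below are hypothetical):

```python
def suggested_handler_count(num_tasktrackers, default=10):
    """Roughly 4% of tasktracker nodes, per the property description,
    never below the default of 10."""
    return max(default, round(num_tasktrackers * 0.04))

print(suggested_handler_count(100))   # small cluster: stay at the default 10
print(suggested_handler_count(1000))  # 4% of 1000 nodes -> 40 handlers
```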

4. The default number of map/reduce tasks per job (map: 2, reduce: 1).

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job. Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
    of the cluster's reduce capacity, so that if a node fails the reduces
    can still be executed in a single wave. Ignored when mapred.job.tracker is "local".
  </description>
</property>
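The "99% of the cluster's reduce capacity" rule from the description can be sketched; the node count and per-node slot count here are hypothetical (2 slots matches the tasktracker default discussed next):

```python
def suggested_reduce_tasks(num_nodes, reduce_slots_per_node=2, fraction=0.99):
    """Reduce count just under cluster capacity, so losing one node still
    lets all reduces finish in a single wave."""
    return int(round(num_nodes * reduce_slots_per_node * fraction))

print(suggested_reduce_tasks(100))  # 100 nodes x 2 slots -> 198 reduces
```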

5. The maximum number of map/reduce tasks a single tasktracker node can run in parallel (map: 2, reduce: 2).

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of map tasks that will be run simultaneously by a task tracker. </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker. </description>
</property>

6. Caps the number of completed jobs per user kept in the JobTracker's memory (default 100).
This is the maximum number of finished jobs retained before they are handed off to the job history. Since usually only the running queue matters, lowering this value can reduce memory usage.
From version 0.21.0 on there is no need to set this parameter: 0.21 reworked the completeuserjobs handling to write finished jobs to disk promptly instead of keeping them in memory long-term.

<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>100</value>
  <description>The maximum number of complete jobs per user to keep
    around before delegating them to the job history.
  </description>
</property>

7. The number of parallel transfers a reduce runs when copying map output (default 5).
By default each reduce uses only 5 parallel download threads to copy data from maps, so even if 100 or more maps finish within a given window, the reduce can still fetch from at most 5 of them at once.
Raising this parameter therefore suits jobs with many maps that finish quickly, letting each reduce pull its share of the data sooner.

<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>5</value>
  <description>The default number of parallel transfers run by reduce during the copy(shuffle) phase.</description>
</property>
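Ignoring transfer-time overlap, the number of fetch rounds a reduce needs is a simple ceiling division (the 100-map job is the hypothetical from the paragraph above):

```python
import math

def fetch_waves(num_map_outputs, parallel_copies=5):
    """Rounds of fetching a reduce needs if it can download from at most
    parallel_copies maps at a time (a simplification: real transfers overlap)."""
    return math.ceil(num_map_outputs / parallel_copies)

print(fetch_waves(100, 5))   # default: 20 rounds
print(fetch_waves(100, 20))  # raised mapred.reduce.parallel.copies: 5 rounds
```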

8. Whether to compress intermediate map output (default false).
When set to true, the map compresses intermediate results before writing them to disk, and reads them back by decompressing first.
The consequence: less intermediate data is written to disk, at the cost of some CPU spent compressing and decompressing.
This approach suits jobs whose intermediate output is very large and whose bottleneck is disk I/O rather than CPU; put plainly, it trades CPU for I/O.

<property>
  <name>mapred.compress.map.output</name>
  <value>false</value>
  <description>Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. </description>
</property>
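The CPU-for-I/O trade is easy to see on repetitive key/value data, the shape map output often has. This is a generic illustration using Python's zlib, not Hadoop's actual SequenceFile codec, and the sample data is made up:

```python
import zlib

# Hypothetical intermediate map output: highly repetitive key/value text,
# the kind of data where spill compression pays off.
data = b"word\t1\n" * 200_000

compressed = zlib.compress(data)
print(len(data), "->", len(compressed), "bytes")  # far fewer bytes hit disk

assert zlib.decompress(compressed) == data  # lossless round trip
```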

9. The heap size for the child JVMs launched by the tasktracker (default 200 MB).

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m -verbose:gc -Xloggc:/tmp/@taskid@.gc</value>
    <description>Java opts for the task tracker child processes.
        The following symbol, if present, will be interpolated: @taskid@ is
        replaced
        by current TaskID. Any other occurrences of '@' will go unchanged.
        For example, to enable verbose gc logging to a file named for the taskid
        in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
        -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc

        The configuration variable mapred.child.ulimit can be used to control
        the maximum virtual memory of the child processes. 
    </description>
</property>
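The @taskid@ substitution described above amounts to a plain string replacement, with all other '@' characters left untouched (the task attempt ID below is a made-up example in the Hadoop 1.x naming style):

```python
def interpolate_task_opts(opts, task_id):
    """Substitute @taskid@ as the description says: only that symbol is
    replaced; any other occurrence of '@' goes through unchanged."""
    return opts.replace("@taskid@", task_id)

opts = "-Xmx200m -verbose:gc -Xloggc:/tmp/@taskid@.gc"
print(interpolate_task_opts(opts, "attempt_200901010000_0001_m_000000_0"))
```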





Original article: https://www.cnblogs.com/kofxxf/p/4110097.html