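Internship Summary

Common commands:

Create a Maven project skeleton (with recent versions of the maven-archetype-plugin, use archetype:generate instead of the deprecated archetype:create): mvn archetype:create -DartifactId=Test -DgroupId=com.laiwang.algo.antispam.test -DarchetypeArtifactId=maven-archetype-profiles -DpackageName=com.laiwang.algo.antispam.test

As for the pom.xml once the project has been created, just copy one from a colleague.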

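The first line of the code is the XML header, which specifies the version and encoding of the XML document. Next comes the project element, the root element of every pom.xml. It also declares the POM-related namespaces and the xsd schema location. These attributes are not required, but they let third-party tools (such as the XML editor in an IDE) help us edit the POM quickly.

The first child of the root element, modelVersion, specifies the version of the POM model. For Maven 2 and Maven 3 it can only be 4.0.0.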

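The most important lines in this code are groupId, artifactId, and version. These three elements define a project's basic coordinates; in the Maven world, every jar, pom, or war is identified by these coordinates.

groupId defines the group the project belongs to, which is usually tied to the organization or company that owns the project. For example, if you create a project named myapp on googlecode, the groupId should be com.googlecode.myapp; if your company is mycom and it has a project named myapp, the groupId should be com.mycom.myapp. All code in this book uses the groupId com.juvenxu.mvnbook.

artifactId defines the ID of the current Maven project, unique within its group. We give this Hello World project the artifactId hello-world; code in other chapters of this book is assigned other artifactIds. In the earlier example with groupId com.googlecode.myapp, you might assign artifactIds to the different subprojects (modules), such as myapp-util, myapp-domain, and myapp-web.

As the name suggests, version specifies the current version of the Hello World project: 1.0-SNAPSHOT. SNAPSHOT means the project is still under development and the version is unstable. As the project evolves, the version is updated, for example to 1.0, 1.1-SNAPSHOT, 1.1, 2.0, and so on.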
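
To make the role of the three coordinates concrete, here is a small shell sketch of how Maven's default repository layout turns groupId, artifactId, and version into a file path. The helper name coords_to_path and the default jar packaging are assumptions made for illustration; the hello-world coordinates are the ones used above.

```shell
#!/bin/sh
# Sketch: map Maven coordinates to the artifact's relative path in the
# default repository layout. coords_to_path is a hypothetical helper name.
coords_to_path() {
    group_id=$1
    artifact_id=$2
    version=$3
    packaging=${4:-jar}   # assumed default packaging

    # Dots in the groupId become directory separators.
    group_dir=$(printf '%s' "$group_id" | tr '.' '/')

    printf '%s/%s/%s/%s-%s.%s\n' \
        "$group_dir" "$artifact_id" "$version" \
        "$artifact_id" "$version" "$packaging"
}

coords_to_path com.juvenxu.mvnbook hello-world 1.0-SNAPSHOT
# prints com/juvenxu/mvnbook/hello-world/1.0-SNAPSHOT/hello-world-1.0-SNAPSHOT.jar
```

In a local repository this relative path sits under ~/.m2/repository by default.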

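We can define a Maven project's POM without any actual Java code. This shows one of Maven's strengths: it keeps the project object model as independent of the code as possible, which we can call decoupling or orthogonality, and which largely prevents Java code and POM code from affecting each other. For example, upgrading the project version only requires changing the POM, not the Java code; and once the POM is stable, day-to-day Java development rarely touches it.

A pom for reference: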

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.laiwang.algo.antispam.test</groupId>
  <artifactId>Test</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>Maven Quick Start Archetype</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <build>
    <defaultGoal>package</defaultGoal>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
          <encoding>UTF-8</encoding>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.6</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.0.0-alpha</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>2.0.0-mr1-cdh4.5.0</version>
    </dependency>
    <dependency>
      <groupId>com.github.jsimone</groupId>
      <artifactId>webapp-runner</artifactId>
      <version>7.0.34.0</version>
    </dependency>
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.4</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.1</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.15</version>
    </dependency>
    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>1.2</version>
    </dependency>
  </dependencies>
</project>


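Run (java -jar needs a Main-Class entry in the jar's manifest; otherwise name the main class with -cp):
java -jar target/Test-1.0-SNAPSHOT.jar
java -cp target/Test-1.0-SNAPSHOT.jar com.laiwang.algo.antispam.test.App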

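Next is svn; for details see http://blog.csdn.net/ithomer/article/details/6187464

Create a directory in the svn repository and check it out (co). Keep adding files as you work (add), run st to review the status, and commit (ci) once everything looks right. svn up updates the working copy, optionally to a specific revision.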
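
The cycle above could look roughly like the sketch below. The repository URL, directory name, file name, and revision number are all hypothetical placeholders, and the SVN variable is an assumption added so the commands can be printed (SVN=echo) instead of executed.

```shell
#!/bin/sh
# Sketch of the svn cycle described above. REPO_URL, the directory name and
# the file name are hypothetical placeholders, not values from these notes.
REPO_URL="http://svn.example.com/repo/antispam-test"
SVN="${SVN:-svn}"   # assumption: run with SVN=echo to only print the commands

svn_workflow() {
    "$SVN" mkdir -m "create project directory" "$REPO_URL"  # create the directory in the repository
    "$SVN" co "$REPO_URL" antispam-test                     # check out a working copy
    "$SVN" add antispam-test/App.java                       # schedule a new file for addition
    "$SVN" st antispam-test                                 # review pending changes
    "$SVN" ci -m "add App.java" antispam-test               # commit once the status looks right
    "$SVN" up -r 100 antispam-test                          # update to a specific revision (100 is made up)
}
```

Calling svn_workflow runs the six commands in order; with SVN=echo it only prints them, which is a cheap way to check the sequence before touching a real repository.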

Next, shell scripts.

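[ -f "$file" ] tests whether $file is a regular file

[ "$a" -lt 3 ] tests whether the value of $a is less than 3; likewise, -gt means greater than and -le means less than or equal

[ -x "$file" ] tests whether $file exists and is executable; likewise, -r tests whether it is readable

[ -n "$a" ] tests whether $a is non-empty; use -z to test for an empty string

[ "$a" = "$b" ] tests whether $a and $b hold the same string

[ cond1 -a cond2 ] tests whether cond1 and cond2 both hold; -o tests whether at least one of them holds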
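
A quick way to get a feel for these tests is to run them against a temporary file. The sketch below does that; the variable values and the helper names describe and run_checks are made up for illustration.

```shell
#!/bin/sh
# Sketch: exercising the [ ] tests above on a temporary file.
# Variable values and helper names are invented for illustration.
describe() {
    # Print "true" or "false" depending on the exit status of the test.
    if "$@"; then echo true; else echo false; fi
}

run_checks() {
    file=$(mktemp)      # an empty regular file without execute permission
    a=2
    b="hello"

    describe [ -f "$file" ]                  # true:  regular file
    describe [ "$a" -lt 3 ]                  # true:  2 < 3
    describe [ -x "$file" ]                  # false: not executable
    describe [ -n "$b" ]                     # true:  non-empty string
    describe [ -z "$b" ]                     # false: string is not empty
    describe [ "$a" = "$b" ]                 # false: "2" differs from "hello"
    describe [ -f "$file" -a "$a" -lt 3 ]    # true:  both conditions hold
    describe [ -x "$file" -o -n "$b" ]       # true:  the second condition holds

    rm -f "$file"
}

run_checks
```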

Flag    Meaning
1. File type tests, e.g. test -e filename checks whether the name exists
-e    Does the name exist? (common)
-f    Does the name exist and is it a regular file? (common)
-d    Does the name exist and is it a directory? (common)
-b    Does the name exist and is it a block device?
-c    Does the name exist and is it a character device?
-S    Does the name exist and is it a socket?
-p    Does the name exist and is it a FIFO (pipe)?
-L    Does the name exist and is it a symbolic link?
2. File permission tests, e.g. test -r filename checks readability (root is often an exception)
-r    Does the name exist and is it readable?
-w    Does the name exist and is it writable?
-x    Does the name exist and is it executable?
-u    Does the name exist and have the SUID attribute?
-g    Does the name exist and have the SGID attribute?
-k    Does the name exist and have the sticky bit set?
-s    Does the name exist and is it a non-empty file?
3. Comparing two files, e.g. test file1 -nt file2
-nt    (newer than) Is file1 newer than file2?
-ot    (older than) Is file1 older than file2?
-ef    Are file1 and file2 the same file? Useful for detecting hard links: it checks whether both names point to the same inode.
4. Comparing two integers, e.g. test n1 -eq n2
-eq    The two values are equal (equal)
-ne    The two values are not equal (not equal)
-gt    n1 is greater than n2 (greater than)
-lt    n1 is less than n2 (less than)
-ge    n1 is greater than or equal to n2 (greater than or equal)
-le    n1 is less than or equal to n2 (less than or equal)
5. String tests
test -z string    Is the string's length zero? True if string is empty.
test -n string    Is the string's length non-zero? False if string is empty.
Note: -n may be omitted.
test str1 = str2    Is str1 equal to str2? True if they are equal.
test str1 != str2    Is str1 different from str2? False if they are equal.
6. Combining conditions, e.g. test -r filename -a -x filename
-a    (and) Both conditions hold. For example, test -r file -a -x file is true only when file has both r and x permissions.
-o    (or) Either condition holds. For example, test -r file -o -x file is true when file has r or x permission.
!    Negation. For example, test ! -x file is true when file is not executable.
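
Some of the less common flags in the table are easier to remember after seeing them fire. The sketch below builds a scratch directory and runs a handful of them; the file names and the helper name file_checks are invented for illustration, and -nt / -ef are assumed to be supported by the shell's test builtin (they are in bash and dash).

```shell
#!/bin/sh
# Sketch: a few flags from the table, run inside a scratch directory.
# File names and the helper name are invented for illustration.
file_checks() {
    dir=$(mktemp -d)
    printf 'data\n' > "$dir/full"          # non-empty file
    : > "$dir/empty"                       # zero-length file
    ln "$dir/full" "$dir/hardlink"         # second name for the same inode
    : > "$dir/old"
    touch -t 200001010000 "$dir/old"       # backdate to the year 2000

    test -d "$dir"                        && echo "dir: is a directory"
    test -s "$dir/full"                   && echo "full: non-empty"
    test ! -s "$dir/empty"                && echo "empty: zero length"
    test "$dir/full" -nt "$dir/old"       && echo "full: newer than old"
    test "$dir/full" -ef "$dir/hardlink"  && echo "hardlink: same inode as full"
    test -e "$dir/missing"                || echo "missing: does not exist"

    rm -rf "$dir"
}

file_checks
```

Each of the six lines is printed only when the corresponding test behaves as the table describes.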


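# cut example: list the 3rd and 5th entries of $PATH
[root@www ~]# echo $PATH | cut -d ':' -f 3,5

# sort example: /etc/passwd is separated by ":"; sort it by the third field (this sorts as text; add -n for numeric order)
[root@www ~]# cat /etc/passwd | sort -t ':' -k 3

# tr example 1: convert all lowercase letters in the output of last to uppercase
[root@www ~]# last | tr '[a-z]' '[A-Z]'

# tr example 2: delete the colons (:) from the contents of /etc/passwd
[root@www ~]# cat /etc/passwd | tr -d ':'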
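
Because the results of the commands above depend on the local $PATH and /etc/passwd, here is a sketch that runs the same cut, sort, and tr calls on a small made-up sample. The sample values and the helper name text_demo are assumptions for illustration, and sort is given -n so the third field is compared as a number.

```shell
#!/bin/sh
# Sketch: the cut / sort / tr calls from above on invented sample data.
sample_path="/usr/bin:/bin:/usr/local/bin:/sbin:/opt/bin"
sample_passwd="root:x:0:0
daemon:x:2:2
bin:x:1:1
user:x:10:100"

text_demo() {
    # Fields 3 and 5 of a ':'-separated list, still joined by ':'.
    echo "$sample_path" | cut -d ':' -f 3,5

    # Sort by the third field as a number (as text, "10" would sort before "2").
    echo "$sample_passwd" | sort -t ':' -k 3 -n

    # Lowercase to uppercase.
    echo "hello" | tr '[a-z]' '[A-Z]'

    # Delete every colon.
    echo "root:x:0:0" | tr -d ':'
}

text_demo
```

The first line of output is /usr/local/bin:/opt/bin, followed by the four sample rows ordered root, bin, daemon, user, then HELLO and rootx00.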

awk

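# Sum column 2 over the rows whose column 3 is 199.168.148.90
hadoop fs -text /checkout/* | awk '{if($3=="199.168.148.90") sum += $2 } END {print sum}'
# Print the rows whose column 2 is greater than 50
awk '{if($2>50) print $0}'
# Sum column 2 over the rows whose column 1 matches the URL (match() treats the string as a regular expression, so each "." matches any character)
awk '{if(match($1,"www.laiwang.com/user/qrcode/is_logged.json") ) sum += $2 } END {print sum}'
# Sum column 2 over all rows
awk '{ sum += $2 } END {print sum}'
# Sum column 2 over the rows that match the URL and whose column 3 is 183.63.136.35
awk '{if(match($1,"www.laiwang.com/user/qrcode/is_logged.json") && $3=="183.63.136.35") sum += $2 } END {print sum}'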
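
To try these patterns without a Hadoop cluster, the sketch below feeds awk a made-up three-column log (URL, hit count, client IP). The sample rows and the helper names are invented for illustration. Note two deliberate differences from the one-liners above: it uses index() for a literal substring match instead of match(), and it prints sum + 0 so an empty result shows as 0.

```shell
#!/bin/sh
# Sketch: the awk patterns above applied to an invented three-column log
# (URL, hit count, client IP).
sample_log() {
    cat <<'EOF'
www.laiwang.com/user/qrcode/is_logged.json 40 183.63.136.35
www.laiwang.com/user/qrcode/is_logged.json 60 199.168.148.90
www.laiwang.com/index.html 70 183.63.136.35
EOF
}

# Sum column 2 over all rows.
sum_all() { awk '{ sum += $2 } END { print sum + 0 }'; }

# Sum column 2 over rows whose column 3 equals the given IP.
sum_for_ip() { awk -v ip="$1" '$3 == ip { sum += $2 } END { print sum + 0 }'; }

# Sum column 2 over rows whose column 1 contains the given literal text.
sum_for_url() { awk -v url="$1" 'index($1, url) { sum += $2 } END { print sum + 0 }'; }

sample_log | sum_all                        # prints 170
sample_log | sum_for_ip 199.168.148.90      # prints 60
sample_log | sum_for_url is_logged.json     # prints 100
sample_log | awk '$2 > 50'                  # prints the 60 and 70 rows
```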

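Hadoop settings can also be configured in the pom.

mapred.child.java.opts: the child JVM heap size, typically set to <value>-Xmx1024m</value>
io.sort.mb: <value>900</value>
When a map task starts running and produces intermediate data, the intermediate results are not simply written straight to disk. The process in between is fairly involved:
a memory buffer caches the partial results produced so far, and some pre-sorting is done inside that buffer to improve the overall performance of the map.
Each map has its own memory buffer (MapOutputBuffer, the "buffer in memory" in the figure above).
The map first writes its partial results into this buffer. The buffer is 100 MB by default,
but its size can be adjusted through a parameter set when the job is submitted: io.sort.mb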

<property>
<name>io.sort.spill.percent</name>
<value>0.95</value>
<description></description>
</property>
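When the map output exceeds a certain threshold (say 100 MB), the map has to write the data in the buffer to disk; in MapReduce this process is called a spill.
The map does not wait until the buffer is completely full before spilling, because spilling only when the buffer is full would leave the map's computation waiting for buffer space to be freed.
So the map actually starts spilling once the buffer has filled to a certain level (say 80%). This threshold is also controlled by a job configuration parameter, io.sort.spill.percent,
which defaults to 0.80, i.e. 80%.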
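
As a rough feel for how the two parameters interact, the sketch below multiplies the buffer size by the spill fraction. It is only an estimate under the simple model described above (it ignores any bookkeeping overhead inside the buffer), and spill_threshold_mb is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: approximate amount of buffered map output (in MB) at which a
# spill starts. Inputs mirror io.sort.mb and io.sort.spill.percent.
spill_threshold_mb() {
    awk -v mb="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", mb * pct }'
}

spill_threshold_mb 100 0.80   # the defaults described above: prints 80
spill_threshold_mb 900 0.95   # the values from these notes: prints 855
```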

Variables with names of your own choosing can be configured in the same way.