The Serialization Mechanism of the Hadoop Big Data Framework

Java's Built-in Serialization Mechanism

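A Java object serialized on Windows can be reconstructed on UNIX, with no need to worry about how different machines represent data or about byte order.
Making the instances of a class serializable in Java is very simple: add implements Serializable to the class declaration. Serializable is a marker interface with no methods, defined as follows: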
public interface Serializable {
}
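By declaring that it implements the Serializable interface, the Block class immediately gains the serialization support that Java provides. The declaration looks like this: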
public class Block implements Writable, Comparable<Block>, Serializable
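
To make the role of the marker concrete, here is a minimal sketch. PlainPoint, MarkedPoint and MarkerInterfaceDemo are hypothetical names invented for this illustration and are not part of Hadoop. The two classes are identical except for the implements Serializable clause, and only the marked one is accepted by writeObject().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical demo class: does NOT implement Serializable.
class PlainPoint {
    long x = 1L;
}

// Hypothetical demo class: same field, but carries the marker interface.
class MarkedPoint implements Serializable {
    private static final long serialVersionUID = 1L;
    long x = 1L;
}

class MarkerInterfaceDemo {
    /** Returns true if Java's built-in serialization accepts the object. */
    static boolean canSerialize(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            // Thrown when the object's class lacks the Serializable marker.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new PlainPoint()));   // false
        System.out.println(canSerialize(new MarkedPoint()));  // true
    }
}
```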

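Because serialization is used mainly in I/O-related operations, it is implemented with a pair of input/output streams. To serialize an object, create an ObjectOutputStream on top of some OutputStream and then call writeObject().

Here is an example of serializing an object: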

Block block1 = new Block(7806259420524417791L, 39447755L, 56736651L);
... ...
ByteArrayOutputStream out = new ByteArrayOutputStream();
ObjectOutputStream objOut = new ObjectOutputStream(out);
objOut.writeObject(block1);

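However, the serialized object is rather bloated. Take the Block class: it holds only three long integers (24 bytes of data), yet its serialized form takes more than 100 bytes, because the stream also records the class name, the serial version UID, and the name and type of every field. The dump below, for a Block in the package org.seandeng.test, is 108 bytes long: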

-84, -19, 0, 5, 115, 114, 0, 23, 111, 114, 103, 46, 115, 101, 97, 110, 100, 101, 110, 103, 46, 116, 101, 115, 116, 46, 66, 108, 111, 99, 107, 40, -7, 56, 46, 72, 64, -69, 45, 2, 0, 3, 74, 0, 7, 98, 108, 111, 99, 107, 73, 100, 74, 0, 16, 103, 101, 110, 101, 114, 97, 116, 105, 111, 110, 115, 83, 116, 97, 109, 112, 74, 0, 8, 110, 117, 109, 66, 121, 116, 101, 115, 120, 112, 108, 85, 103, -107, 104, -25, -110, -1, 0, 0, 0, 0, 3, 97, -69, -117, 0, 0, 0, 0, 2, 89, -20, -53
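
The overhead can be reproduced with a short sketch. ThreeLongs and JavaSerializationSizeDemo are hypothetical stand-ins written for this illustration, since the full Block class is not shown here. The exact length depends on the class, package and field names, so the sketch only prints it. The first four bytes are always the stream magic number 0xACED followed by version 5, which is what the leading -84, -19, 0, 5 in the dump above represent.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Block: three long fields and nothing else.
class ThreeLongs implements Serializable {
    private static final long serialVersionUID = 1L;
    final long blockId;
    final long numBytes;
    final long generationStamp;

    ThreeLongs(long blockId, long numBytes, long generationStamp) {
        this.blockId = blockId;
        this.numBytes = numBytes;
        this.generationStamp = generationStamp;
    }
}

class JavaSerializationSizeDemo {
    /** Serializes an object with Java's built-in mechanism. */
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds an object; Java creates a brand-new instance on every call. */
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new ThreeLongs(1L, 2L, 3L));
        // Far more than the 24 bytes of payload: class and field metadata are included.
        System.out.println("serialized length = " + data.length);
        // Stream header: magic 0xACED, then stream version 5.
        System.out.printf("header = %02X %02X %02X %02X%n", data[0], data[1], data[2], data[3]);
    }
}
```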

Hadoop's Serialization Mechanism

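Java serialization works by calling writeObject() on an ObjectOutputStream. Hadoop instead serializes an object by calling the object's own write() method, which takes a parameter of type DataOutput, to write it to the stream. Deserialization is similar: the object's readFields() method reads the data back from the stream. It is worth noting that Java deserialization creates a new object every time, whereas Hadoop's deserialization lets the user reuse an existing object, which reduces Java object allocation and garbage collection and makes the application more efficient.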

public static void main(String[] args) {
    try {
        Block block1 = new Block(1L, 2L, 3L);
        ... ...
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        // DataOutputStream has no no-argument constructor; wrap the byte buffer
        DataOutputStream dout = new DataOutputStream(bout);
        block1.write(dout);
        dout.close();
        ... ...
    }
    ... ...
}

Because a Block object writes only its three long integers when serialized, the serialized form of block1 is just 24 bytes in total.
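
The 24-byte figure is easy to check without a Hadoop installation. The sketch below uses a hypothetical CompactBlock class that only imitates the shape of a write(DataOutput) method; it is not the real Hadoop Block. Each writeLong() call emits exactly 8 bytes in big-endian order, so three fields give 3 × 8 = 24 bytes with no metadata at all.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for Block that imitates a Hadoop-style write() method.
class CompactBlock {
    long blockId;
    long numBytes;
    long generationStamp;

    CompactBlock(long blockId, long numBytes, long generationStamp) {
        this.blockId = blockId;
        this.numBytes = numBytes;
        this.generationStamp = generationStamp;
    }

    // Writes only the raw field values: no class name, no field names.
    void write(DataOutput out) throws IOException {
        out.writeLong(blockId);
        out.writeLong(numBytes);
        out.writeLong(generationStamp);
    }
}

class CompactSizeDemo {
    /** Serializes a block into a byte array through a DataOutputStream. */
    static byte[] toBytes(CompactBlock block) {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (DataOutputStream dout = new DataOutputStream(bout)) {
            block.write(dout);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bout.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(toBytes(new CompactBlock(1L, 2L, 3L)).length);  // 24
    }
}
```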

The Hadoop Writable Mechanism

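Hadoop introduces the org.apache.hadoop.io.Writable interface as the interface that every serializable object must implement.

Unlike java.io.Serializable, Writable is not a marker interface. It declares two methods: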

public interface Writable {
    /**
     * Serialize the fields of this object to <code>out</code>.
     *
     * @param out <code>DataOutput</code> to serialize this object into.
     * @throws IOException
     */
    void write(DataOutput out) throws IOException;

    /**
     * Deserialize the fields of this object from <code>in</code>.
     *
     * <p>For efficiency, implementations should attempt to re-use storage in the
     * existing object where possible.</p>
     *
     * @param in <code>DataInput</code> to deserialize this object from.
     * @throws IOException
     */
    void readFields(DataInput in) throws IOException;
}

The Writable.write(DataOutput out) method writes the object to a binary DataOutput; deserialization is done by readFields(DataInput in), which reads the object's state from a DataInput stream. Here is an example:

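import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Block implements Writable {
    private long blockId;
    private long numBytes;
    private long generationsStamp;

    // Write the three fields as fixed-width 8-byte longs
    public void write(DataOutput out) throws IOException {
        out.writeLong(blockId);
        out.writeLong(numBytes);
        out.writeLong(generationsStamp);
    }

    // Read the fields back in the same order they were written
    public void readFields(DataInput in) throws IOException {
        this.blockId = in.readLong();
        this.numBytes = in.readLong();
        this.generationsStamp = in.readLong();
        // Reject corrupt input: a block size can never be negative
        if (numBytes < 0) {
            throw new IOException("Unexpected block size:" + numBytes);
        }
    }
}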
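
The object reuse that readFields() makes possible can be sketched as follows. ReusableBlock and ObjectReuseDemo are hypothetical names for this illustration, and the class deliberately does not depend on Hadoop, so it only imitates the Writable method shapes. Several records are written to one stream and then read back into a single instance, so the read loop allocates no new objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical class imitating the write()/readFields() pair of a Writable.
class ReusableBlock {
    long blockId;
    long numBytes;
    long generationStamp;

    void set(long blockId, long numBytes, long generationStamp) {
        this.blockId = blockId;
        this.numBytes = numBytes;
        this.generationStamp = generationStamp;
    }

    void write(DataOutput out) throws IOException {
        out.writeLong(blockId);
        out.writeLong(numBytes);
        out.writeLong(generationStamp);
    }

    // Overwrites the fields of THIS instance instead of creating a new object.
    void readFields(DataInput in) throws IOException {
        blockId = in.readLong();
        numBytes = in.readLong();
        generationStamp = in.readLong();
    }
}

class ObjectReuseDemo {
    /** Writes count records, reads them back into one instance, returns the sum of numBytes. */
    static long roundTripSum(int count) {
        try {
            ByteArrayOutputStream bout = new ByteArrayOutputStream();
            DataOutputStream dout = new DataOutputStream(bout);
            ReusableBlock block = new ReusableBlock();
            for (int i = 1; i <= count; i++) {
                block.set(i, 100L * i, 0L);
                block.write(dout);
            }
            dout.flush();

            DataInputStream din =
                new DataInputStream(new ByteArrayInputStream(bout.toByteArray()));
            long sum = 0;
            for (int i = 0; i < count; i++) {
                block.readFields(din);  // same object, refilled for every record
                sum += block.numBytes;
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripSum(3));  // 600 (100 + 200 + 300)
    }
}
```

With Java's built-in mechanism, the same loop would call readObject() and receive a freshly allocated object for every record.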

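Hadoop's serialization mechanism also includes several other important types: the WritableComparable and RawComparator interfaces and the WritableComparator class.

Comparable is the interface implemented by objects that can compare themselves with other objects of the same type (an Integer, for example, can determine by itself whether it is larger or smaller than another Integer). A class implements its compareTo() method, which takes the object to compare against.

Comparator, by contrast, is a dedicated comparer that orders two objects. A class implements its compare() method, which takes both of the objects to be compared.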
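
The difference between the two interfaces can be shown with plain java.lang.Comparable and java.util.Comparator. SizedBlock, BySizeComparator and ComparisonDemo are hypothetical names for this sketch. The class defines its own natural order by block ID, while a separate comparator imposes a different order by size without touching the class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical class with a built-in (natural) order by block ID.
class SizedBlock implements Comparable<SizedBlock> {
    final long blockId;
    final long numBytes;

    SizedBlock(long blockId, long numBytes) {
        this.blockId = blockId;
        this.numBytes = numBytes;
    }

    // Comparable: the object compares ITSELF with one other object.
    @Override
    public int compareTo(SizedBlock other) {
        return Long.compare(blockId, other.blockId);
    }
}

// Comparator: an external object that compares TWO given objects.
class BySizeComparator implements Comparator<SizedBlock> {
    @Override
    public int compare(SizedBlock a, SizedBlock b) {
        return Long.compare(a.numBytes, b.numBytes);
    }
}

class ComparisonDemo {
    private static List<Long> ids(List<SizedBlock> blocks) {
        List<Long> result = new ArrayList<>();
        for (SizedBlock b : blocks) {
            result.add(b.blockId);
        }
        return result;
    }

    /** Sorts a copy using compareTo() and returns the block IDs in that order. */
    static List<Long> idsInNaturalOrder(List<SizedBlock> blocks) {
        List<SizedBlock> copy = new ArrayList<>(blocks);
        Collections.sort(copy);
        return ids(copy);
    }

    /** Sorts a copy using the external comparator and returns the block IDs. */
    static List<Long> idsBySize(List<SizedBlock> blocks) {
        List<SizedBlock> copy = new ArrayList<>(blocks);
        copy.sort(new BySizeComparator());
        return ids(copy);
    }

    public static void main(String[] args) {
        List<SizedBlock> blocks = Arrays.asList(
            new SizedBlock(3L, 10L), new SizedBlock(1L, 30L), new SizedBlock(2L, 20L));
        System.out.println(idsInNaturalOrder(blocks));  // [1, 2, 3]
        System.out.println(idsBySize(blocks));          // [3, 2, 1]
    }
}
```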

Source: http://seandeng888.iteye.com/blog/2159914
