MapReduce English Interview Notes

1——What is MapReduce? (How does MapReduce work?)

MapReduce is a programming model for processing data. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.

Each phase has key-value pairs as input and output, the types of which can be chosen by the programmer (InputFormat). To implement a MapReduce job, we need to specify two functions: the map function and the reduce function.
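The two phases can be sketched in plain Java using word count, the classic example. This is an in-process simulation of the model, not the real Hadoop Mapper/Reducer API; all class and method names here are illustrative.

```java
import java.util.*;

// A minimal pure-Java sketch of the two MapReduce phases (word count).
// A real Hadoop job would implement Mapper and Reducer classes instead.
class MapReduceSketch {

    // Map phase: each input record (a line) becomes (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all values by key, as the framework does between phases.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: each key and its list of values becomes one output pair.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }
}
```

Note how each phase only sees key-value pairs: the map output type must match the reduce input type, which is exactly the contract the programmer chooses in a real job.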

2——......

Rather than using the built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization, which can be found in the org.apache.hadoop.io package.
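Those types follow the Writable pattern (e.g. IntWritable). Here is a standalone imitation of that pattern — `IntBox` is a hypothetical name, not a Hadoop class: it serializes itself compactly via DataOutput/DataInput and can be reused across records.

```java
import java.io.*;

// A sketch of the Writable pattern used by the types in org.apache.hadoop.io.
// This is a standalone imitation, not the real Hadoop IntWritable.
class IntBox {
    private int value;

    IntBox() {}
    IntBox(int value) { this.value = value; }
    int get() { return value; }
    void set(int value) { this.value = value; }

    // Mirrors Writable.write(DataOutput): exactly 4 bytes, no class
    // metadata, unlike java.io.Serializable's heavier stream format.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Mirrors Writable.readFields(DataInput): repopulates this object,
    // so a single instance can be reused for many records.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    static byte[] toBytes(IntBox w) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        w.write(new DataOutputStream(bos));
        return bos.toByteArray();
    }
}
```

The compact, reusable encoding is what makes these types cheaper to ship over the network than standard Java serialization.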

3——Data Flow

A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and the configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. (For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster, or specified when each file is created.)
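Since there is one map task per split, the number of map tasks for a file is just ceiling division of the file size by the split size; a quick sketch:

```java
// Back-of-the-envelope sketch: one map task per input split, where the
// split size defaults to the HDFS block size (64 MB in older Hadoop).
class SplitMath {
    static final long MB = 1024L * 1024L;

    // Ceiling division: e.g. a 200 MB file with 64 MB splits needs
    // 4 map tasks (three full splits plus one 8 MB remainder).
    static long numMapTasks(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }
}
```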

4——HDFS

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.

 HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

5——Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

6——NameNode and DataNode

An HDFS cluster has two types of node: a namenode, which is the master, and a number of datanodes, which act as the workers. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
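The idea that block locations are rebuilt from datanode block reports rather than persisted can be sketched as a toy in-memory map (all names here are illustrative, not Hadoop APIs):

```java
import java.util.*;

// A toy sketch of the namenode's block-location map: it is not stored on
// disk, but rebuilt from the block reports that datanodes send.
class BlockMapSketch {
    // blockId -> datanodes currently reporting that block
    private final Map<String, Set<String>> locations = new HashMap<>();

    // A datanode's block report: "I am storing these blocks."
    void receiveBlockReport(String datanode, List<String> blockIds) {
        for (String id : blockIds) {
            locations.computeIfAbsent(id, k -> new TreeSet<>()).add(datanode);
        }
    }

    // Which datanodes hold this block? Empty until someone reports it.
    Set<String> locate(String blockId) {
        return locations.getOrDefault(blockId, Collections.emptySet());
    }
}
```

After a restart the map starts empty, which is why the namenode must wait for block reports before it can serve block locations.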

The secondary namenode's main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large.

7——Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.
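Both directions can be shown in a minimal round trip: write a structured record's fields to a byte stream, then read them back in the same order. This is illustrative code, not a Hadoop API, but fixed field order on both sides is exactly how Hadoop's Writable types behave.

```java
import java.io.*;

// A sketch of serialization (object -> bytes) and deserialization
// (bytes -> object) for a simple structured record.
class SerDemo {
    static class Record {
        String name;
        int count;
        Record(String name, int count) { this.name = name; this.count = count; }
    }

    // Serialization: fields written in a fixed order, ready for the
    // network or for persistent storage.
    static byte[] serialize(Record r) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(r.name);
        out.writeInt(r.count);
        return bos.toByteArray();
    }

    // Deserialization: fields read back in the same order they were written.
    static Record deserialize(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        return new Record(in.readUTF(), in.readInt());
    }
}
```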

Original post: https://www.cnblogs.com/conie/p/3632429.html