MapReduce English Interview Notes

1——What is MapReduce? (How does MapReduce work?)

MapReduce is a programming model for processing data. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.

Each phase has key-value pairs as input and output, the types of which can be chosen by the programmer (InputFormat). To implement a MapReduce job, we need to specify two functions: the map function and the reduce function.
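The two phases can be sketched in plain Java using word count, the classic example. This is an in-process simulation of the model, not the real Hadoop Mapper/Reducer API; all class and method names here are illustrative.

```java
import java.util.*;

// A minimal pure-Java sketch of the two MapReduce phases (word count).
// A real Hadoop job would implement Mapper and Reducer classes instead.
class MapReduceSketch {

    // Map phase: each input record (a line) becomes (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all values by key, as the framework does between phases.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: each key and its list of values becomes one output pair.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }
}
```

Note how each phase only sees key-value pairs: the map output type must match the reduce input type, which is exactly the contract the programmer chooses in a real job.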

2——......

Rather than using the built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization, which can be found in the org.apache.hadoop.io package.
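Those types follow the Writable pattern (e.g. IntWritable). Here is a standalone imitation of that pattern — `IntBox` is a hypothetical name, not a Hadoop class: it serializes itself compactly via DataOutput/DataInput and can be reused across records.

```java
import java.io.*;

// A sketch of the Writable pattern used by the types in org.apache.hadoop.io.
// This is a standalone imitation, not the real Hadoop IntWritable.
class IntBox {
    private int value;

    IntBox() {}
    IntBox(int value) { this.value = value; }
    int get() { return value; }
    void set(int value) { this.value = value; }

    // Mirrors Writable.write(DataOutput): exactly 4 bytes, no class
    // metadata, unlike java.io.Serializable's heavier stream format.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Mirrors Writable.readFields(DataInput): repopulates this object,
    // so a single instance can be reused for many records.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    static byte[] toBytes(IntBox w) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        w.write(new DataOutputStream(bos));
        return bos.toByteArray();
    }
}
```

The compact, reusable encoding is what makes these types cheaper to ship over the network than standard Java serialization.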

3——Data Flow

A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and the configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. (For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster, or specified when each file is created.)
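Since there is one map task per split, the number of map tasks for a file is just ceiling division of the file size by the split size; a quick sketch:

```java
// Back-of-the-envelope sketch: one map task per input split, where the
// split size defaults to the HDFS block size (64 MB in older Hadoop).
class SplitMath {
    static final long MB = 1024L * 1024L;

    // Ceiling division: e.g. a 200 MB file with 64 MB splits needs
    // 4 map tasks (three full splits plus one 8 MB remainder).
    static long numMapTasks(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }
}
```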

4——HDFS

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.

 HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

5——Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

6——NameNode and DataNode

An HDFS cluster has two types of node: a namenode, which is the master, and a number of datanodes, which act as the workers. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
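The idea that block locations are rebuilt from datanode block reports rather than persisted can be sketched as a toy in-memory map (all names here are illustrative, not Hadoop APIs):

```java
import java.util.*;

// A toy sketch of the namenode's block-location map: it is not stored on
// disk, but rebuilt from the block reports that datanodes send.
class BlockMapSketch {
    // blockId -> datanodes currently reporting that block
    private final Map<String, Set<String>> locations = new HashMap<>();

    // A datanode's block report: "I am storing these blocks."
    void receiveBlockReport(String datanode, List<String> blockIds) {
        for (String id : blockIds) {
            locations.computeIfAbsent(id, k -> new TreeSet<>()).add(datanode);
        }
    }

    // Which datanodes hold this block? Empty until someone reports it.
    Set<String> locate(String blockId) {
        return locations.getOrDefault(blockId, Collections.emptySet());
    }
}
```

After a restart the map starts empty, which is why the namenode must wait for block reports before it can serve block locations.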

The secondary namenode's main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large.

7——Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.
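Both directions can be shown in a minimal round trip: write a structured record's fields to a byte stream, then read them back in the same order. This is illustrative code, not a Hadoop API, but fixed field order on both sides is exactly how Hadoop's Writable types behave.

```java
import java.io.*;

// A sketch of serialization (object -> bytes) and deserialization
// (bytes -> object) for a simple structured record.
class SerDemo {
    static class Record {
        String name;
        int count;
        Record(String name, int count) { this.name = name; this.count = count; }
    }

    // Serialization: fields written in a fixed order, ready for the
    // network or for persistent storage.
    static byte[] serialize(Record r) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(r.name);
        out.writeInt(r.count);
        return bos.toByteArray();
    }

    // Deserialization: fields read back in the same order they were written.
    static Record deserialize(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        return new Record(in.readUTF(), in.readInt());
    }
}
```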

Original post: https://www.cnblogs.com/conie/p/3632429.html