0001. Big Data Course Overview and Background Knowledge


02-01-What Is Big Data

Examples of big data applications:
1. E-commerce recommendation systems
   Storage: how to store a massive number of orders
   Computation: how to run computations over a massive number of orders
2. Weather forecasting
   Storage: how to store massive amounts of weather data
   Computation: how to run computations over massive amounts of weather data
Core problems:
1. Storage: a distributed file system, HDFS (Hadoop Distributed File System)
2. Computation: not an algorithm, but distributed computing: MapReduce and Spark (RDD: Resilient Distributed Dataset)


02-02-Data Warehouses and Big Data

A data warehouse is essentially a database (Oracle, MySQL, MS SQL Server), and it is generally used only for select queries.

Figure: 搭建数据仓库的过程.png (the process of building a data warehouse)

A data warehouse (Data Warehouse) can be built with traditional databases such as Oracle or MySQL, or with Hadoop and Spark.


02-03-OLTP and OLAP

1. OLTP: Online Transaction Processing, i.e. (insert, update, delete) --> transactions; this is the problem that traditional relational databases solve.
2. OLAP: Online Analytical Processing; generally only select queries (analysis) are performed.
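
A minimal sketch of the contrast using plain JDBC; the database URL, credentials, and the orders table are illustrative assumptions, not part of the course environment:

```java
import java.sql.*;

public class OltpVsOlap {
    public static void main(String[] args) throws SQLException {
        // The JDBC URL, user, and password below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password")) {

            // OLTP: a short transaction that changes data (insert/update/delete).
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO orders(order_id, amount) VALUES (?, ?)")) {
                ps.setLong(1, 1001L);
                ps.setBigDecimal(2, new java.math.BigDecimal("99.50"));
                ps.executeUpdate();
            }
            conn.commit();

            // OLAP: a read-only analytical query that scans and aggregates data.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT DATE(create_time) AS day, SUM(amount) " +
                         "FROM orders GROUP BY DATE(create_time)")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getBigDecimal(2));
                }
            }
        }
    }
}
```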


02-04-The Basic Idea of a Distributed File System

  • GFS: Google File System ---- HDFS: Hadoop Distributed File System

    1. A distributed file system
    2. Solves the big data storage problem
    3. In HDFS, the location information of the stored data (the metadata) is recorded -----> using an inverted index
      • What is an index?
        (1) create index: creating an index
        (2) An index is essentially a table of contents
        (3) Through the index you can find the corresponding data
        (4) Question: does an index always speed up queries?
      • What is an inverted index? (see the sketch in 02-06 below)
    4. Demo: using a pseudo-distributed environment as the example
  • MapReduce: a distributed computing model; the problem it originated from is PageRank (web page ranking)

  • BigTable: the "big table" ------ NoSQL database: HBase

Figure: 分布式文件系统的基本思想.png (the basic idea of a distributed file system)


02-05-What Is Rack Awareness

Figure: 机架感知的基本思想.png (the basic idea of rack awareness)
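
A toy sketch of the idea behind rack-aware replica placement (this is not Hadoop's actual implementation; node and rack names are made up): by default HDFS places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in the same rack as the second.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration of rack-aware replica placement (NOT Hadoop's actual code):
 * replica 1 on the writer's node, replica 2 on a node in a different rack,
 * replica 3 on another node in the same rack as replica 2.
 */
public class RackAwareToy {

    static class Node {
        final String host;
        final String rack;
        Node(String host, String rack) { this.host = host; this.rack = rack; }
        @Override public String toString() { return host + "(" + rack + ")"; }
    }

    static List<Node> chooseReplicas(Node writer, List<Node> cluster) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(writer);                               // replica 1: the writer's own node

        Node second = null;
        for (Node n : cluster) {                            // replica 2: any node on a different rack
            if (!n.rack.equals(writer.rack)) { second = n; break; }
        }
        replicas.add(second);

        for (Node n : cluster) {                            // replica 3: another node on replica 2's rack
            if (n != second && n.rack.equals(second.rack)) { replicas.add(n); break; }
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("dn1", "/rack1"), new Node("dn2", "/rack1"),
                new Node("dn3", "/rack2"), new Node("dn4", "/rack2"));
        System.out.println(chooseReplicas(cluster.get(0), cluster));
        // Prints: [dn1(/rack1), dn3(/rack2), dn4(/rack2)]
    }
}
```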


02-06-What Is an Inverted Index

Figure: 什么是索引.png (what an index is)

Figure: 什么是倒排索引.png (what an inverted index is)
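
A minimal sketch of the concept: a forward index maps each document to its words, while an inverted index maps each word to the documents that contain it, so a lookup starts from the word. The document names below are made up for illustration; the contents reuse the data.txt lines from the WordCount demo.

```java
import java.util.*;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Forward view: document -> its content.
        Map<String, String> docs = Map.of(
                "doc1", "I love Beijing",
                "doc2", "I love China",
                "doc3", "Beijing is the capital of China");

        // Inverted index: word -> set of documents containing that word.
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            for (String word : e.getValue().split("\\s+")) {
                inverted.computeIfAbsent(word, k -> new TreeSet<>()).add(e.getKey());
            }
        }

        // A lookup goes from the word straight to the matching documents.
        System.out.println(inverted.get("Beijing"));  // [doc1, doc3]
        System.out.println(inverted);
    }
}
```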


02-07-HDFS Architecture and Demo
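
A minimal sketch of an HDFS client demo in Java, assuming the pseudo-distributed environment from the course with the NameNode reachable at hdfs://demo11:9000 (the address is an assumption and must match fs.defaultFS in core-site.xml):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; hdfs://demo11:9000 is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://demo11:9000");

        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = fs.open(new Path("/input/data.txt"))) {
            // Copy the file content to stdout; 4096 is the copy buffer size.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```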


02-08-What Is PageRank

Figure: Google的向量矩阵.png (Google's vector matrix)
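
For reference, the classic PageRank recurrence behind the figure can be written (with damping factor $d$, commonly 0.85) as:

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

where $N$ is the total number of pages, $M(p_i)$ is the set of pages that link to $p_i$, and $L(p_j)$ is the number of outbound links on page $p_j$. Iterating this formula over the link matrix is what makes PageRank a natural fit for a distributed model such as MapReduce.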


02-09-The MapReduce Programming Model

Figure: MapReduce的编程模型.png (the MapReduce programming model)
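
A minimal sketch of the model in code, roughly following the classic WordCount example that ships with Hadoop (the same job that the hadoop-mapreduce-examples jar runs in the next section): the map phase emits (word, 1) pairs and the reduce phase sums them per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```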


02-10-Demo: Word Count (WordCount)

[root@demo11 ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-resourcemanager-demo11.out
localhost: starting nodemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-nodemanager-demo11.out
[root@demo11 ~]# jps
16164 ResourceManager
16596 Jps
15976 SecondaryNameNode
15772 DataNode
15661 NameNode
16271 NodeManager
[root@demo11 ~]# hdfs dfs -ls /input
Found 3 items
-rw-r--r--	1 root supergroup		 204	2018-08-14	11:18	/input/a.xml
-rw-r--r--	1 root supergroup		  60	2018-08-13	23:48	/input/data.txt
-rw-r--r--	1 root supergroup	30826876	2018-08-17	10:19	/input/sales
[root@demo11 ~]# hdfs dfs -cat /input/data.txt
I love Beijing
I love China
Beijing is the capital of China
[root@demo11 ~]# cd training/hadoop-2.7.3/share/hadoop/mapreduce/
[root@demo11 mapreduce]# pwd
/root/training/hadoop-2.7.3/share/hadoop/mapreduce
[root@demo11 mapreduce]# ls hadoop-mapreduce-examples-2.7.3.jar
hadoop-mapreduce-examples-2.7.3.jar
[root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /input/data.txt /output/day0829/wc1

The execution of the job can be monitored at http://192.168.157.11:8088/cluster (the YARN web console).

The YARN web console:

![](0001.大数据课程概述与大数据背景知识.assets/web console.png)

[root@demo11 mapreduce]# hdfs dfs -ls /output/day0829/wc1
Found 2 items
-rw-r--r--   1 root supergroup   0 2018-08-29 20:57 /output/day0829/wc1/_SUCCESS
-rw-r--r--   1 root supergroup  55 2018-08-29 20:57 /output/day0829/wc1/part-r-00000
[root@demo11 mapreduce]# hdfs dfs -cat /output/day0829/wc1/part-r-00000
Beijing	2
China	2
I		2
capital	1
is		1
love	2
of		1
the		1

02-11-BigTable (Big Table)

Figure: Oracle表结构和HBase的表结构.png (Oracle table structure vs. HBase table structure)
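
A minimal sketch of the HBase data model using the Java client API; the table name ("students"), column family ("info"), and ZooKeeper address are illustrative assumptions. Unlike an Oracle table, values are stored in cells addressed by (row key, column family, column qualifier):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // The ZooKeeper quorum address is a placeholder for the demo machine.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "demo11");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("students"))) {

            // A row is identified by its row key; values live in
            // (column family, column qualifier) cells.
            Put put = new Put(Bytes.toBytes("stu001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Tom"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("20"));
            table.put(put);

            // Read the row back by its row key.
            Result result = table.get(new Get(Bytes.toBytes("stu001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```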

Original article: https://www.cnblogs.com/RoyalGuardsTomCat/p/13825013.html