0001. Big Data Course Overview and Background Knowledge


02-01-What Is Big Data

Examples of big data applications:
1. E-commerce recommendation systems
   Storage: how to store a massive number of orders
   Computation: how to run computations over a massive number of orders
2. Weather forecasting
   Storage: how to store massive amounts of weather data
   Computation: how to run computations over massive amounts of weather data
Core problems:
1. Storage: a distributed file system, HDFS (Hadoop Distributed File System)
2. Computation: not an algorithm, but distributed computing: MapReduce and Spark (RDD: Resilient Distributed Dataset)


02-02-Data Warehouses and Big Data

A data warehouse is essentially a database (Oracle, MySQL, MS SQL Server), and it is generally used only for select queries.

Figure: 搭建数据仓库的过程.png (the process of building a data warehouse)

A data warehouse (Data Warehouse) can be built with traditional databases such as Oracle or MySQL, or with Hadoop and Spark.


02-03-OLTP and OLAP

1. OLTP: Online Transaction Processing, i.e. (insert, update, delete) --> transactions; this is the problem that traditional relational databases solve.
2. OLAP: Online Analytical Processing; generally only select queries (analysis) are performed.
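
A minimal sketch of the contrast using plain JDBC; the database URL, credentials, and the orders table are illustrative assumptions, not part of the course environment:

```java
import java.sql.*;

public class OltpVsOlap {
    public static void main(String[] args) throws SQLException {
        // The JDBC URL, user, and password below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password")) {

            // OLTP: a short transaction that changes data (insert/update/delete).
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO orders(order_id, amount) VALUES (?, ?)")) {
                ps.setLong(1, 1001L);
                ps.setBigDecimal(2, new java.math.BigDecimal("99.50"));
                ps.executeUpdate();
            }
            conn.commit();

            // OLAP: a read-only analytical query that scans and aggregates data.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT DATE(create_time) AS day, SUM(amount) " +
                         "FROM orders GROUP BY DATE(create_time)")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getBigDecimal(2));
                }
            }
        }
    }
}
```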


02-04-The Basic Idea of a Distributed File System

  • GFS: Google File System ---- HDFS: Hadoop Distributed File System

    1. A distributed file system
    2. Solves the big data storage problem
    3. In HDFS, the location information of the stored data (the metadata) is recorded -----> using an inverted index
      • What is an index?
        (1) create index: creating an index
        (2) An index is essentially a table of contents
        (3) Through the index you can find the corresponding data
        (4) Question: does an index always speed up queries?
      • What is an inverted index? (see the sketch in 02-06 below)
    4. Demo: using a pseudo-distributed environment as the example
  • MapReduce: a distributed computing model; the problem it originated from is PageRank (web page ranking)

  • BigTable: the "big table" ------ NoSQL database: HBase

Figure: 分布式文件系统的基本思想.png (the basic idea of a distributed file system)


02-05-What Is Rack Awareness

Figure: 机架感知的基本思想.png (the basic idea of rack awareness)
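
A toy sketch of the idea behind rack-aware replica placement (this is not Hadoop's actual implementation; node and rack names are made up): by default HDFS places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in the same rack as the second.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration of rack-aware replica placement (NOT Hadoop's actual code):
 * replica 1 on the writer's node, replica 2 on a node in a different rack,
 * replica 3 on another node in the same rack as replica 2.
 */
public class RackAwareToy {

    static class Node {
        final String host;
        final String rack;
        Node(String host, String rack) { this.host = host; this.rack = rack; }
        @Override public String toString() { return host + "(" + rack + ")"; }
    }

    static List<Node> chooseReplicas(Node writer, List<Node> cluster) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(writer);                               // replica 1: the writer's own node

        Node second = null;
        for (Node n : cluster) {                            // replica 2: any node on a different rack
            if (!n.rack.equals(writer.rack)) { second = n; break; }
        }
        replicas.add(second);

        for (Node n : cluster) {                            // replica 3: another node on replica 2's rack
            if (n != second && n.rack.equals(second.rack)) { replicas.add(n); break; }
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("dn1", "/rack1"), new Node("dn2", "/rack1"),
                new Node("dn3", "/rack2"), new Node("dn4", "/rack2"));
        System.out.println(chooseReplicas(cluster.get(0), cluster));
        // Prints: [dn1(/rack1), dn3(/rack2), dn4(/rack2)]
    }
}
```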


02-06-What Is an Inverted Index

Figure: 什么是索引.png (what an index is)

Figure: 什么是倒排索引.png (what an inverted index is)
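
A minimal sketch of the concept: a forward index maps each document to its words, while an inverted index maps each word to the documents that contain it, so a lookup starts from the word. The document names below are made up for illustration; the contents reuse the data.txt lines from the WordCount demo.

```java
import java.util.*;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Forward view: document -> its content.
        Map<String, String> docs = Map.of(
                "doc1", "I love Beijing",
                "doc2", "I love China",
                "doc3", "Beijing is the capital of China");

        // Inverted index: word -> set of documents containing that word.
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            for (String word : e.getValue().split("\\s+")) {
                inverted.computeIfAbsent(word, k -> new TreeSet<>()).add(e.getKey());
            }
        }

        // A lookup goes from the word straight to the matching documents.
        System.out.println(inverted.get("Beijing"));  // [doc1, doc3]
        System.out.println(inverted);
    }
}
```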


02-07-HDFS Architecture and Demo
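
A minimal sketch of an HDFS client demo in Java, assuming the pseudo-distributed environment from the course with the NameNode reachable at hdfs://demo11:9000 (the address is an assumption and must match fs.defaultFS in core-site.xml):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; hdfs://demo11:9000 is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://demo11:9000");

        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = fs.open(new Path("/input/data.txt"))) {
            // Copy the file content to stdout; 4096 is the copy buffer size.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```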


02-08-What Is PageRank

Figure: Google的向量矩阵.png (Google's vector matrix)
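
For reference, the classic PageRank recurrence behind the figure can be written (with damping factor $d$, commonly 0.85) as:

$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

where $N$ is the total number of pages, $M(p_i)$ is the set of pages that link to $p_i$, and $L(p_j)$ is the number of outbound links on page $p_j$. Iterating this formula over the link matrix is what makes PageRank a natural fit for a distributed model such as MapReduce.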


02-09-The MapReduce Programming Model

Figure: MapReduce的编程模型.png (the MapReduce programming model)
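
A minimal sketch of the model in code, roughly following the classic WordCount example that ships with Hadoop (the same job that the hadoop-mapreduce-examples jar runs in the next section): the map phase emits (word, 1) pairs and the reduce phase sums them per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```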


02-10-Demo: Word Count (WordCount)

[root@demo11 ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-resourcemanager-demo11.out
localhost: starting nodemanager, logging to /root/training/hadoop-2.7.3/logs/yarn-root-nodemanager-demo11.out
[root@demo11 ~]# jps
16164 ResourceManager
16596 Jps
15976 SecondaryNameNode
15772 DataNode
15661 NameNode
16271 NodeManager
[root@demo11 ~]# hdfs dfs -ls /input
Found 3 items
-rw-r--r--	1 root supergroup		 204	2018-08-14	11:18	/input/a.xml
-rw-r--r--	1 root supergroup		  60	2018-08-13	23:48	/input/data.txt
-rw-r--r--	1 root supergroup	30826876	2018-08-17	10:19	/input/sales
[root@demo11 ~]# hdfs dfs -cat /input/data.txt
I love Beijing
I love China
Beijing is the capital of China
[root@demo11 ~]# cd training/hadoop-2.7.3/share/hadoop/mapreduce/
[root@demo11 mapreduce]# pwd
/root/training/hadoop-2.7.3/share/hadoop/mapreduce
[root@demo11 mapreduce]# ls hadoop-mapreduce-examples-2.7.3.jar
hadoop-mapreduce-examples-2.7.3.jar
[root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[root@demo11 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /input/data.txt /output/day0829/wc1

The execution of the job can be monitored at http://192.168.157.11:8088/cluster (the YARN web console).

The YARN web console:

![](0001.大数据课程概述与大数据背景知识.assets/web console.png)

[root@demo11 mapreduce]# hdfs dfs -ls /output/day0829/wc1
Found 2 items
-rw-r--r--   1 root supergroup   0 2018-08-29 20:57 /output/day0829/wc1/_SUCCESS
-rw-r--r--   1 root supergroup  55 2018-08-29 20:57 /output/day0829/wc1/part-r-00000
[root@demo11 mapreduce]# hdfs dfs -cat /output/day0829/wc1/part-r-00000
Beijing	2
China	2
I		2
capital	1
is		1
love	2
of		1
the		1

02-11-BigTable (Big Table)

Figure: Oracle表结构和HBase的表结构.png (Oracle table structure vs. HBase table structure)
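
A minimal sketch of the HBase data model using the Java client API; the table name ("students"), column family ("info"), and ZooKeeper address are illustrative assumptions. Unlike an Oracle table, values are stored in cells addressed by (row key, column family, column qualifier):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // The ZooKeeper quorum address is a placeholder for the demo machine.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "demo11");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("students"))) {

            // A row is identified by its row key; values live in
            // (column family, column qualifier) cells.
            Put put = new Put(Bytes.toBytes("stu001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Tom"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("20"));
            table.put(put);

            // Read the row back by its row key.
            Result result = table.get(new Get(Bytes.toBytes("stu001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```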

Original article: https://www.cnblogs.com/RoyalGuardsTomCat/p/13825013.html