Data lake: a new-style data warehouse

Data lake - Wikipedia https://en.wikipedia.org/wiki/Data_lake

Data lake

Introduction to Azure Data Lake Storage Gen2 (preview) | Microsoft Docs https://docs.microsoft.com/zh-cn/azure/storage/data-lake-storage/introduction


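Azure Data Lake Storage Gen2 is a highly scalable, cost-effective data lake solution for big data analytics. It combines massive scale and cost efficiency with the capabilities of a high-performance file system, helping to shorten the time to insight. Data Lake Storage Gen2 extends Azure Blob Storage and is optimized for analytics workloads. Once data is stored, it can be accessed through the existing Blob Storage interface and through an HDFS-compatible file system interface, without changing programs or copying data. Microsoft describes Data Lake Storage Gen2 as the most comprehensive data lake available.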
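As a rough illustration of "two interfaces, one copy of the data", the sketch below builds the two addresses under which a single stored object would typically be reachable: a Blob Storage HTTPS URL and an HDFS-compatible URI in the form used by the Hadoop ABFS driver. The function name and the account, container, and path values are hypothetical placeholders; the code only formats strings and does not contact any service.

```python
from urllib.parse import quote


def storage_uris(account: str, container: str, path: str) -> dict:
    """Return two addresses for the same stored object.

    All argument values are hypothetical placeholders; nothing is sent
    over the network.
    """
    # Percent-encode unsafe characters but keep "/" as the path separator.
    clean_path = quote(path.lstrip("/"))
    return {
        # Blob Storage endpoint (object-store view of the data).
        "blob": f"https://{account}.blob.core.windows.net/{container}/{clean_path}",
        # HDFS-compatible URI as used by the Hadoop ABFS driver (file-system view).
        "abfs": f"abfss://{container}@{account}.dfs.core.windows.net/{clean_path}",
    }


if __name__ == "__main__":
    for name, uri in storage_uris("myaccount", "raw", "logs/2018/app.log").items():
        print(name, uri)
```

Both strings point at the same object, which is why an analytics job can read it through the file system interface without the data being copied first.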

 

Advanced analytics on big data

 

Real-time analytics

Data lake

From Wikipedia, the free encyclopedia
 
 

A data lake is a system or repository of data stored in its natural format,[1] usually object blobs or files. A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).[2]
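The four kinds of data listed above can be made concrete with a small sketch that tags objects in a lake by file extension while leaving the files themselves in their natural format. The extension-to-category mapping is an assumption made for illustration (for example, treating a .parquet file as an export of relational tables); real lakes usually rely on a metadata catalog rather than on file names.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the categories named in the text.
CATEGORY_BY_EXTENSION = {
    ".parquet": "structured",      # assumed export of relational tables
    ".csv": "semi-structured",
    ".log": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".eml": "unstructured",        # email message
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".jpg": "binary",
    ".mp3": "binary",
    ".mp4": "binary",
}


def categorize(object_key: str) -> str:
    """Classify one object key by its (case-insensitive) extension."""
    suffix = PurePosixPath(object_key).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")


def summarize(object_keys) -> Counter:
    """Count how many objects fall into each category."""
    return Counter(categorize(key) for key in object_keys)


if __name__ == "__main__":
    keys = [
        "raw/crm/customers.parquet",
        "raw/web/access.log",
        "raw/web/events.json",
        "raw/mail/inquiry.eml",
        "raw/media/intro.mp4",
    ]
    print(summarize(keys))
```

Objects with an unrecognized extension are reported as "unknown" rather than rejected, since the lake accepts data without requiring a schema up front.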

A data swamp is a deteriorated data lake that is either inaccessible to its intended users or provides little value.[3][4]

Background

James Dixon, then chief technology officer at Pentaho, allegedly coined the term[5] to contrast it with the data mart, which is a smaller repository of interesting attributes derived from raw data.[6] In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers said that data lakes could "put an end to data silos."[7] In its study on data lakes, the firm noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository." Hortonworks, Google, Oracle, Microsoft, Zaloni, Teradata, Cloudera, and Amazon now all have data lake offerings.[8]

Examples

One example of technology used to host a data lake is the distributed file system used in Apache Hadoop. Many companies also use cloud storage services such as Azure Data Lake and Amazon S3.[9] Academic interest in the concept is growing gradually. For instance, Personal DataLake[10] at Cardiff University is a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.[11] Early data lakes built on Hadoop 1.0 had limited capabilities, because batch-oriented processing with MapReduce was the only processing paradigm available. Interacting with the data lake required expertise in Java and MapReduce, or in higher-level tools such as Apache Pig and Apache Hive (which were themselves batch-oriented).
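To show what the batch-oriented MapReduce paradigm looks like, here is a minimal in-memory word count written in plain Python. It is a sketch of the map, shuffle, and reduce steps only; it does not use Hadoop or any Hadoop API, and the function names are chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the whole batch: the result exists only after all input is read."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["data lake", "Data swamp lake"]))
```

The whole input has to pass through all three steps before any answer is available, which is the batch-oriented limitation the paragraph above describes.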

Criticism

In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".[12] PricewaterhouseCoopers was also careful to note in its research that not all data lake initiatives are successful. The report quotes Sean Martin, CTO of Cambridge Semantics:

"We see customers creating big data graveyards, dumping everything into HDFS [Hadoop Distributed File System] and hoping to do something with it down the road. But then they just lose track of what’s there.
The main challenge is not creating a data lake, but taking advantage of the opportunities it presents."[7]

PricewaterhouseCoopers describes companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization. Another criticism is that the concept is fuzzy and arbitrary. The term has been applied to any tool or data management practice that does not fit the traditional data warehouse architecture: it has been equated with a technology such as Hadoop, labeled a raw data reservoir or a hub for ETL offload, and defined as a central hub for self-service analytics. The concept has been overloaded with meanings, which puts the usefulness of the term into question.[13]

 
 
Original article: https://www.cnblogs.com/rsapaper/p/9915365.html