Google's New Dremel System Makes Big Data Processing Easier

Translator: psychocoder

Mike Olson is one of the main brains behind the Hadoop movement. But even he looks toward the new breed of “Big Data” software used inside Google. Photo: Wired.com/Jon Snyder

Mike Olson runs a company that specializes in the world’s hottest software. He’s the CEO of Cloudera, a Silicon Valley startup that deals in Hadoop, an open source software platform based on tech that turned Google into the most dominant force on the web.

Hadoop is expected to fuel an $813 million software market by the year 2016. But even Olson says it’s already old news.

Hadoop sprang from two research papers Google published in late 2003 and 2004. One described the Google File System, a way of storing massive amounts of data across thousands of dirt-cheap computer servers, and the other detailed MapReduce, which pooled the processing power inside all those servers and crunched all that data into something useful. Eight years later, Hadoop is widely used across the web, for data analysis and all sorts of other number-crunching tasks. But Google has moved on.
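
To make the MapReduce idea concrete, here is a minimal single-machine sketch of the pattern: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. The function names and the word-count task are illustrative assumptions, not code from Google's paper or from Hadoop, and a real deployment would spread each phase across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input: two tiny "documents".
counts = word_count(["big data big", "data tools"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle be split across machines, which is the property the paragraph above describes.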

In 2009, the web giant started replacing GFS and MapReduce with new technologies, and Mike Olson will tell you that these technologies are where the world is going. “If you want to know what the large-scale, high-performance data processing infrastructure of the future looks like, my advice would be to read the Google research papers that are coming out right now,” Olson said during a recent panel discussion alongside Wired.

Since the rise of Hadoop, Google has published three particularly interesting papers on the infrastructure that underpins its massive web operation. One details Caffeine, the software platform that builds the index for Google’s web search engine. Another shows off Pregel, a “graph database” designed to map the relationships between vast amounts of online information. But the most intriguing paper is the one that describes a tool called Dremel.

“If you had told me beforehand what Dremel claims to do, I wouldn’t have believed you could build it,” says Armando Fox, a professor of computer science at the University of California, Berkeley who specializes in these sorts of data-center-sized software platforms.

Dremel is a way of analyzing information. Running across thousands of servers, it lets you “query” large amounts of data, such as a collection of web documents or a library of digital books or even the data describing millions of spam messages. This is akin to analyzing a traditional database using SQL, the Structured Query Language that has been widely used across the software world for decades. If you have a collection of digital books, for instance, you could run an ad hoc query that gives you a list of all the authors — or a list of all the authors who cover a particular subject.
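
As a rough illustration of what such ad hoc queries look like, the sketch below runs the two book queries from the paragraph above against a small in-memory SQLite database. This is ordinary SQL on a single machine, not Dremel's own query language, and the `books` table, its columns, and its rows are invented for the example.

```python
import sqlite3

# Hypothetical schema and data, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, subject TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Title A", "Author One", "physics"),
        ("Title B", "Author Two", "history"),
        ("Title C", "Author One", "history"),
    ],
)

# Ad hoc query 1: a list of all the authors.
all_authors = [
    row[0]
    for row in conn.execute("SELECT DISTINCT author FROM books ORDER BY author")
]

# Ad hoc query 2: the authors who cover a particular subject.
physics_authors = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT author FROM books WHERE subject = ? ORDER BY author",
        ("physics",),
    )
]

print(all_authors)
print(physics_authors)
```

The point of the comparison is that the person asking the question writes only the query text; no custom program has to be built for each new question.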

“You have a SQL-like language that makes it very easy to formulate ad hoc queries or recurring queries — and you don’t have to do any programming. You just type the query into a command line,” says Urs Hölzle, the man who oversees the Google infrastructure.

The difference is that Dremel can handle web-sized amounts of data at blazing fast speed. According to Google’s paper, you can run queries on multiple petabytes — millions of gigabytes — in a matter of seconds.

Hadoop already provides tools for running SQL-like queries on large datasets. Sister projects such as Pig and Hive were built for this very reason. But with Hadoop, there’s lag time. It’s a “batch processing” platform. You give it a task. It takes a few minutes to run the task, or a few hours. And then you get the result. But Dremel was specifically designed for instant queries.

“Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time,” reads Google’s Dremel paper. Hölzle says it can run a query on a petabyte of data in about three seconds.
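
To put the petabyte-in-three-seconds figure in perspective, here is a back-of-the-envelope calculation. It uses decimal units, and the server count of 3,000 is a hypothetical stand-in for the article's "thousands" of servers, not a number Google has given.

```python
PETABYTE = 10**15  # bytes, decimal definition
GIGABYTE = 10**9   # bytes, decimal definition

# "Millions of gigabytes": how many gigabytes make up one petabyte.
gigabytes_per_petabyte = PETABYTE // GIGABYTE

# Aggregate rate implied by querying one petabyte in about three seconds.
query_seconds = 3
aggregate_bytes_per_second = PETABYTE / query_seconds

# Assumption: 3,000 servers share the work evenly (hypothetical figure).
assumed_servers = 3000
per_server_gb_per_second = aggregate_bytes_per_second / assumed_servers / GIGABYTE

print(gigabytes_per_petabyte)
print(round(per_server_gb_per_second, 1))
```

The per-server figure depends entirely on the assumed server count. One plausible reading is that a query touches only part of the stored bytes rather than scanning all of them, though the article itself does not say how the speed is achieved.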

According to Armando Fox, this is unprecedented. Hadoop is the centerpiece of the “Big Data” movement, a widespread effort to build tools that can analyze extremely large amounts of information. But with today’s Big Data tools, there’s often a drawback. You can’t quite analyze the data with the speed and precision you expect from traditional data analysis or “business intelligence” tools. But with Dremel, Fox says, you can.

“They managed to combine large-scale analytics with the ability to really drill down into the data, and they’ve done it in a way that I wouldn’t have thought was possible,” he says. “The size of the data and the speed with which you can comfortably explore the data is really impressive. People have done Big Data systems before, but before Dremel, no one had really done a system that was that big and that fast.

“Usually, you have to do one or the other. The more you do one, the more you have to give up on the other. But with Dremel, they did both.”

According to Google’s paper, the platform has been used inside Google since 2006, with “thousands” of Googlers using it to analyze everything from the software crash reports for various Google services to the behavior of disks inside the company’s data centers. Sometimes, the tool is used with tens of servers, sometimes with thousands.

Despite Hadoop’s undoubted success, Cloudera’s Mike Olson says that the companies and developers who built the platform were rather slow off the blocks. And we’re seeing the same thing with Dremel. Google published the Dremel paper in 2010, but we’re still a long way from seeing the platform mimicked by developers outside the company. A team of Israeli engineers is building a clone they called OpenDremel, though one of these developers, David Gruzman, tells us that coding is only just beginning again after a long hiatus.

Mike Miller — an affiliate professor of particle physics at the University of Washington and the chief scientist of Cloudant, a company that’s tackling many of the same data problems Google has faced over the years — is amazed we haven’t seen some big-name venture capitalist fund a startup dedicated to reverse-engineering Dremel.

That said, you can use Dremel today — even if you’re not a Google engineer. Google now offers a Dremel web service it calls BigQuery. You can use the platform via an online API, or application programming interface. Basically, you upload your data to Google, and it lets you run queries on its internal infrastructure.

This is part of a growing number of cloud services offered by the company. First, it let you build, run, and host entire applications atop its infrastructure using a service called Google App Engine, and now it offers various other utilities that run atop this same infrastructure, including BigQuery and the Google Compute Engine, which serves up instant access to virtual servers.

The rest of the world may lag behind Google. But Google is bringing itself to the rest of the world.

Cade Metz

Cade Metz is the editor of Wired Enterprise. Got a NEWS TIP related to this story -- or to anything else in the world of big tech? Please e-mail him: cade_metz at wired.com.

Original URL: https://www.cnblogs.com/renly/p/2874213.html