别老扯什么Hadoop了，你的数据根本不够大

本文原名“Don't use Hadoop when your data isn't that big ”，出自有着多年从业经验的数据科学家Chris Stucchio，纽约大学柯朗研究所博士后，搞过高频交易平台，当过创业公司的CTO，更习惯称自己为统计学者。对了，他现在自己创业，提供数据分析、推荐优化咨询服务，他的邮件是：stucchio@gmail.com 。

“你有多少大数据和Hadoop的经验？”他们问我。我一直在用Hadoop，但很少处理几TB以上的任务。我基本上只是一个大数据新手——知道概念，写过代码，但是没有大规模经验。

接下来他们会问：“你能用Hadoop做简单的group by和sum操作吗？”我当然会，但我会说需要看看具体文件格式。

他们给我一个U盘，里面有所有的数据，600MB，对，他们所有的数据。不知道为什么，我用pandas.read_csv（Pandas是一种Python数据分析库）而不是Hadoop完成了这个任务后，他们显得很不满意。

Hadoop其实是挺局限的。它无非是运行某个通用的计算，用SQL伪代码表示就是： SELECT G(...) FROM table GROUP BY F(...) 你只能改变G和F操作，除非要在中间步骤做性能优化（这可不怎么好玩！）。其他一切都是死的。

（关于MapReduce，之前作者写过一篇“41个词讲清楚MapReduce”，可以参考。）

Hadoop里，所有计算都必须按照一个map、一个group by、一个aggregate或者这种计算序列来写。这和穿上紧身衣一样，多憋得慌啊。许多计算用其他模型其实更适合。忍受紧身衣的唯一原因就是，可以扩展到极大极大的数据集。可你的数据集实际上很可能根本远远够不上那个数量级。

可是呢，因为Hadoop和大数据是热词，世界有一半的人都想穿上紧身衣，即使他们根本不需要。

可我的数据有好几百MB呢！Excel都装不下

对Excel很大可不是什么大数据。有很多好工具——我喜欢用的是基于Numpy的Pandas。它可以将几百MB数据以高效的向量化格式加载到内存，在我已经3年的老笔记本上，一眨眼的功夫，Numpy就能完成1亿次浮点计算。Matlab和R也是很棒的工具。

数百MB数据一般用一个简单的Python脚本逐行读取文件、处理，然后写到了一个文件就行了。

可我的数据有10G呢！

我刚买了一台笔记本电脑。16G内存花了141.98美元，256GB SSD多收200美元。另外，如果在Pandas里加载一个10GB的csv文件，实际在内存里并没有那么大——你可以将 “17284932583” 这样的数值串存为4位或者8位整数，“284572452.2435723”存为8位双精度。

最差情况下，你还可以不同时将所有数据都一次加载到内存里。

可我的数据有100GB/500GB/1TB！

一个2T的硬盘才94.99美元，4T是169.99。买一块，加到桌面电脑或者服务器上，然后装上PostgreSQL。

Hadoop的适用范围远小于SQL和Python脚本

从计算的表达能力来说，Hadoop比SQL差多了。Hadoop里能写的计算，在SQL或者简单的Python脚本都可以更轻松地写出来。

SQL是直观的查询语言，没有太多抽象，业务分析师和程序员都很常用。SQL查询往往非常简单，而且一般也很快——只要数据库正确地做了索引，要花几秒钟的查询都不太多见。

Hadoop没有任何索引的概念，它只知道全表扫描。而且Hadoop抽象层次太多了——我之前的项目尽在应付Java内存错误、内存碎片和集群竞用了，实际的数据分析工作反而没了时间。

如果你的数据结构不是SQL表的形式（比如纯文本、JSON、二进制），一般写一小段Python或者Ruby脚本按行处理更直接。保存在多个文件里，逐个处理即可。SQL不适用的情况下，从编程来说Hadoop也没那么糟糕，但相比Python脚本仍然没有什么优势。

除了难以编程，Hadoop还一般总是比其他技术方案要慢。只要索引用得好，SQL查询非常快。比如要计算join，PostgreSQL只需查看索引（如果有），然后查询所需的每个键。而Hadoop呢，必须做全表扫描，然后重排整个表。排序通过多台机器之间分片可以加速，但也带来了跨多机数据流处理的开销。如果要处理二进制文件，Hadoop必须反复访问namenode。而简单的Python脚本只要反复访问文件系统即可。

可我的数据超过了5TB！

你的命可真苦——只能苦逼地折腾Hadoop了，没有太多其他选择（可能还能用许多硬盘容量的高富帅机器来扛），而且其他选择往往贵得要命（脑海中浮现出IOE等等字样……）。

用Hadoop唯一的好处是扩展。如果你的数据是一个数TB的单表，那么全表扫描是Hadoop的强项。此外的话，请关爱生命，尽量远离Hadoop。它带来的烦恼根本不值，用传统方法既省时又省力。

附注：Hadoop也是不错的工具

我可不是成心黑Hadoop啊。其实我自己经常用Hadoop来完成其他工具无法轻易完成的任务。（我推荐使用Scalding，而不是Hive或者Pig，因为你可以用Scala语言来写级联Hadoop任务，隐藏了MapReduce底层细节。）我本文要强调的是，用Hadoop之前应该三思而行，别500MB数据这样的蚊子，你也拿Hadoop这样的大炮来轰。

Chris从数据体积上分析了你的数据是否称得上大数据，是否真的需要使用大数据技术，然而衡量大数据的因素还有Velocity、Variety以及Value，下面我们就一起看MongoDB分享的“大数据除大以外的东西”，下为译文：

MongoHQ：不要因为大数据背后的利益而贬低其他途径

“大数据”，套用《银河系漫游指南》里的经典语录就是“is Big. You won’t believe how vastly, hugely, mind-bogglingly big it is. I mean you may think there’s a lot of data in Wikipedia but that’s just peanuts to Big Data”。这也是许多人在碰到大数据时走入的误区——他们首先假设自己必须使用大数据技术处理，然而我们离大数据还差很远，那么大数据是如何得来的？

回溯20世纪90年代，人们认识到数字化的存储数据比用纸要廉价的多，当一个东西便宜到一定的地步时，它就成为一个必然的选项。人类就会出于本能的去储存所有数据，因为“未来我们可能需要它们”，而且储存已经这么便宜了，为什么不做呢？

而从1990年美国科学家一篇名为 “Saving All The Bits”的文章中发现，那个时候科学家已经不得不面对保存所有数据的挑战，Peter Denning解释了NASA保存所有哈勃太空望远镜产生数据面临的挑战：该设备每天产生的数据需要2500张光盘来存放，这个速度不仅淹没了网络和存储设备的性能，同样还超出了“人类的理解能力”。但是请不要忽视一点，随着储存技术和经济状况的发展，这2500张光盘只等价于当下100美元左右的硬盘，而且我们似乎也并不需要储存一个太空望远镜产生的如此大量数据。

大数据的有限价值

今天我们几乎可以存储任何具有业务目的明显的数据，比如信用卡销售及问卷调查。同时，我们还可以存储所有业务目的不明显的数据，比如：用户在一个网页上的行为、电缆接线盒中用户观看的TV频道、借助物理网开关灯或者门的行为。但是从价值上看，后一类行为的价值无疑很低。

一笔信用卡交易包含了很多数据，比如：人的信息、地理位置、价值等。在销售周期中，你会很自然的捕捉这些数据。然而用户在一个网站上产生的行为显然不会那么有价值，你可能收集到用户访问的URL、阅读某个页面花费的时间，但是这些记录的价值显然不如信用卡交易那么丰富。当然如果你要给你的用户分类时，这些记录还是拥有一定价值的。

然而当下存储的成本已经越来越少了，你的数据越多，你就可以从数据分析趋势中获得更多的价值。每条TV频道转换的信息确实无关紧要，但是如果你把这些数据与调度机广告数据放到一起将其视为一个聚合数据集，你将可以清楚的知晓用户的行为，这些数据将给广告者和程序设计人员提供有价值的见解。

同样，智能家庭系统中收集到的信息价值就更低了，你可能只会得到一些事件和状态信息，同时系统可能产生大量的数据，价值必须通过大量的筛选、过滤等处理才能体现。大数据最大的挑战就是从大量的碎片项中获取信息，也可能是使用许多具有丰富价值的数据做依托，然后从中剥丝抽茧，寻找真知。需要注意的是，这并不是大海捞针，而是从一堆针中给一些针定性。

Hot Data vs. Big Data

造成需要大数据的原因是，你不仅拥有大量的数据，同样拥有大量访问这些数据的请求，而Big Data看起来能满足这个需求。

BigData的数据更倾向于冷数据，也就是你不会经常访问的数据，除了分析之外可能不会再次被使用。它可能很快被新鲜的冷数据代替，而新的冷数据又会产生新的分析，但是Big Data的范围需要与热数据分开，因为将两个需求混合得到的结果必然低于预期，这样一来冷数据与热数据的分析必然都差强人意。无论如何区分冷热数据都是个好的思想，不管是存储还是应用程序都应该区别对待。但是总有一些人不分场景为用户提供Big Data这个“仙丹”。

因此，请重视你的数据，分清楚数据的类型，以业务为需求，不必要将所有的数据混合到一起去打造1个大数据。

数据科学家Chris Stucchio英文原文如下：

Don't use Hadoop - your data isn't that big

"So, how much experience do you have with Big Data and Hadoop?" they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I'm basically a big data neophite - I know the concepts, I've written code, but never at scale.

The next question they asked me. "Could you use Hadoop to do a simple group by and sum?" Of course I could, and I just told them I needed to see an example of the file format.

They handed me a flash drive with all 600MB of their data on it (not a sample, everything). For reasons I can't understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop.

Hadoop is limiting. Hadoop allows you to run one general computation, which I'll illustrate in pseudocode:

Scala-ish pseudocode:

collection.flatMap( (k,v) => F(k,v) ).groupBy( _._1 ).map( _.reduce( (k,v) => G(k,v) ) )

SQL-ish pseudocode:

SELECT G(...) FROM table GROUP BY F(...)

Or, as I explained a couple of years ago:

Goal: count the number of books in the library.

Map: You count up the odd-numbered shelves, I count up the even numbered shelves. (The more people we get, the faster this part goes. )

Reduce: We all get together and add up our individual counts.

The only thing you are permitted to touch is F(k,v) and G(k,v), except of course for performance optimizations (usually not the fun kind!) at intermediate steps. Everything else is fixed.

It forces you to write every computation in terms of a map, a group by, and an aggregate, or perhaps a sequence of such computations. Running computations in this manner is a straightjacket, and many calculations are better suited to some other model. The only reason to put on this straightjacket is that by doing so, you can scale up to extremely large data sets. Most likely your data is orders of magnitude smaller.

But because "Hadoop" and "Big Data" are buzzwords, half the world wants to wear this straightjacket even if they don't need to.

But my data is hundreds of megabytes! Excel won't load it.

Too big for Excel is not "Big Data". There are excellent tools out there - my favorite is Pandas which is built on top of Numpy. You can load hundreds of megabytes into memory in an efficient vectorized format. On my 3 year old laptop, it takes numpy the blink of an eye to multiply 100,000,000 floating point numbers together. Matlab and R are also excellent tools.

Hundreds of megabytes is also typically amenable to a simple python script that reads your file line by line, processes it, and writes to another file.

But my data is 10 gigabytes!

I just bought a new laptop. The 16GB of ram I put in cost me $141.98 and the 256gb SSD was $200 extra (preinstalled by Lenovo). Additionally, if you load a 10 GB csv file into Pandas, it will often be considerably smaller in memory - the result of storing the numerical string "17284932583" as a 4 or 8 byte integer, or storing "284572452.2435723" as an 8 byte double.

Worst case, you might actually have to not load everything into ram simultaneously.

But my data is 100GB/500GB/1TB!

A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and stick it in a desktop computer or server. Then install Postgres on it.

Hadoop << SQL, Python Scripts

In terms of expressing your computations, Hadoop is strictly inferior to SQL. There is no computation you can write in Hadoop which you cannot write more easily in either SQL, or with a simple Python script that scans your files.

SQL is a straightforward query language with minimal leakage of abstractions, commonly used by business analysts as well as programmers. Queries in SQL are generally pretty simple. They are also usually very fast - if your database is properly indexed, multi-second queries will be uncommon.

Hadoop does not have any conception of indexing. Hadoop has only full table scans. Hadoop is full of leaky abstractions - at my last job I spent more time fighting with java memory errors, file fragmentation and cluster contention than I spent actually worrying about the mostly straightforward analysis I wanted to perform.

If your data is not structured like a SQL table (e.g., plain text, json blobs, binary blobs), it's generally speaking straightforward to write a small python or ruby script to process each row of your data. Store it in files, process each file, and move on. Under circumstances where SQL is a poor fit, Hadoop will be less annoying from a programming perspective. But it still provides no advantage over simply writing a Python script to read your data, process it, and dump it to disk.

In addition to being more difficult to code for, Hadoop will also nearly always be slower than the simpler alternatives. SQL queries can be made very fast by the judicious use of indexes - to compute a join, PostgreSQL will simply look at an index (if present) and look up the exact key that is needed. Hadoop requires a full table scan, followed by re-sorting the entire table. The sorting can be made faster by sharding across multiple machines, but on the other hand you are still required to stream data across multiple machines. In the case of processing binary blobs, Hadoop will require repeated trips to the namenode in order to find and process data. A simple python script will require repeated trips to the filesystem.

But my data is more than 5TB!

Your life now sucks - you are stuck with Hadoop. You don't have many other choices (big servers with many hard drives might still be in play), and most of your other choices are considerably more expensive.

The only benefit to using Hadoop is scaling. If you have a single table containing many terabytes of data, Hadoop might be a good option for running full table scans on it. If you don't have such a table, avoid Hadoop like the plague. It isn't worth the hassle and you'll get results with less effort and in less time if you stick to traditional methods.

P.S. The Sales Pitch

I'm building a startup aiming to provide data analysis (big and small) and realtime recommendations and optimization to publishers and e-commerce sites. Go check it out.

P.P.S. Hadoop is a fine tool

I don't intend to hate on Hadoop. I use Hadoop regularly for jobs I probably couldn't easily handle with other tools. (Tip: I recommend using Scalding rather than Hive or Pig. Scalding lets you use Scala, which is a decent programming language, and makes it easy to write chained Hadoop jobs without hiding the fact that it really is mapreduce on the bottom.) Hadoop is a fine tool, it makes certain tradeoffs to target certain specific use cases. The only point I'm pushing here is to think carefully rather than just running Hadoop on The Cloud in order to handle your 500mb of Big Data at an Enterprise Scale.