Comparing Scality RING Object Store & Hadoop HDFS file system

It’s a question I get a lot, so I thought I’d answer it here and point people to this blog post when it comes up again!

So first, an introduction.

What are Hadoop and HDFS?

Hadoop

Apache Hadoop is a software framework that supports data-intensive distributed applications. It’s open source software released under the Apache license. It can work with thousands of nodes and petabytes of data and was significantly inspired by Google’s MapReduce and Google File System (GFS) papers.
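
For readers who have never seen the MapReduce model, here is the canonical word-count example sketched in plain Python, just to show what the map and reduce steps do; real Hadoop jobs are written against Hadoop's Java API, which is not shown here.

```python
# The canonical MapReduce word-count example, sketched in plain Python to show
# the programming model Hadoop implements (this is not the Hadoop API itself).
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # map step: emit a (word, 1) pair for every word in an input line
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # reduce step: sum the counts emitted for each word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
print(reduce_phase(pairs))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```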

HDFS

Hadoop was not fundamentally developed as a storage platform but since data mining algorithms like map/reduce work best when they can run as close to the data as possible, it was natural to include a storage component.

This storage component does not need to satisfy generic storage constraints; it just needs to be good at storing data for map/reduce jobs on enormous datasets, and this is exactly what HDFS does.

About Scality RING object store

About Scality

Our core RING product is a software-based solution that utilizes commodity hardware to create a high performance, massively scalable object storage system.

Our technology has been designed from the ground up as a multi petabyte scale tier 1 storage system to serve billions of objects to millions of users at the same time.

We did not come from the backup or CDN spaces.

Surprisingly for a storage company, we came from the anti-abuse email space for internet service providers.

Why did we develop it?

Scality RING object store architecture
The initial problem our technology was born to solve is the storage of billions of emails – that is: highly transactional data, crazy IOPS demands and a need for an architecture that’s flexible and scalable enough to handle exponential growth. Yes, even with the likes of Facebook, flickr, twitter and youtube, email storage still more than doubles every year, and it’s accelerating!

Rather than dealing with a large number of independent storage volumes that must be individually provisioned for capacity and IOPS needs (as with a file-system based architecture), RING instead mutualizes the storage system. Essentially, capacity and IOPS are shared across a pool of storage nodes in such a way that it is not necessary to migrate or rebalance users should a performance spike occur. This removes much of the complexity from an operational point of view, as there is no longer a strong affinity between where the user metadata is located and where the actual content of their mailbox is.

Another big area of concern is under-utilization of storage resources: it’s typical to see SAN disk arrays less than half full because of IOPS and inode (number of files) limitations. Because multiple RINGs can be composed one after the other or in parallel, we designed automated tiered storage that takes care of moving data to less expensive, higher-density disks according to object access statistics. For example, 7K RPM drives can be used for large objects and 15K RPM or SSD drives for small files and indexes. In this way, we can make the best use of different disk technologies, namely, in order of performance: SSD, SAS 10K and terabyte-scale SATA drives.
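
To make the tiering idea concrete, here is a minimal sketch of such an access-based placement rule. The RING names, the size threshold and the access-count cutoff are invented for the example; Scality's actual policy engine is configurable and works differently in detail.

```python
# Illustrative sketch only, not Scality code. It mimics the tiering rule
# described above: small or frequently accessed objects go to a fast RING
# (SSD / 15K RPM drives), large and cold objects go to a dense, cheaper RING
# (7K RPM SATA drives). All names and thresholds below are invented.

FAST_TIER = "ring-ssd"        # hypothetical RING backed by SSD / 15K RPM drives
CAPACITY_TIER = "ring-sata"   # hypothetical RING backed by high-density 7K RPM drives

SMALL_OBJECT_BYTES = 64 * 1024   # assumed cutoff for "small files and indexes"
HOT_ACCESS_COUNT = 10            # assumed access-count threshold per stats window

def choose_tier(size_bytes: int, recent_accesses: int) -> str:
    """Pick a storage tier from object size and access statistics."""
    if size_bytes <= SMALL_OBJECT_BYTES or recent_accesses >= HOT_ACCESS_COUNT:
        return FAST_TIER
    return CAPACITY_TIER

# Example: a small, hot index block stays on the fast tier,
# a large, rarely read object moves to the capacity tier.
print(choose_tier(4 * 1024, 25))         # -> ring-ssd
print(choose_tier(50 * 1024 * 1024, 1))  # -> ring-sata
```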

To remove the typical limitation on the number of files stored on a disk, we use our own data format to pack objects into larger containers (a minimal sketch of the idea follows the list below). This actually solves multiple problems:

  • write IO load is more linear, meaning much better write bandwidth
  • each disk or volume is accessed through a dedicated IO daemon process and is isolated from the main storage process; if a disk crashes, it doesn’t impact anything else
  • billions of files can be stored on a single disk
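
As an illustration of the container idea, here is a minimal sketch that packs objects into a single append-only file and locates them through an in-memory (offset, length) index. This is a toy model of the approach, not Scality's actual on-disk format.

```python
# Toy model of packing many small objects into one large container file.
# Objects are appended sequentially (linear write IO) and found again through
# an in-memory index of (offset, length). Not Scality's real data format.
import os

class Container:
    """Many small objects packed into one append-only file, located by an index."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}                    # object key -> (offset, length)
        open(path, "ab").close()           # make sure the container file exists
        self.size = os.path.getsize(path)  # next write offset

    def put(self, key: str, data: bytes) -> None:
        # Appending keeps the write IO pattern linear (sequential disk writes).
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[key] = (self.size, len(data))
        self.size += len(data)

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

c = Container("objects.container")
c.put("mail:12345", b"hello world")
print(c.get("mail:12345"))   # b'hello world'
```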

Comparison matrix

Let’s compare both systems point by point:

Architecture
  • Hadoop HDFS: Centralized around a name node that acts as a central metadata server, plus any number of data nodes.
  • Scality RING: Fully distributed architecture using consistent hashing in a 20-byte (160-bit) key space. Every node server runs the same code.

Single Point of Failure
  • Hadoop HDFS: The name node is a single point of failure; if the name node goes down, the filesystem is offline.
  • Scality RING: No single point of failure; metadata and data are distributed across the cluster of nodes.

Clustering/nodes
  • Hadoop HDFS: Static configuration of name nodes and data nodes.
  • Scality RING: Peer-to-peer algorithm based on CHORD, designed to scale past thousands of nodes. The complexity of the algorithm is O(log(N)), N being the number of nodes. Nodes can enter or leave while the system is online.

Replication model
  • Hadoop HDFS: Data is replicated on multiple nodes; no need for RAID.
  • Scality RING: Data is replicated on multiple nodes; no need for RAID.

Disk Usage
  • Hadoop HDFS: Objects are stored as files, with the typical inode and directory tree issues.
  • Scality RING: Objects are stored in an optimized container format that linearizes writes and reduces or eliminates inode and directory tree issues.

Replication policy
  • Hadoop HDFS: Global setting.
  • Scality RING: Per-object replication policy, between 0 and 5 replicas. Replication is based on projection of keys across the RING and adds no overhead at runtime, as replica keys can be calculated and do not need to be stored in a metadata database.

Rack aware
  • Hadoop HDFS: Rack-aware setup supported in 3-copies mode.
  • Scality RING: Rack-aware setup supported.

Data center aware
  • Hadoop HDFS: Not supported.
  • Scality RING: Yes, including asynchronous replication.

Tiered storage
  • Hadoop HDFS: Not supported.
  • Scality RING: Yes, RINGs can be chained or used in parallel. A plugin architecture allows the use of other technologies as a backend, for example dispersed storage or an iSCSI SAN.
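
To illustrate the two RING-side mechanisms above, placement by consistent hashing in a 160-bit key space and replica keys that are computed rather than stored, here is a small Python sketch. The equal-spacing projection used for the replicas is an assumption made for this example, not Scality's exact algorithm.

```python
# Sketch of placement on a 160-bit consistent-hashing ring and of deriving
# replica keys by projection. Illustration only; the projection used here
# (equal spacing around the ring) is an assumption, not Scality's algorithm.
import hashlib
from bisect import bisect_left

KEY_SPACE = 2 ** 160   # the 20-byte (160-bit) key space mentioned above

def object_key(name: str) -> int:
    # Hash a name into the key space (SHA-1 gives exactly 160 bits).
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def replica_keys(key: int, replicas: int) -> list:
    # Replica keys are computed by projecting the key across the ring,
    # so they never need to be stored in a metadata database.
    return [(key + i * KEY_SPACE // replicas) % KEY_SPACE for i in range(replicas)]

def node_for(key: int, node_positions: list) -> int:
    # Each node owns the arc of the ring that ends at its position;
    # the owner of a key is its successor on the ring (with wrap-around).
    i = bisect_left(node_positions, key)
    return node_positions[i % len(node_positions)]

nodes = sorted(object_key("node-%d" % n) for n in range(6))  # 6 example nodes
key = object_key("mailbox/alice/msg-0001")
for rk in replica_keys(key, 3):
    print(hex(node_for(rk, nodes)))   # ring position of the node holding each replica
```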

Conclusion – Domain Specific Storage?

The FS part in HDFS is a bit misleading: it cannot be mounted natively to appear as a POSIX filesystem, and that is not what it was designed for. As a distributed processing platform, Hadoop needs a way to reliably and practically store the large datasets it needs to work on, and pushing the data as close as possible to each computing unit is key for obvious performance reasons.

As I see it, HDFS was designed as a domain specific storage component for large map/reduce computations. Its usage can possibly be extended to similar specific applications.

Scality RING can also be seen as domain specific storage; our domain being unstructured content: files, videos, emails, archives and other user generated content that constitutes the bulk of the storage capacity growth today.

Scality RING and HDFS share the fact that neither would be suitable for hosting the raw files of a MySQL database; however, they do not try to solve the same problems, and this shows in their respective designs and architectures.
