Facebook MyRocks at MariaDB

Recently my colleague Rasmus Johansson announced that MariaDB is adding support for the Facebook MyRocks storage engine. Today I’m going to share a bit more on what that means for MariaDB users. Members of the Facebook Database Engineering team helped us answer some questions we think our community will have about MyRocks.

Benefits of MariaDB Server’s Extensible Architecture

Before discussing specifics of MyRocks, new readers may benefit from a description of MariaDB Serverarchitecture, which is extensible at every layerincluding the storage layer. This means users and the community can add functionality to meet unique needs. Community contributions are one of MariaDB’s greatest advantages over other databases, and a big reason for us becoming the fastest growing open source database in the marketplace.

Openness in the storage layer is especially important because being able to use the right storage engine for the right use case ensures better performance optimization. Both mysql and MariaDB support InnoDB - a well known, general purpose storage engine. But InnoDB is not suited to every use case, so the MariaDB engineering team is extending support for additional storage engines, including Facebook’s MyRocks for workloads requiring greater compression and IO efficiency, and MariaDBColumnStore (currently inbeta), which will provide faster time-to-insight with Massively Parallel Execution (MPP).

Facebook MyRocks for MariaDB

When searching for a storage engine that could give greater performance for web scale type applications, MyRocks was an obvious choice because of its superior handling of data compression and IO efficiency. Besides that, its LSM architecture allows for very efficient data ingestion, like read-free replication slaves, or fast bulk data loading.

As we add support for new storage engines, many of our current users may ask, “What happens to MariaDB’s support for InnoDB? Do I have to migrate?” Of course not! We have no plans to abandon InnoDB. InnoDB is a proven storage engine and we expect it to continue to be used by MariaDB users. But we do expect that deployments that need highest possible efficiency will opt for MyRocks because of its performance gains and IO efficiency. Over time, as MyRocks matures we expect it will become appropriate for even more use cases.

The first MariaDB version of MyRocks will be available in a release candidate of MariaDB Server 10.2 coming this winter. Our goal is for MyRocks to work with all MariaDB features, but some of them, like optimistic parallel replication, may not work in the first release. MariaDB is an open source project that follows the "release often, release early" approach, so our goal is to first make a release that meets core requirements, and then add support for special cases in subsequent releases.

Now let’s move onto my discussion with Facebook’s Database Engineering team!

Can you tell us a bit about the history of RocksDB at Facebook? In 2012, we started to build an embedded storage engine optimized for flash-based SSD, by forking LevelDB. The fork became RocksDB, which was open-sourced on November 2013 [1] . After RocksDB proved to be an effective persistent key-value store for SSD, we enhanced RocksDB for other platforms. We improved its performance on DRAM in 2014 and on hard drives in 2015, two platforms with production use cases now.

Over the past few years, we've introduced numerous features and improvements. To name a few, we built compaction filter and merge operator in 2013, backup and column families in 2014, transactions and bulk loading in 2015, and persistent cache in 2016. See the list of features that are not in LevelDB: https://github.com/facebook/rocksdb/wiki/Features-Not-in-LevelDB .

Early RocksDB adopters at Facebook such as the distributed key-value store ZippyDB [2], Laser [2] and Dragon [3] went into production in early 2013. Since then, many more new or existing services at Facebook started to use RocksDB every year. Now RocksDB is used in a number of services across multiple hardware platforms at Facebook. [1] https://code.facebook.com/posts/666746063357648/under-the-hood-building-and-open-sourcing-rocksdb/ and http://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html [2] https://research.facebook.com/publications/realtime-data-processing-at-facebook/ [3] https://code.facebook.com/posts/1737605303120405/dragon-a-distributed-graph-query-engine/ Why did FB go down the RocksDB path for MySQL?

MySQL is a popular storage solution at Facebook because we have a great team dedicated to running MySQL at scale that provides a high quality of service. The MySQL tiers store many petabytes of data that have been compressed with InnoDB table compression. We are always looking for ways to improve compression and the LSM algorithm used by RocksDB has several advantages over the B-Tree used by InnoDB. This led us to MyRocks: RocksDB is a key-value storage engine. MyRocks implements that MySQL storage engine API to make RocksDB work with MySQL and provide SQL functionality. Our initial goal was to get 2x more compression from MyRocks than from compressed InnoDB without affecting read performance. We exceeded our goal. In addition to getting 2x better compression, we also got much lower write rates to storage, faster database loads, and better performance.

Lower write rates enable the use of lower endurance flash, and faster loads simplify the migration from MySQL on InnoDB to MySQL on RocksDB. While we don't expect better performance for all workloads, the way in which we operate the database tier for the initial MyRocks deployment favors RocksDB more than InnoDB. Finally, there are features unique to an LSM that we expect to support in the future, including the merge operator and compaction filters. MyRocks can be helpful to the MySQL community because of efficiency and innovation.

We considered multiple write-optimized database engines. We chose RocksDB because it has excellent performance and efficiency and because we work directly with the team. The MyRocks effort has benefited greatly from being able to collaborate on a daily basis with the RocksDB team. We appreciate that the RocksDB team treats us like a very important customer. They move fast to make RocksDB better for MyRocks.

How was MyRocks developed? MyRocks is developed by engineers from several locations across the globe. The team had the privilege to work with Sergey Petrunia right from the beginning, and he is based in Russia. At Facebook's Menlo Park campus, Siying Dong leads RocksDB development and Yoshinori Matsunobu leads the collaboration with MySQL infrastructure and data performance teams. From the Seattle office, Herman Lee worked on the initial validation of MyRocks that gave the team the confidence to proceed with MyRocks for our user databases as well as led the MyRocks feature development. In Oregon, Mark Callaghan has been benchmarking all aspects of MyRocks and

Let’s start by creating a table, where we will store the CIDR, split in two columns: one for the IPv6 address and one for the network length. The most compact way of storing IPv6 values is to use the binary(16) .

CREATETABLE `cidr` ( `id` int(11) NOT NULL AUTO_INCREMENT, `ip_address` binary(16) NOT NULL, `net_len` int(11) NOT NULL, PRIMARYKEY (`id`) ) ENGINE=InnoDB;

In order to generate some random data, I will use a stored procedure which will insert 100,000 IP addresses in the previously created table. The addresses will be generated by inserting the first 16 bytes of the current time-stamp’s SHA value.

DELIMITER // CREATEPROCEDUREgenerate_ips(no_ipsINT) BEGIN DECLARE x INT DEFAULT 0; DECLARE ipbinary(16); DECLARE rand_net_lenINT; REPEAT SET x = x + 1; # Generate a random ip address SETip= substring(unhex(sha(RAND())), 1, 16); # Generate a random network length between 64 and 123 SETrand_net_len= 64 + FLOOR(RAND() * 60); INSERTINTOcidr(ip_address, net_len) values (ip, rand_net_len); UNTIL x > no_ipsEND REPEAT; END // DELIMITER ; # Generate the IPs CALLgenerate_ips(100000);

Ifyou want to insert the values from a human-readable string, MySQL provides the INET6_ATON("2001:0db8:85a3:0000:0000:8a2e:0370:7334") function which will strip the colons (“:”)and convertthe 32 characters to a binary(16) string.

Next Iwill select the first 3 IP address. In order to do thatI will use the INET6_NTOA function, which will convert from the binary format to the human-readable format.

mysql> SELECTid, INET6_NTOA(ip_address), net_lenFROMcidrLIMIT 3; +----+-----------------------------------------+---------+ | id | INET6_NTOA(ip_address)| net_len | +----+-----------------------------------------+---------+ |1 | 200d:31c4:1905:9eb2:3c7f:c45c:de78:42cd |97 | |2 | 59b0:c4d6:48b4:3717:f031:d05b:705d:6c65 |95 | |3 | 788e:3f48:e62b:c3bb:da10:6a03:f987:7a16 |110 | +----+-----------------------------------------+---------+ 3 rowsin set (0,01 sec)

Next, I would like to select the network mask, host mask, network address and also generate the network range intervals. For this I will create a few helper functions:

# Returns the net mask based on the network length DELIMITER // CREATEFUNCTION net_mask(net_lenint) RETURNSbinary(16) DETERMINISTIC BEGIN RETURN (~INET6_ATON('::') << (128 - net_len)); END // # Returns the network address using an IP and the network length CREATEFUNCTION subnet(ipBINARY(16), net_lenint) RETURNSbinary(16) DETERMINISTIC BEGIN RETURN ip & ((~INET6_ATON('::') << (128 - net_len))); END // # Returns the host mask CREATEFUNCTION host_mask(net_lenint) RETURNSbinary(16) DETERMINISTIC BEGIN RETURN (~INET6_ATON('::') >> net_len); END // DELIMITER ;