[Repost] Terracotta's Scalability Story

Introduction

Open Terracotta is a product that delivers JVM-level clustering as a runtime infrastructure service. It is Open Source and available under a Mozilla-based license.

Open Terracotta provides Java applications with a runtime environment that allows developers to trust critical parts of the heap as reliable and capable of scaling through shared access across multiple machines.  The technology hinges on a clustering server that keeps a shared view of objects across JVMs.  The key question around the scalability of a Terracotta-based application can only be answered by an analysis of the architecture and the alternatives.

Terracotta uses bytecode instrumentation to adapt the target application at class load time.  In this class-loading phase it extends the application in order to ensure that the semantics of the Java Language Specification (JLS) are correctly maintained across the cluster, including object references, thread coordination, garbage collection, and so on.  It is also important to mention that Terracotta does not use Java Serialization, which means that any POJO can be shared across the cluster.  This also means that Terracotta does not send the whole object graph of a POJO's state to all nodes; it breaks the graph down into pure field-level data and sends only the actual "delta" over the wire, meaning the actual changes: the data that is stale on the other node(s).
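To make the class-adaptation idea concrete, here is a minimal sketch, in plain Java, of what load-time instrumentation might conceptually add to a setter on a shared POJO.  The ClusterRuntime hook and its recordFieldChange method are invented for illustration; this is not Terracotta's actual generated code or API.

```java
// Hypothetical illustration only -- not Terracotta's actual generated code or API.
public class Person {
    private String name;
    private int age;
    private Person child;

    // Original source: public void setAge(int age) { this.age = age; }
    // Conceptually, after instrumentation the same write also records a field-level delta:
    public void setAge(int age) {
        this.age = age;
        ClusterRuntime.recordFieldChange(this, "age", age); // hypothetical runtime hook
    }
}

class ClusterRuntime {
    // Hypothetical hook: a real runtime would batch (objectId, fieldName, newValue)
    // into the current clustered transaction and ship only that delta to the server.
    static void recordFieldChange(Object target, String fieldName, Object newValue) {
        // sketch only
    }
}
```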

Terracotta is based on a central server architecture.  The scalability and operations of server architectures are well understood by IT organizations.  Historically, the network and the database have seen much success and maturation, and both are based on such an approach.  If we compare apples to apples, moving objects around a cluster of JVMs using various approaches, centralized servers may or may not be faster than peer-to-peer approaches.  If we stop moving object graphs and start moving only fields as those fields change (int for int, String for String, etc.), we end up changing the game.  Let us now analyze how a central server might scale better than peer-to-peer, how peer-to-peer at scale ends up operating as multiple central servers, and how the addition of fine-grained replication and the passing of object requests to a single arbiter of state can in fact produce a highly scalable solution.

The Architecture

The core of the Open Terracotta architecture is a clustering server.  It can be viewed as many things, from the memory equivalent of a network-attached storage server (in other words, network-attached memory) to a traffic cop that watches conversations between applications and end users, and the resulting demand for objects in each application's underlying JVM, and makes sure all data flows correctly around the cluster.

The Terracotta Server is designed to help an application scale and simultaneously be highly available.  As a result, there are inherent trade-offs for one in favor of the other.  It is Terracotta's job to make these trade-offs in an intelligent manner.  Let us analyze the basic thinking behind the design now.

First, we must understand the data format for objects in the server.  Each field of each object that is being shared is stored discretely.  Essentially, think of each object as normalized into key-value pairs where, in some cases, a key-value pair is simply a reference to a child object.  For example, if Adam is a person and he is the first object in our cluster of people, and Adam has a son Cain, then Adam may have an Integer "age" field and a String "name" field.  He would also have an object reference in a "child" field which points to Cain.  So, we can store all of this in the following logical structure (not Terracotta's actual storage format):

Terracotta ID    Key      Value
1                Name     Adam
1                Age      71
1                Child    2
2                Name     Cain
2                Age      24
2                Child    Null

This means that a single field like "Adam's Age" can be updated by itself both in-memory in the cluster as well as in any durable disk-based storage that a Terracotta user may choose to leverage. It also means that the object can be kept synchronized across multiple JVMs without having to share a single memory space.
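As a purely illustrative sketch (not Terracotta's storage engine or API), the Adam/Cain graph above could be flattened into per-field entries like this; the FieldEntry record and LogicalStore class are invented names:

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: one entry per (object id, field name, value), mirroring the
// logical structure shown above. This is not Terracotta's actual storage engine.
record FieldEntry(long objectId, String key, Object value) {}

class LogicalStore {
    // The Adam/Cain example flattened into field-level entries; an object reference
    // is stored as the child's Terracotta ID rather than as a nested graph.
    static final List<FieldEntry> ENTRIES = Arrays.asList(
            new FieldEntry(1, "Name", "Adam"),
            new FieldEntry(1, "Age", 71),
            new FieldEntry(1, "Child", 2L),
            new FieldEntry(2, "Name", "Cain"),
            new FieldEntry(2, "Age", 24),
            new FieldEntry(2, "Child", null));

    // Updating "Adam's Age" touches exactly one entry, e.g. (1, "Age", 72), both
    // in memory and in any durable store a user chooses to put behind the server.
}
```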

Each application-level JVM that participates in the cluster is called an L1 client.  This is because, if we were to view the application cluster as a large SMP motherboard, the JVMs are the CPUs, and expensive I/O needs to be avoided to keep our logical CPUs from going idle.  I/O is avoided by leveraging L1 caching, both on the motherboard and in the Terracotta analogy.  The Terracotta Server can be thought of as a shared L2 for all machines to leverage.  Access to the L2 costs more than access to the L1—in the case of Terracotta, this is because the L2 is across a network connection from all JVMs—but the L2 is cheaper than other I/O, such as database or messaging calls, that would otherwise be invoked.

An aside on correctness: the L2's role in ensuring correctness for the clustered application should not be underestimated.  Even though Terracotta is used to store transient data (data not later needed for data warehousing or for compliance and reporting), it is in fact the system of record for that short-lived data.  If Terracotta were to allow inconsistency in the data, the application would be broken in ways the developer had not anticipated (race conditions, for example).  Furthermore, if Terracotta were to be in-process with the JVM, but were to scale by replicating data through a sort of "buddy system," the data would not be resilient to operator intervention (a cluster-wide reboot, for example).  So, we chose a server-based architecture for at least one reason: the operations team cannot break the clustered application simply by restarting all the application servers.

Second, and in our opinion more important than the ability to restart, is avoiding the network partitioning scenario classically referred to as "split brain."  Network partitioning is a clustering flaw wherein part of the networked application believes it is the entirety of the cluster (say, servers 'A' and 'B') while another group believes it is the entire cluster (say, servers 'C' and 'D').  So A and B speak only to each other about object updates while C and D speak only to each other.  Meanwhile, end-user transactions are being requested from A, B, C, and D, and thus the collective brain has been split: we have two different views of the world, the A+B view and the C+D view, and any transaction that hops from A+B to C+D, or vice versa, will be very confused.

The presence of a clustering server eliminates these two classes of problem.  Operators can safely restart any application servers as well as Terracotta itself and nothing is lost.  Plus, application servers are either talking to the server or not; so there is no question whether we have a cohesive cluster at any point in time.  But all this correctness and availability comes at the cost of scalability, right?  Surely the server is a bottleneck and eventually fails to scale.  Honestly, the answer is both yes and no.

Scalability

While Terracotta is logically a bottleneck, there are two reasons it is good to have a server in the name of scalability:

  • The central nature of the server means that no JVMs have to discover where data is stored or which data needs to be updated.  We call this an O(1) algorithm for clustering as opposed to an O(n) algorithm.  (We will get into how O(n) creeps back in but in a different manner than in peer-to-peer clustering).
  • The server sees all data flow amongst the JVMs, and can therefore keep track of what is needed in each JVM's heap.  This means O(n) network conversations may never occur (for example, sticky sessions may never be accessed by more than one application server in a web application cluster) even though we have objects capable of being shared between arbitrary JVMs.

Let us address each reason in detail now.  The concept of Big-O algorithm analysis is outside the scope of this document, but O(1) can be summarized as a "constant time" algorithm.  This does not mean that Terracotta's architecture completes operations in one second, or in any specific amount of time.  It refers to the architecture's ability to execute certain operations in an amount of time not specifically coupled to the number of nodes in the cluster.

O(1) versus O(n)

At first glance, this concept seems counter to clustering.  If an application is distributed across four nodes as opposed to two, should that cluster not take twice as many operations to update objects on all four nodes?  If a clustering architecture were to send updates one by one, then four nodes would achieve half the clustered throughput of two.  And, if we were to use a multicast approach, we would lower the elapsed time for the update by applying cluster updates in parallel.  But, if we confirm (ACK) those updates in all clustered JVMs—and for the sake of correctness we really ought to acknowledge all updates on all nodes where the object lies resident—we still have to wait for three acknowledgements to each update in a four-node cluster, and our design is thus O(n).

It might be easier to conceptualize the O(n) nature of a clustered application's networking conversation when thinking about locks or transactions instead of updates.  An object could be optimistically cached on a node and thus, not need to be updated synchronously, but a transaction must be flushed and acknowledged before it is complete (otherwise it is not a transaction at all).  If all nodes have to acknowledge the transaction, we have O(n) performance for n-node clusters.  But, if we instead think of every JVM in the cluster as clustered with only one peer and if we somehow make that peer a coincident peer to every other JVM, then all JVMs only update their one peer.  Now, the architecture is such that updates, transactions, etc. work in a fixed number of operations in that the central server is the only cluster member who needs to acknowledge updates.  Those transactions can be small in that they only represent the changed data.  This is how the clustering server supports many connected L1 clients; the number of clients is less important than the amount of data that is changing and shared amongst them.
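A back-of-the-envelope model of the acknowledgement counts described above, assuming (as in the argument here) that every node holding the object must ACK in the peer-to-peer case; the class and numbers are purely illustrative:

```java
// Illustration only: acknowledgements a writer waits for per committed transaction.
class AckModel {
    // Peer-to-peer with the object resident on all n nodes: wait for n - 1 ACKs, i.e. O(n).
    static int peerToPeerAcks(int nodes) {
        return nodes - 1;
    }

    // Hub-and-spoke (central server): wait for exactly one ACK, from the server,
    // regardless of cluster size, i.e. O(1).
    static int centralServerAcks(int nodes) {
        return 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 4, 16, 64}) {
            System.out.printf("nodes=%d  peer-to-peer ACKs=%d  central-server ACKs=%d%n",
                    n, peerToPeerAcks(n), centralServerAcks(n));
        }
    }
}
```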

Eventually, it becomes necessary to scale the server to multiple server instances.  In order to scale the server, we have to find an O(1) scaling algorithm or we have lost the benefit of the server altogether.  In short, one exists, in that nothing in the specification of the server requires that all objects be housed in the same server.  Since we are storing objects at a fine-grained field level, we can store parts of objects separately from each other.  Note that it would be a misnomer to assert that the servers are clustered.  They share essentially nothing with each other and can each work as automatons.  So the server is (servers are) O(1) and capable of scaling.  This starts to sound a lot like a peer-to-peer cluster, except that we never see an O(n) performance bottleneck because we need not push data among the clustering servers.  It is more like a network fabric where different subnets route through different switches on different floors and never see each other's network traffic.
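One simple way to picture how objects could be spread across independent server instances is a client-side mapping from object id to server.  The modulo scheme and the addresses below are assumptions made for the sketch, not Terracotta's actual partitioning strategy:

```java
// Illustration only: clients route each object to one of several independent server
// instances by id; the servers never need to talk to each other.
class ServerStripe {
    private final String[] serverAddresses;

    ServerStripe(String... serverAddresses) {
        this.serverAddresses = serverAddresses;
    }

    // Every client computes the same mapping locally, so no extra lookup hop is
    // introduced and the mapping stays O(1) regardless of cluster size.
    String serverFor(long objectId) {
        return serverAddresses[Math.floorMod(objectId, serverAddresses.length)];
    }
}

// Usage sketch (addresses are made up):
//   ServerStripe stripe = new ServerStripe("l2-a:9510", "l2-b:9510", "l2-c:9510");
//   stripe.serverFor(42L);   // the same server answers for object 42 on every client
```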

As suggested earlier, there is no free ride.  O(n) creeps into server-centric architectures if an object is always needed on all JVMs.  Eventually that object must be updated from the central server to all JVMs, regardless of which single JVM changes it.  But note that, until any one JVM attempts to re-read the object, the central server does not need to push the object to that JVM.  This is because, under Terracotta's architecture, the server is guaranteed to be called upon when entering into a transaction, and the server always knows whether objects in an L1 are stale and need to be updated.  Thus, while we eventually update all JVMs, we update those JVMs in precisely targeted O(1) fashion, in a buddy conversation with the server and only the server.  In other words, object updates do not have to wait for a cluster-wide ACK in all cases.

Fine Grained Updates

While O(1) object updates and bookkeeping in the server are important to the scalability of Terracotta's architecture, the next feature lends at least as much credence to Terracotta's claims about its ability to scale, if not more.  Since Terracotta breaks objects into their constituent fields, only the data representing the changed fields needs to be pushed to the Terracotta Server in order to complete an update.  And JVMs can store parts of objects, faulting in only what they need from the L2 (the central server), which is not otherwise possible.

Fine-grained changes lead to less data movement over the network.  They also allow Terracotta to persist objects to disk efficiently, since a 4-byte change to a 1-kilobyte object, for example, writes only 4 bytes to the Terracotta Server's disk.  In the low-latency race that big businesses run today, no achievable latency can be lower than that of the network beneath the clustered application.  Specifically, if a packet takes a minimum of 1 ms to move between servers, then the less data added on top of that 1 ms latency, the closer to 1 ms the clustering technology can get when updating objects.  Thus, a technology that moves only object field-level changes is more likely to achieve far lower latencies (more accurately, higher throughput) than any approach that moves more data on the same underlying network technology.
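For a rough sense of the difference, here is the arithmetic for the 4-byte-change-to-a-1-kilobyte-object example above, at an assumed update rate chosen purely for illustration:

```java
// Rough arithmetic only: bytes on the wire for whole-object replication versus
// field-level deltas. The update rate is an assumed number for illustration.
class DeltaArithmetic {
    public static void main(String[] args) {
        long wholeObjectBytes = 1024;    // ship the whole 1 KB object per update
        long deltaBytes = 4;             // ship only the changed 4-byte field per update
        long updatesPerSecond = 10_000;  // assumed rate

        System.out.printf("whole-object replication: %d KB/s%n",
                wholeObjectBytes * updatesPerSecond / 1024);
        System.out.printf("field-level deltas:       %d KB/s (plus framing overhead)%n",
                deltaBytes * updatesPerSecond / 1024);
    }
}
```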

Bookkeeping and Run Time Optimization

So far we have only touched upon the power of bookkeeping in the L2 to accelerate application performance.  Let us now discuss a Terracotta-specific feature that illustrates the power of codeless clustering and the ability of the runtime to optimize based on data it gathers about the application at runtime.

The notion of a clustered lock appears on the surface to be an expensive construct compared to the act of acquiring a mutex inside the kernel of an operating system.  But nothing about clustering requires that we actually grab a clustered construct.  In fact, quite the opposite occurs under the notion of a greedy lock.  With the L2, the clustering software can count the number of times a lock is contended.  If that lock turns out to be uncontested in its use by one JVM, and if that lock seems to be locked and unlocked frequently by that JVM or a thread in that JVM, then an inversion of control occurs.

Greedy locks are those in which the cluster hands the lock to a particular JVM for local-only use and asks for the lock back only when another JVM needs it.  The cluster most certainly needs the ability to reclaim locks from failed JVMs, or from JVMs that have in some way begun to starve other JVMs of the lock, but this is all possible as long as one central authority decides which JVMs are alive, which are dead, and which are rapidly acquiring and releasing locks.  Such optimizations are not easily achieved in compiled code, simply because the developer would need to know the locking hotspots in the application beforehand.  And those hotspots may be data-driven or unique to a point in time and may never occur again in the application's lifetime.  Furthermore, if a peer-to-peer group of servers were to attempt to keep track of each other's lock/unlock data, they would have to either share statistics with each other as the usage data changed, or risk losing optimization opportunities that are short-lived and thus, in the game of averages, get the cluster optimization decisions wrong altogether.
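A highly simplified sketch of the server-side bookkeeping a greedy-lock scheme implies; the threshold, the string results, and the class itself are assumptions for illustration, not Terracotta's actual lock manager:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, assumed sketch of greedy-lock bookkeeping on a central server.
class GreedyLockManager {
    private static final int GREEDY_THRESHOLD = 100; // uncontended acquires before going greedy (assumed)

    private final Map<String, String> greedyOwner = new HashMap<>();      // lockId -> client holding it greedily
    private final Map<String, Integer> uncontendedCount = new HashMap<>();

    // Called for every lock request that reaches the server.
    synchronized String acquire(String lockId, String clientId) {
        String owner = greedyOwner.get(lockId);
        if (owner != null && !owner.equals(clientId)) {
            // Another client wants the lock: recall it so it becomes a clustered lock again.
            greedyOwner.remove(lockId);
            uncontendedCount.put(lockId, 0);
            return "RECALL_FROM_" + owner;
        }
        int count = uncontendedCount.merge(lockId, 1, Integer::sum);
        if (count >= GREEDY_THRESHOLD) {
            // One client keeps acquiring this lock with no contention:
            // hand it over for local-only use until someone else asks for it.
            greedyOwner.put(lockId, clientId);
            return "GRANTED_GREEDILY";
        }
        return "GRANTED";
    }
}
```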

So, the central server is in fact a strength in scaling applications as opposed to the weakness it appears to be on first assessment.  And, we have asserted (arguably with little proof, but availability is a separate discussion from this analysis’s focus on scalability) that server-centric architectures are easier for operations teams to make highly available.  But what about some real-world data?

Use Case

In a recent use case, Terracotta encountered a customer whose application wrote thousands of times per second to objects that were shared on every JVM.  The customer devised the following transaction benchmark:

  • Each object in the system contains a list of 100 sub-objects.  Each sub-object has four 250-byte Strings, two ints, and two longs, making each sub-object roughly 1 KB (and each object roughly 100 KB).
  • Seed the system with 100,000 objects (each with its 100 sub-objects, thus populating the system with approximately 10,000,000 Java objects totaling roughly 10 GB of data).
  • Monotonically increase the number of objects in the system, adding at least 5,000 objects per second (counting sub-objects, we are actually introducing 500,000 Java objects per second, in 5,000 batches).

To this system we introduce transactions, each defined as selecting a random object from the 100,000-plus objects.  We then pick 10 sub-objects from that object, read all the fields of 5 of those sub-objects (by assigning them to method-local variables), and write all the fields of the 5 other sub-objects.  The customer needed this type of "transaction" to occur 80,000 times per second.  They tested several solutions, including in-memory databases and Terracotta.  (Note that peer-to-peer grids were not an option: there are thousands of cluster members spread across many data centers worldwide; the real data would not fit in the machines' RAM without significant partitioning that would end up adding thousands more servers; and, critically, any one JVM had less than a 5% likelihood of seeing requests for the same object twice in the object's lifetime, so using the JVM's RAM for grid-enabled storage was deemed wasteful.)
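A rough sketch of the benchmark "transaction" as described above; the class and field names are invented for illustration and are not the customer's actual code:

```java
import java.util.List;
import java.util.Random;

// Illustrative sketch of the described benchmark transaction.
class BenchmarkTransaction {
    static class SubObject {
        String s1, s2, s3, s4;   // four ~250-byte Strings
        int i1, i2;              // two ints
        long l1, l2;             // two longs
    }

    static class TopObject {
        List<SubObject> subObjects;   // 100 sub-objects per top-level object
    }

    private final Random random = new Random();

    void run(List<TopObject> sharedRoot) {
        TopObject picked = sharedRoot.get(random.nextInt(sharedRoot.size()));
        synchronized (picked) {   // would act as a clustered lock on the chosen object
            // Pick 10 sub-objects: read all fields of the first 5, write all fields of the rest.
            for (int i = 0; i < 10; i++) {
                SubObject s = picked.subObjects.get(random.nextInt(picked.subObjects.size()));
                if (i < 5) {
                    String a = s.s1, b = s.s2, c = s.s3, d = s.s4;   // reads into locals
                    int x = s.i1, y = s.i2;
                    long p = s.l1, q = s.l2;
                } else {
                    s.s1 = "w"; s.s2 = "x"; s.s3 = "y"; s.s4 = "z";  // only these field deltas travel
                    s.i1++; s.i2++;
                    s.l1++; s.l2++;
                }
            }
        }
    }
}
```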

The results were resoundingly in Terracotta's favor.  With client JVMs (L1s) configured not to cache any data at all, the L2 was capable of running 9,500 of these complex transactions per second.  It is worth noting that 9,500 transactions per second on a single 4-way Linux server amounts to an average of 9.5 transactions per millisecond.  Clearly this cannot be achieved naively, even on gigabit Ethernet (which is what the customer was using): if each of those 9.5 transactions made its own network call, each would suffer a round-trip network penalty on the order of one millisecond, so the transactions would take 9 to 10 milliseconds instead of the 1 millisecond they take on average.  We can safely conclude that the reads and writes are traveling in some sort of efficient batch.  Terracotta's ability to batch object reads and writes together, based on intelligent pre-fetch algorithms much like those in disk I/O (block read-ahead and chaining/collocating), contributes to the system's ability to complete more transactions per unit time than expected.

To get to 80,000 transactions per second, the customer operates a total of 20 Terracotta Servers (half as hot-standbys in case the primary 10 L2s fail).  And, since in each transaction only the 10 sub-objects being read from and written to move on the network, the customer realized the benefit of being able to distribute Terracotta servers around data centers evenly and to hit those L2s across the WAN in most cases.

Peer-to-peer That Scales

It is important to note that, contrary to the tone of this analysis thus far, Terracotta does not wish to represent that peer-to-peer technologies available in the market today cannot scale.  They do scale.  They scale by some sort of object homing technique whereby objects are stored in only a few servers in the cluster.  If an object 'A' is housed in only 2 servers in the cluster, then any node can update 'A' by speaking to its server home as the authority.  This is just a server done differently (using terms from our discussion, by embedding the L2 server in the L1 client).

Conclusion

As we have seen, a central server architecture provides a reasonable trade-off between availability and scalability.  The Terracotta Server is a physical component in the infrastructure as opposed to a logical one.  It can be managed in that it can be started and stopped, upgraded, failed over, and made fault-tolerant with techniques operations teams understand well.  Terracotta's availability story can be summarized as "I can point to the box and say 'that's where the object data is.  As long as it is around, our data is safe.'"

The server allows the application to share data in an O(1) fashion that scales in two ways: by moving only the data that changes, and only to the nodes that actively request that data, and by dividing, or spreading, objects and fields across multiple central servers.  Critically, a central server can play traffic cop for the networked application and can thus optimize the application for networking by changing the semantics of clustered locks and clustered updates on demand, based on runtime load characteristics.

A few of the world's top one hundred websites have validated the approach for themselves.  And, at small scale—two, four, or more nodes in a cluster—Terracotta has been tested to deliver linear scale from one node on up for the JPetStore use case.  Terracotta does in fact scale.  It makes availability easier to ensure.  And, with heap-level replication as part of the runtime environment, both scalability and availability can be achieved at least a bit more easily than with proprietary solutions that require significant application changes.

Source: http://www.theserverside.com/news/1364132/Terracottas-Scalability-Story

Reposted from: https://www.cnblogs.com/michelleAnn2011/p/2318932.html