DynamoDB: An Inside Look Into NoSQL – Part 7

In Part 6, we discussed handling failures via Hinted Handoff & Replica Synchronization. We also talked about the advantages of using a Sloppy Quorum and Merkle Trees.

In this last & final part of the series, we will look into Membership and Failure Detection.

Ring Membership

In any production environment, node outages are often transient. But it rarely signifies permanent failure and hence there is no need for repair or rebalancing the partition. On the other hand, manual errors might result in unintentional startup of new DynamoDB nodes. A proper mechanism is essential for the addition and removal of nodes from the DynamoDB Ring. An administrator uses a tool to issue a membership change command to either add / remove a node. The node that picks up this request writes into its persistent store the membership change request and the timestamp. A gossip-based protocol transfers the membership changes and maintains a consistent view of membership across all nodes. Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories. Partitioning & placement information also propagates via the gossip-based protocol and each storage node is aware of the token ranges its peers are responsible for. This allows each node to forward a key’s read/write operations to the right set of nodes directly.

Ring Membership (Credit)

External Discovery

It’s best to explain with an example: An administrator joins node A to the ring. He then joins node B to the ring. Nodes A and B consider itself as part of the ring, yet neither would be immediately aware of each other. To prevent these logical partitions, DynamoDB introduced the concept of seed nodes. Seed nodes are fully functional nodes that are discovered via an external mechanism (static configuration or a configuration service) and are known to all nodes. Since each node communicates with the seed node and gossip-based protocol transfer the membership changes, logical partitions are highly unlikely.

Failure Detection

Failure detection in DynamoDB is used to avoid attempts to communicate with unreachable peers during get() and put() operations and when transferring partitions and hinted replicas. For the purpose of avoiding failed attempts at communication, a purely local notion of failure detection is entirely sufficient: node A may consider node B failed if node B does not respond to node A’s messages (even if B is responsive to node C‘s messages). In the presence of a steady rate of client requests generating inter-node communication in the DynamoDB ring, a node A quickly discovers that a node B is unresponsive when B fails to respond to a message; Node A then uses alternate nodes to service requests that map to B‘s partitions; A also periodically retries B to check for the latter’s recovery. In the absence of client requests to drive traffic between two nodes, neither node really needs to know whether the other is reachable and responsive.


This exhaustive 7-part series detailing every component is sufficient to understand the design and architecture of any NoSQL system. Phew! What an incredible journey it has been these couple of months delving into the internals of DynamoDB. Having patiently read this far, you are amongst the chosen few who have this sort of deep NoSQL knowledge. You can be extremely proud of yourself!

Let’s eagerly await another expedition soon!

Article authored by Vijay Olety.

X-Post from CloudAcademy.

Posted in Technical | Tagged , , , , | Leave a comment

DynamoDB: An Inside Look Into NoSQL – Part 6

In Part 5, we spoke about data versioning, the 2 reconciliation strategies and how vector clocks are used for the same. In this article, we will talk about Handling Failures.

Handling Failures – Hinted Handoff

Even under the simplest of failure conditions, DynamoDB would experience reduced durability and availability if the traditional form of quorum approach is used. In order to overcome this it uses a sloppy quorum; all READ and WRITE operations are performed on the first N healthy nodes from the preference list, which may not be the first N nodes encountered by traversing the consistent hashing ring.

Partitioning & Replication of Keys in Dynamo (Credit)

Consider the above figure: If A is temporarily not reachable during a WRITE operation, then the replica that would have lived on A will be sent to D to maintain the desired availability and durability guarantees. The replica sent to D will have a hint in its metadata which tells who the intended recipient was (in our case, it is A). The hinted replicas are stored in a separate local database which is scanned periodically to detect if A has recovered. Upon detection, D will send the replica to A and may delete the object from its local store without decreasing the total number of replicas. Using hinted handoff, DynamoDB ensures that READ and WRITE are successful even during temporary node or network failures.

Highly available storage systems must handle failures of an entire data center. DynamoDB is configured such that each object is replicated across data centers. In terms of implementation detail, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers and these centers are connected via high speed network links.

Handling Permanent Failures – Replica Synchronization

Hinted Handoff is used to handle transient failures. What if the hinted replica itself becomes unavailable? To handle such situations, DynamoDB implements an anti-entropy protocol to keep the replicas synchronized. A Merkle Tree is used for the purpose of inconsistency detection and minimizing the amount of data transferred.

According to Wikipedia, a “Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its children nodes.”. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set. Moreover, Merkle trees help in reducing the amount of data that needs to be transferred while checking for inconsistencies among replicas. For instance, if the hash values of the root of two trees are equal, then the values of the leaf nodes in the tree are equal and the nodes require no synchronization. If not, it implies that the values of some replicas are different. In such cases, the nodes may exchange the hash values of children and the process continues until it reaches the leaves of the trees, at which point the hosts can identify the keys that are “out of sync”.

DynamoDB uses Merkle trees for anti-entropy as follows: Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts. This allows nodes to compare whether the keys within a key range are up-to-date. In this scheme, two nodes exchange the root of the Merkle tree corresponding to the key ranges that they host in common. Subsequently, using the tree traversal scheme described above the nodes determine if they have any differences and perform the appropriate synchronization action. The disadvantage with this scheme is that many key ranges change when a node joins or leaves the system thereby requiring the tree(s) to be recalculated.

In the next and the final part of this series, we will discuss Membership and Failure Detection.

Article authored by Vijay Olety.

X-Post from CloudAcademy.

Posted in Technical | Tagged , , , , , , | Leave a comment

DynamoDB: An Inside Look Into NoSQL – Part 5

In Part 4, we talked about partitioning and replication in detail. We introduced consistent hashing, virtual nodes and the concept of coordinator nodes and preference list. In this article, we will discuss Data Versioning.

Data Versioning

Eventual consistency, introduced by DynamoDB, allows for the updates to be pushed to all storage nodes asynchronously. A put operation returns before the update is pushed to all replicas, which results in scenarios where a subsequent get operation may return a value that does not reflect the latest changes. Depending on network partitions and server outages, not all replicas might have the latest updates even after an extended period of time.

But there are certain requirements that do not need latest updates and can still tolerate certain inconsistencies. One such requirement is “Add To Cart” where put operation should always succeed and a get operation returning an old object is still tolerable. If a user makes a change to an earlier version of the shopping cart object, that change is still meaningful and should be preserved at all costs. However, the currently unavailable latest state of the shopping cart can have its own version of updates which should also be preserved. It is evident that data versioning has to be implemented to handle such scenarios.

In order to achieve such guarantees, DynamoDB treats the result of every modification as a new and immutable version of data. This allows for multiple versions of an object to be present in the system at the same time. If the newer version subsumes the earlier one, then the system can automatically determine the authoritative version (syntactic reconciliation). However, in some cases the versions conflict and the client has to manually perform the reconciliation (semantic reconciliation).

Data Versioning (Credit)

DynamoDB uses vector clocks in order to capture causality between multiple versions of an object. A vector clock is a list of (node, counter) pairs. One vector clock is associated with one version of every object. By analyzing the vector clocks, you can find out if the versions have a causal ordering or are on parallel branches. When a client wishes to perform an update, it must specify which version it is updating. This version can be got from an earlier get operation.

Version evolution of an object (Credit)

Let’s understand how vector clocks works: A client writes a new object. Node Sx handles this write and creates a vector clock [(Sx, 1)] for the object D1. If the client now updates the object and node Sx again handles the request, we get a new object D2 and its vector clock [(Sx, 2)]. The client updates the object again and this time node Sy handles the request leading to object D3 with the vector clock [(Sx, 2), (Sy, 1)]. When a different client tries to update the object after reading D2, the new vector clock entry for object D4 is [(Sx, 2), (Sz, 1)] where Sz is the node that handled the request. Now when a new write request is issued by the client, it sees that there are already D3 and D4 objects. If node Sx is handling the request, it performs the reconciliation process and the new data object D5 is created with the vector clock [(Sx, 3), (Sy, 1), (Sz, 1)].

One possible issue with vector clocks is that its size may grow rapidly if multiple servers coordinate the writes to a particular object. However, this issue is unlikely since in production only the nodes from the preference list handle the operation. It is still desirable to limit the size of vector clock. DynamoDB implements the following clock truncation scheme: A timestamp, which indicates the last time that node updated an item, is stored along with (node, counter) pair. When the number of (mode, counter) pairs reaches a threshold (say 15), the oldest pair is removed from the clock.

In the next article, we will look into how to Handle Failures.

Article authored by Vijay Olety.

X-Post from CloudAcademy.com

Posted in Technical | Tagged , , , | Leave a comment

DynamoDB: An Inside Look Into NoSQL – Part 4

In Part 3, we mentioned the various distributed techniques used while architecting NoSQL data stores. A table nicely summarized these techniques and their advantages. In this article, we will go into the details of partitioning and replication.


As mentioned earlier, the key design requirement for DynamoDB is to scale incrementally. In order to achieve this, there must be a mechanism in place that dynamically partitions the entire data over a set of storage nodes. DynamoDB employs consistent hashing for this purpose. As per the Wikipedia page, “Consistent hashing is a special kind of hashing such that when a hash table is resized and consistent hashing is used, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped.”

Consistent HashingConsistent Hashing (Credit)

Let’s understand the intuition behind it: Consistent hashing is based on mapping each object to a point on the edge of a circle. The system maps each available storage node to many pseudo-randomly distributed points on the edge of the same circle. To find where an object O should be placed, the system finds the location of that object’s key (which is MD5 hashed) on the edge of the circle; then walks around the circle until falling into the first bucket it encounters. The result is that each bucket contains all the resources located between its point and the previous bucket point. If a bucket becomes unavailable (for example because the computer it resides on is not reachable), then the angles it maps to will be removed. Requests for resources that would have mapped to each of those points now map to the next highest point. Since each bucket is associated with many pseudo-randomly distributed points, the resources that were held by that bucket will now map to many different buckets. The items that mapped to the lost bucket must be redistributed among the remaining ones, but values mapping to other buckets will still do so and do not need to be moved. A similar process occurs when a bucket is added. You can clearly see that with this approach, the additions and deletions of a storage node only affects its immediate neighbors and all the other nodes remain unaffected.

The basic implementation of consistent hashing has its limitations:

  1. Random position assignment of each node leads to non-uniform data and load distribution.
  2. Heterogeneity of the nodes is not taken into account.

To overcome these challenges, DynamoDB uses the concept of virtual nodes. A virtual node looks like a single data storage node but each node is responsible for more than one virtual node. The number of virtual nodes that a single node is responsible for depends on its processing capacity.


Every data item is replicated at N nodes (and NOT virtual nodes). Each key is assigned to a coordinator node, which is responsible for READ and WRITE operation of that particular key. The job of this coordinator node is to ensure that every data item that falls in its range is stored locally and replicated to N - 1 nodes. The list of nodes responsible for storing a certain key is called preference list. To account for node failures, preference list usually contains more than N nodes. In the case of DynamoDB, the default replication factor is 3.

In the next article, we will look into Data Versioning.

Article authored by Vijay Olety.

X-Post from CloudAcademy.com

Posted in Technical | Tagged , , , , , , , | Leave a comment

DynamoDB: An Inside Look Into NoSQL – Part 3

In Part 2, we looked at Design Considerations of NoSQL and introduced the concept of eventual consistency. In this article, we will introduce the concepts and techniques used while architecting a NoSQL system.

The core distributed systems techniques employed in DynamoDB are – partitioning, replication, versioning, membership, failure handling and scaling. Phew! Did you really think that the internals will be simple? 🙂

DynamoDB Internals

DynamoDB (Credit)

The following table summarizes the list of techniques used in DynamoDB:

Problem Technique Advantage
Partioning Consistent Hashing Incremental Scalability
High Availability for Writes Vector Clocks with reconciliation during reads Version size is decoupled from update rates
Handling temporary failures Sloppy Quorum and Hinted Handoff Provides high availability & durability guarantee when some replicas are not available
Recovery from permanent failures Anti-entropy using Merkle trees Synchronizes divergent replicas in the background
Membership & Failure Detection Gossip-based membership protocol & failure detection Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

I know I have covered a lot of lingo and buzz words. If you really need to take a deep breath, now is the time! Fear not, in subsequent articles we will deep-dive into each of the above mentioned techniques. Remember, the devil is in the details!

System Interface

DynamoDB exposes two interfaces: get and put. The get(key) operation locates the object associated with the key and returns it along with a context. The put(key, context, object) operation uses key to determine the associated replicas and writes the object in those replicas. The context information is invisible to the user and contains metadata such as version. DynamoDB applies an MD5 hash on the key to generate a 128-bit identifier, which is used to determine the replicas that are responsible for serving the key.

Article authored by Vijay Olety

X-Post from CloudAcademy.com

Posted in Technical | Tagged , , , , , | Leave a comment

DynamoDB: An Inside Look Into NoSQL – Part 2

In Part 1, we introduced you to NoSQL, spoke about the CAP theorem and certain assumptions that need to be made while designing NoSQL data stores. Let’s dive deeper!

Design Considerations

Traditional commercial systems and applications perform data replication in a synchronized manner. The advantage of this approach is that data is always consistent. But the down side is that the system might itself not be available (CAP theorem). To put it simple: the data is unavailable until it is absolutely certain that it is replicated across all nodes correctly.

Alas! The Web world lives in its own perceived reality. 🙂 Systems go down and network fails regularly. Availability is the single largest factor which makes / breaks a company. It is thus imperative that we handle such scenarios. Netflix’s Chaos Monkey helps us architect our product to take into account these failures. In order to ensure availability at all costs, optimistic asynchronous replication strategies can be put in place. The drawback, however, is that it leads to conflicting changes to data which must be detected and resolved. The process of conflict resolution introduces 2 new problems: when to resolve them and who resolves them. DynamoDB introduces the novel concept of eventually consistent data store; that is all updates reach all nodes eventually.

Deciding when to perform the conflict resolution is a primary design consideration. We can perform it during the READ operation or WRITE operation. Many legacy data stores chose to do the conflict resolution during the WRITE operation. In such systems, WRITEs will be rejected if data is not replicated across all nodes. In large e-commerce companies such as Amazon, rejecting WRITEs is not an option as it leads to revenue loss and poor customer experience. Hence, DynamoDB does the complex conflict resolution during READs.

Let’s take an example to understand it better. Consider a system with 3 nodes: NODE1, NODE2 and NODE3. In a traditional system, a WRITE to NODE2 must be replicated to NODE1 and NODE3 and only then is the WRITE operation considered successful. This synchronized replication takes time to complete during which time the system is NOT available. But systems using DynamoDB have the option to defer this update in exchange for higher availability. So a WRITE to NODE2 is considered successful as long as NODE2 is able to honor that request. NODE2 eventually replicates it to NODE1 and NODE3. DynamoDB usually takes a second (or a maximum of a couple of seconds) to achieve consistency across all nodes.
Note: In case your product, like ours, needs a strongly consistent read just set the value of the attribute ConsistentRead to true.

Another very important design consideration is who performs the conflict resolution. It can either be done by the data store (DynamoDB in our case) or the application. The data store usually employs simple policies and rules such as “last WRITE wins”, which is pretty good in majority of the cases. If the application wishes to have complex rules and implement its own conflict resolution mechanisms, then it is free to do so.

A couple of other design considerations are as follows:

  1. Incremental Scalability: The data store should be able to scale-out 1 node at a time, with minimal or no impact on the system itself.
  2. Symmetry: All nodes in the data store are peers, i.e. all nodes are equal and share the same set of responsibilities.
  3. Decentralization: With a central authority, the most common problem faced is “single point of failure”. Decentralization helps us mitigate this and keep the system simple, more scalable and more available.
  4. Heterogeneity: Different nodes in the data store might have different configurations. Some nodes might be optimized for storage and some might be plain commodity hardware. The data store should take into account this heterogeneous mix of nodes to distribute tasks proportional to its capabilities.

In the next blog post, we will look into System Architecture.

Article authored by Vijay Olety

X-Post from CloudAcademy.com

Posted in Technical | Tagged , , , | Leave a comment

DynamoDB: An Inside Look Into NoSQL – Part 1

In our earlier posts (here and here), we introduced the Hadoop ecosystem & explained its various components using a real world example of the retail industry. We now possess a fair idea of the advantages of Big Data. NoSQL datastores are being used extensively in real-time & Big Data applications, which is why a look into its internals would help us make better design decisions in our applications.

NoSQL datastores provide a mechanism for retrieval and storage of data items that is modeled in a non-tabular manner. Simplicity of design, horizontal scalability and control over availability form the motivations for this approach. NoSQL is governed by the CAP theorem in the same way RDBMS is governed by ACID properties.

NoSQL Triangle (Credit)

From the AWS stable, DynamoDB is the perfect choice if you are looking for a NoSQL solution. DynamoDB is a “fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile and many other applications.” Since it is a fully managed service, you need not worry about provisioning & managing of the underlying infrastructure. All the heavy-lifting is taken care for you.

Majority of the documentation available on the Net are how-to-get-started guides with examples of DynamoDB API usage. Let’s look at the thought process and design strategies that went into the making of DynamoDB.

“DynamoDB uses a synthesis of well known techniques to achieve scalability and availability: Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. DynamoDB employs a gossip based distributed failure detection and membership protocol. Dynamo is a completely decentralized system with minimal need for manual administration. Storage nodes can be added and removed from DynamoDB without requiring any manual partitioning or redistribution.” You must be wondering – “Too much lingo for one paragraph”. Fret not, why fear when I am here 🙂 Let’s take one step at a time, shall we!

Requirements and Assumptions

This class of NoSQL storage system has the following requirements –

  • Query Model: A “key” uniquely identifies a data item. Read and write operations are performed on this data item. It must be noted that no operation spans across multiple data items. There is no need for relational schema and DynamoDB works best when a single data item is less than 1MB.
  • ACID Properties: As mentioned earlier, there is no need for relational schema and hence ACID (Atomicity, Consistency, Isolation, Durability) properties are not required. The industry and the academia acknowledge that ACID guarantees lead to poor availability. Dynamo targets applications that operate with weaker consistency if it results in high availability.
  • Efficiency: DynamoDB needs to run on commodity hardware infrastructure. Stringent SLA (Service Level Agreement) ensure that latency and throughput requirements are met for the 99.9% percentile of the distribution. But everything has a catch – the tradeoffs consist of performance, cost, availability and durability guarantees.

In subsequent articles, we will look into Design Considerations & System Architecture.

Article authored by Vijay Olety


  1. http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

X-Post from CloudAcademy.com

Posted in Technical | Tagged , , , | Leave a comment