[AWS] DynamoDB: Designing Partition Keys to Distribute Your Workload Evenly

Read: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-uniform-load.html

https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/

DynamoDB stores and retrieves each item based on the primary key value, which must be unique. Items are distributed across 10-GB storage units, called partitions (physical storage internal to DynamoDB). Each table has one or more partitions, as shown in the following illustration.

DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition in which the item is stored. Each item’s location is determined by the hash value of its partition key.

All items with the same partition key are stored together, and for composite partition keys, are ordered by the sort key value. DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB.

Keynote:

  • A poor PK causes "hot" partitions that result in throttling (if exceed capabilities for I/O, otherwise not).
  • More distinct PK values that your workload accesses, the MORE those requests will be spread across the partitioned space, so that you will use provisioned throughput more eficiently.

Partition keys and request throttling

DynamoDB evenly distributes provisioned throughput—read capacity units (RCUs) and write capacity units (WCUs)—among partitions and automatically supports your access patterns using the throughput you have provisioned. However, if your access pattern  exceeds 3000 RCU or 1000 WCU for a single partition key value, your requests might be throttled with a ProvisionedThroughputExceededException error.

Reading or writing above the limit can be caused by these issues:

  • Uneven distribution of data due to the wrong choice of partition key
  • Frequent access of the same key in a partition (the most popular item, also known as a hot key)
  • A request rate greater than the provisioned throughput
Partition key valueUniformity

User ID, where the application has many users.

Good

Status code, where there are only a few possible status codes. Bad
Item creation date, rounded to the nearest time period (for example, day, hour, or minute). Bad
Device ID, where each device accesses data at relatively similar intervals. Good
Device ID, where even if there are many devices being tracked, one is by far more popular than all the others. Bad

Seeing last tow rows. both are DeviceID,

  • if each device are being accessed evenlly, then it is a good partition key
  • if one device is more popular than the rest, then it is a bad patition key

How PK effect performance?

The partition key represents the item's creation date, rounded to the nearest day. The sort key is an item identifier. On a given day, say 2014-07-09, all of the new items are written to that single partition key value (and corresponding physical partition). -- So it is not spread evenlly.

Recommendations for partition keys

Use high-cardinality attributes

Using unqie id.

Cache the popular items

Using DAX

Add random numbers or digits from a predetermined range for write-heavy use cases

Suppose that you expect a large volume of writes for a partition key (for example, greater than 1000 1 K writes per second). In this case, use an additional prefix or suffix (a fixed number from predetermined range, say 1–10) and add it to the partition key.

Following is the recommended table layout for this scenario:

  • Partition key: Add a random suffix (1–10 or 1–100) with the InvoiceNumber, depending on the number of transactions per InvoiceNumber. For example, assume that a single InvoiceNumber contains up to 50,000 1K items and that you expect 5000 writes per second. In this case, you can use the following formula to estimate the suffix range: (Number of writes per second * (roundup (item size in KB),0)* 1KB ) /1000). Using this formula requires a minimum of five partitions to distribute writes, and hence you might want to set the range as 1-5.
  • Sort key: ClientTransactionid
  • Partition Key Sort Key Attribute1
    InvoiceNumber+Randomsuffix ClientTransactionid Invoice_Date
    121212-1 Client1_trans1 2016-05-17 01.36.45
    121212-1 Client1-trans2 2016-05-18 01.36.30
    121212-2 Client2_trans1 2016-06-15 01.36.20
    121212-2 Client2_trans2 2016-07-1 01.36.15

Because we have a random number appended to our partition key (1–5), we need to query the table five times for a given InvoiceNumber. Our partition key could be 121212-[1-5], so we need to query where partition key is 121212-1 and ClientTransactionid begins_with Client1. We need to repeat this for 121212-2, on up to 121212-5 and then merge the results.

Using Random number as PK

For example, consider the following schema layout of an InvoiceTransaction table. It has a header row for each invoice and contains attributes such as total amount due and transaction_country, which are unique for each invoice. Assuming we need to find the list of invoices issued for each transaction country, we can create a global secondary index with partition_key as trans_country. However, this approach leads to a hot key write scenario, because the number of invoices per country are unevenly distributed.

Table

Partition Key

Table

Sort Key

Attribute1

Attribute2

GSI

Partition_Key

Attribute3

GSI

Sort Key

Attribute4 Attribute5
InvoiceNumber Sort_key attribute Invoice_Date Random prefix range Trans_country Amount_Due Currency
121212 head 2018-05-17 T1 Random (1-N) USA 10000 USD
121213 head 2018-04-1 T2 Random (1-N) USA 500000 USD
121214 head 2018-04-1 T2 Random (1-N) FRA 500000 EUR

Following is the global secondary index (GSI) for the preceding scenario.

GSI

Partition Key

GSI

Sort Key

Trans_country

Projected Attributes
(Random range) Trans_country

Invoice_Number

Other Data attributes
1-N USA 121212  
1-N USA 121213  
1-N FRA 121214  

In the preceding example, you might want to identify the list of invoice numbers associated with the USA. In this case, you can issue a query to the global secondary index with partition_key = (1-N) and trans_country = USA.

Antipatterns for partition keys

Use sequences or unique IDs generated by the DB engine: 

 You cannot use TranscationID for any query purposes. So you lose the ability to use the partition key to perform a fast lookup of data.

Partition key Attribute1 Attribute2
TransactionID OrderID Order_Date
1111111 Customer1-1 2016-05-17 01.36.45
1111112 Customer1-2 2016-05-18 01.36.30
1111113 Customer2-1 2016-05-18 01.36.30

GSI support eventual consistency only,  with additional costs for reads and writes. Because normally you want to search by OrderID, generated id doesn't hold any meanings.

原文地址:https://www.cnblogs.com/Answer1215/p/14785036.html