Introduction to Table Storage in Windows and SQL Azure

In addition to SQL Azure, Windows Azure features four persistent storage forms Tables, Queues , Blobs and Drives. In this article we will focus on Tables.

Table Storage

Tables are a very interesting new storage method offered in Windows Azure and are Microsoft’s Azure answer to Amazon SimpleDB..  The SQL Azure database offers a wealth of features a modern database might be expected to provide. However for many purposes it is overkill. In the past almost any structured data had to go into the database and incur the performance penalty that entailed. With Azure Tables data which has a relatively simple structure (below we use the example of a Movies data, which is a listing of Movies with different attributes such as title, category, date etc but is not a very complex dataset).

Tables offer structured data storage for data that has relatively simple relationships. Data is stored in rows and as tables are less structured and don’t have the overhead of a full database it is massively scalable and offers very high performance.  The interface to Azure Tables is the familiar .NET suite of classes, LINQ, and REST.

To create an Azure Table first create a storage service in the Windows Azure Developer Portal, then create a storage account and from create tables. Each table is scoped to its storage account so different tables with the same name can be used but scoped to different storage accounts.

Table Data Model

Tables are composed of rows and columns. For the purposes of the Azure Table Data Model, rows are entities and columns are properties. For an Entity a set of Properties can be defined but several properties are mandatory – PartitionKey, RowKey and TimeStamp. PartitionKey and RowKey can be thought of as a clustered index which uniquely identifies an entity and defines the sort order. TimeStamp is a read-only property.

Partitions

Table partitions can be thought of as units of scale within Windows Azure which are used for load balancing. Tables are partitioned based on the PartitionKey, all entities on the same PartitionKey will be served by a single server. Therefore selection of an appropriate PartitionKey is central to achieving scalability and higher throughput on Windows Azure. It is important to note that Azure implements throttling of an account when the resource utilization is very high, appropriate partitioning greatly reduces the potential for this happening by allowing the load to be distributed over different servers. The RowKey provides uniqueness within a single partition.
Partitions can be thought of as a higher level categories for the data with RowKeys are lower level data details. For example, for a ‘Movies’ table the PartitionKey could be the category of the movie such as comedy or sci-fi, RowKey could be the movie title (hopefully the combination of category and title would ensure uniqueness). Under load the table cold be split onto different servers  based on the category.
For a write intensive scenario such as logging the PartitionKey would normally be a timestamp. In this instance there is a problem in partitioning as the write will also append to the bottom of the Table and partitioning based on a range will not be efficient as the final partition will always be the only active partition. The recommended solution to this is to add a prefix to the timestamp to ensure that the latest write operations are sent to different partitions.

In database design, tables should be split based on the data type. For example in an retailer’ s database, data of the type ‘customer’ with fields such as ‘name’, ‘address’ etc should be in a separate table to the ‘orders’ which only contains data on orders such as ‘product’,  ‘order_date’ etc. However in Azure Tables these could both be efficiently stored in the same table as no space would be taken up by the empty fields (such as ‘order_date’ for a ‘customer’). To differentiate between the two types of data a ‘Kind’ property (column) can be added to each entity (row) which is in effect the table name if they were separated into two tables.

Table Operations

The operations are relatively similar to those of a conventional database – tables (which are analogous to the database) can be created, queried and deleted. Entities (rows) can have insert operations performed, delete operations, queries, and updated. There are two methods of update – Merge and Replace. Merge allows a partial update of the entity, thus if some of the properties of the entity are not given with the update they would not be updated (only the properties provided in the update are updated). Replace updates all the properties of an entity, if a property is not provided in the update it is removed from the entity. A newly introduced feature is Entity Group Transaction which is a transaction over a single partition.

Continuation Tokens

When a single entity(row is queried) the result is returned as with a database query. However when a range is requested Azure Tables can only return 1000 rows in a result set. If the result set is less than 1000 rows that result set is returned, if the result set is larger than 999 , the first 1000 rows of the result set are return together with a Continuation Token.  The Table is then re-queried with the Continuation Token passed back to the Table until the query completes.
Continuation Tokens are returned for all results where the   results is greater than one. They will also be returned if a query takes longer than 5 seconds (this is the maximum allowed by Azure after which the results are returned with a continuation token and the query must be rerun). Furthermore, continuation tokens are returned when the end of a partition range boundary is hit.

Optimizing Queries

Querying a table with a range is a very serial process, with result sets being sent to the client and continuation tokens being sent back for processing until the query completes. This structure doe not allow for any parallel processing. To take advantage of parallel processing the query should be split into ranges based on the PartitionKey, for example instead of

[cc lang='sql' ]Select * from Movies where Rating > 4[/cc]

Use

[cc lang='sql' ]

Select * from Movies where PartitionKey  >= ‘A’ and PartitionKey < ‘D’ and Rating > 4
Select * from Movies where PartitionKey  >= ‘D’ and PartitionKey < ‘G’ and Rating > 4

[/cc]

This enables the query to run in a parallel manner.

Similar to SQL Server, views can also be created to handle popular queries.

Be careful using ‘OR’ in queries. SQL Azure Tables do not do any optimization on these queries. It is optimal to split the query into several separate queries.

Entity Group Transactions

EGT’s offer transaction-like operations on an Azure table. Up to 100 insert/update/delete commands can be performed in a single transaction provided the payload is under 4MB.

Leave a Comment