As I mentioned in my last blog post, I’m doing some work to bring the usage model of mongoose to .NET. As part of this I’m exploring implementation details of various NoSQL libraries in order to map CRUD-style semantics of each implementation onto a common higher-order model. I’m starting first with Azure Table Storage.
Each implementation has its quirks, but most of this work is reasonably straightforward. One aspect that did prove challenging and interesting (for Azure Tables anyway) is how to efficiently support deletion of many records at once.
Here’s the problem, in a nutshell. Azure Tables don’t support the SQL equivalent of this:
delete * from MyTable where ...
That is, there’s no way to specify a pattern to match entities against, and then delete those entities all at once. If this were possible, it might look something like this:
// THIS WILL NOT WORK private static void DeleteAllEntitiesEpicFail(CloudTable table) { var entityPattern = new DynamicTableEntity(); // here we specify only PartitionKey, but omit RowKey // we're hoping Azure will do a pattern match // for all entities in our table that have this PartitionKey, and delete them entityPattern.PartitionKey = "someKey"; entityPattern.ETag = "*"; table.Execute(TableOperation.Delete(entityPattern)); // http://www.sadtrombone.com }
In Azure tables, entity uniqueness in a table is determined by the combination of PartitionKey (which physically groups related entities together for efficiency) and RowKey (a unique key per-partition). So in the above (non-functional) code, we’re attempting to set PartitionKey and omit RowKey, assuming (hoping!) that Azure will find all entities with the given PartitionKey and delete them for us. Again, this doesn’t work. Entity delete operations against Azure tables require that both PartitionKey and RowKey be specified.
So an obvious solution would be to first retrieve the set of entities we want to delete, then delete them. Something like this:
private static void DeleteAllEntitiesOneAtATime(CloudTable table, Expression<Func<DynamicTableEntity, bool>> filter) { Action<IEnumerable<DynamicTableEntity>> processor = entities => { foreach (var entity in entities) { table.Execute(TableOperation.Delete(entity)); } }; ProcessEntities(table, processor, filter); }
private static void ProcessEntities(CloudTable table, Action<IEnumerable<DynamicTableEntity>> processor, Expression<Func<DynamicTableEntity, bool>> filter) { TableQuerySegment<DynamicTableEntity> segment = null; while (segment == null || segment.ContinuationToken != null) { if (filter == null) { segment = table.ExecuteQuerySegmented(new TableQuery().Take(100), segment == null ? null : segment.ContinuationToken); } else { var query = table.CreateQuery<DynamicTableEntity>().Where(filter).Take(100).AsTableQuery(); segment = query.ExecuteSegmented(segment == null ? null : segment.ContinuationToken); } processor(segment.Results); } }
This solution “works”, but it suffers from a few problems:
- Several round trips to the Azure storage service are required merely to obtain the list of entities to delete. This is very inefficient. We could play around with the number of entities returned per “chunk” (via the Take() call) but the fundamental problem remains.
- We execute another round trip per entity to delete each one. This performs very poorly for any real-world use of Azure tables.
- The ‘filter’ argument is scary looking and (at least with Azure tables) ultimately boils down to additional, potentially inefficient, work on the server to filter results before returning them.
Fortunately we have some better options to consider.
Batch Entity Deletes
As you would expect Azure tables support batch operations, so the first option to consider is batching delete operations together to reduce round trips to the cloud and back.
private static void DeleteAllEntitiesInBatches(CloudTable table, Expression<Func<DynamicTableEntity, bool>> filter) { Action<IEnumerable<DynamicTableEntity>> processor = entities => { var batches = new Dictionary<string, TableBatchOperation>(); foreach (var entity in entities) { TableBatchOperation batch = null; if (batches.TryGetValue(entity.PartitionKey, out batch) == false) { batches[entity.PartitionKey] = batch = new TableBatchOperation(); } batch.Add(TableOperation.Delete(entity)); if (batch.Count == 100) { table.ExecuteBatch(batch); batches[entity.PartitionKey] = new TableBatchOperation(); } } foreach (var batch in batches.Values) { if (batch.Count > 0) { table.ExecuteBatch(batch); } } }; ProcessEntities(table, processor, filter); }
This is better, but still not great. It improves performance by reducing round trips for the deletes, but we still have a lot happening up-front to gather our list of entities. This might work okay if deletes are relatively infrequent, but hitting this code more than occasionally, against a potentially large set of entities, is likely to yield a substandard experience.
Table-Per-Entity + Table Delete
So far we’ve made an implicit assumption in our design… delete operations accept a filter, so we’re assuming that we might want to delete a subset of all entities in a table, for any given top-level delete operation. This is a reasonable assumption for many scenarios, particularly if you’re approaching the problem from a “SQL mindset” where such behavior is relatively simple to design. But the benefits of NoSQL come with some responsibilities as well, one of which is that you must think deliberately about the data operations you need to support, and the relatively frequency and expense of each. Then you design your application to optimize the cases you care most about.
In the case of my NoSQL wrapper library, I care about delete operations but I really don’t care about filtering them discretely. The API I’ll expose allows either very discrete deletes (“delete this specific entity and nothing else”) or very broad deletes (“delete all entities of this type”). I don’t need to support anything else, at least initially. So knowing this, and having a better understanding of what it means under the covers to support filtered deletes, I can make an explicit design choice to forego support for filtered deletes and optimize for my scenario.
It turns out it’s very easy to delete an Azure table itself, data and all:
private static void DeleteEntitiesByDeletingTheTable(CloudTable table) { table.Delete(); }
This is a bit of a foreign concept to those with a strong SQL background. We don’t typically think of tables as transient objects. But in Azure it’s perfectly reasonable.
So I changed the internal design of my Azure table wrapper to ensure all instances of a given entity are stored in a single table, which allows me to use the “table delete” pattern and easily and efficiently support the delete scenarios I care most about.
Logical Entity Deletes + Offline Instance Cleanup
There’s one other scenario worth mentioning; it didn’t fit my needs but it might fit yours. If you find that physical deletes are undesirable (perhaps due to historical reporting or auditing support needs) you could still consider a per-entity-type flag to indicate the deletion of all instance of a given entity without doing an actual, physical delete. A few caveats with this approach:
- Lookup queries will be complicated by extra logic to check logical delete flags
- You’ll likely need some kind of offline, background process to occasionally cull logically deleted records from your tables; Azure Worker Roles and/or the Azure Scheduler are great for this
- Be careful that you don’t resort to a per-entity-instance logical delete flag, or you’ll ultimately suffer a similar fate as the “delete one instance at a time” scenario from above
Guidance
The important thing to keep in mind is that each situation is different; there is no one-size-fits-all solution for very large NoSQL databases. As I mentioned, the relatively rich feature sets and relatively modest data size of most traditional SQL databases masked many hard architecture and design problems that otherwise require a more deliberate, thoughtful approach in the world of Azure. All that said, there is some broad guidance to consider.
Infrequent, Batch Deletes
If you’ll mostly perform relatively infrequent deletes in large batches, then consider structuring your Azure tables in a table-per-entity-type fashion and using the “table delete” option described above. You’ll need to check for the existence of the table each time you attempt to use it, but that’s likely a small price to pay for a fast, efficient solution.
Infrequent, Discrete Deletes
If you’ll mostly perform discrete deletes (meaning, one or at most a handful of entities at a time), but do so infrequently, then consider the batch option described above. Performance will likely suffer a bit with very large data sets, but since deletes are infrequent the pain won’t be felt very often, relative to other operations.
Frequent, Batch Deletes
If you’ll frequently perform deletes in large batches, you have a choice of either using the “table-per-entity-type + table delete” model, the logical delete equivalent of that, or (if you can’t devote a table-per-type) the batch delete model. Again, batching using TableBatchOperation isn’t likely to perform very well when used often and for large data sets. This may be an opportunity for you to consider whether you even need to support deletes at all?
Frequent, Discrete Deletes
In this final case there are frankly no good options for Azure tables. You’re likely to end up using something similar to the batch model above (using TableBatchOperation) which again won’t yield a pleasant experience for most real-world situations. One possibility to consider here is whether you truly need a NoSQL solution like Azure tables, or whether a relational database (which tends to excel in these scenarios, at least up to a certain data set size) would be better suited to your needs.