Issues with Weaker Isolation Levels

Ishan Mishra
7 min read · May 30, 2021

In the last few posts, we read about the two weak isolation levels: read committed and snapshot isolation. Although they seem to solve the issues that can occur in a production system, there are still systems that prefer Serializable isolation. In this post, we will see what issues might arise even after using snapshot isolation.

Example Scenario

Consider the console example from earlier in the series. To recap, there are a limited number of PS5 consoles available but multiple potential buyers. Two users, Rahul and Sanjay, are trying to buy consoles. They both have made the payment, and the post-payment transaction consists of updating three tables in the database:

  1. Inventory table: decrement the count of units left.
  2. Order table: create an entry for the placed order.
  3. Invoice table: create an invoice for the successful buyer.

Suppose our database is running under snapshot isolation; that is, readers will not be blocked by writers. Suppose both Rahul and Sanjay reach the first step concurrently. They both will try to execute the following two queries:

select no_of_units_left from inventory_count where unit_id = 1234; -- we get x (say)

update inventory_count set no_of_units_left = x - 1 where unit_id = 1234;

Suppose no_of_units_left = 5 when this scenario happens, and the queries execute in the following order:

Rahul(T42): select no_of_units_left from inventory_count where unit_id = 1234;

> returns 5

Rahul(T42): update inventory_count set no_of_units_left = 4 where unit_id = 1234;

(current no_of_units_left = 4, but the previous version of this row, which states no_of_units_left = 5, is kept) **

Sanjay(T43): select no_of_units_left from inventory_count where unit_id = 1234;

> returns 5 (since T42 is not yet committed, we will return the old version).**

Sanjay(T43): update inventory_count set no_of_units_left = 4 where unit_id = 1234;

…..

Further steps in the transaction, corresponding to ordering and invoicing

……

T42: commit

T43: commit

**: T42 and T43 each read from a consistent snapshot of the database taken when they started, before either transaction committed. Hence, they both read no_of_units_left = 5.

Thus, we see an anomaly: the final no_of_units_left should have been 3 (two units were sold), but it wrongly shows 4.

This issue is called the lost-update problem: one of the two updates is lost. It generally happens in transactions with a read-modify-write cycle. Other examples of this issue include two concurrent writers updating the same account balance, two concurrent users editing the same wiki page, etc.
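To make the shape of the hazard concrete, here is the generic unsafe read-modify-write pattern for the account-balance case (the accounts table and its columns are hypothetical, purely for illustration):

-- Unsafe read-modify-write: two concurrent transactions can both read the
-- same balance, and one of the two increments will be lost.
select balance from accounts where account_id = 42;  -- application reads, say, 100
-- application computes 100 + 10 = 110 in its own memory, then writes it back:
update accounts set balance = 110 where account_id = 42;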

Lost Update Issue and Its Solutions

This issue occurred because read committed and snapshot isolation focus on unblocking readers and ensuring that readers behave correctly in the presence of concurrent writes. However, they never address what should happen when two transactions concurrently write based on what they read.

There are several ways to solve the lost-update problem (a SQL sketch of the first four follows the list):

  1. Atomic write operations: Many databases offer atomic update operations, such as incrementing and decrementing counters. To check which operations your database executes atomically, consult its documentation.
  2. Explicit locking: The application can take an explicit lock on the object when starting a read-modify-write cycle. Then no other transaction can lock or update this object until the executing read-modify-write cycle finishes.
  3. Compare-and-set: These operations ensure that in a read-modify-write cycle, the update is applied only if the value has not changed since it was read.
  4. Automatically detecting lost updates: Many databases do nothing special up front; they just check, while updating or committing, that a lost update did not occur in this transaction. How? In PostgreSQL's snapshot isolation (repeatable read) mode, when a transaction updates a row, the database checks whether the row's current version is the same one the transaction read at the start. If it has changed, the transaction is aborted.
  5. Conflict resolution and replication: The above four points apply to a single-node database. In the case of replicated databases operating in multi-leader or leaderless mode, we can run into additional issues. For example, leader l1 successfully commits value v1 for an object o1, while leader l2 concurrently commits value v2 for the same object o1. Databases deal with this by allowing multiple conflicting versions (called siblings) to remain valid on different nodes; when the versions from different nodes are brought together, they are merged using application code or special data structures.
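As a rough illustration, here is how the first four options could look for our inventory example. The schema follows the queries above; the FOR UPDATE syntax is PostgreSQL/MySQL-style, and details vary by database, so treat this as a sketch rather than a recipe:

-- 1. Atomic write: let the database do the read-modify-write in one
--    statement; concurrent decrements are serialized by the database.
update inventory_count
set no_of_units_left = no_of_units_left - 1
where unit_id = 1234;

-- 2. Explicit locking: FOR UPDATE locks the row until the transaction ends,
--    so a concurrent transaction blocks at its own SELECT ... FOR UPDATE.
begin;
select no_of_units_left from inventory_count
where unit_id = 1234
for update;
-- the application computes the new value (say 4), then:
update inventory_count set no_of_units_left = 4 where unit_id = 1234;
commit;

-- 3. Compare-and-set: apply the update only if the value is still what we
--    read; if zero rows are affected, retry the whole cycle. (Whether the
--    WHERE clause sees the latest committed value or an old snapshot depends
--    on the database, so check the documentation before relying on this.)
update inventory_count
set no_of_units_left = 4
where unit_id = 1234 and no_of_units_left = 5;

-- 4. Automatic detection: under PostgreSQL's repeatable read mode, the second
--    of two concurrent updaters of this row is aborted with an error like
--    "could not serialize access due to concurrent update".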

Write Skew and Phantoms

So far, we have seen problems that occur when concurrent transactions write to the same object. We solved the case of reading such concurrent writes (via the read committed and snapshot isolation levels). Then we saw how two concurrent writes to the same object introduce the lost-update problem (caused by the read-modify-write cycle), and we solved that in various ways (atomic updates, explicit locking, automatic detection using MVCC, compare-and-set, and application code). Now let us look at a more subtle race condition among concurrent writes: suppose two concurrent transactions each perform a read-modify-write cycle, but they read the same object and write to two different objects.

Example Scenario

Going back to our PS5 console example, suppose there is exactly one console left for unit_id = 1234. When Rahul and Sanjay enter step 1 of the transaction, they both read the no_of_units_left field for unit_id = 1234 as 1. Since they both have seen that unit_id = 1234 is available, they both try to book it. Let us go into the second step of the transaction:

Rahul(T42): select min(inventory_id) from inventory where unit_id = 1234;
> returns ps5_2123
Sanjay(T43): select min(inventory_id) from inventory where unit_id = 1234;
> returns ps5_2123

Rahul(T42): insert into order (order_id, buyer, inventory_id) values (1000, 'Rahul', 'ps5_2123');
Rahul(T42): update inventory_count set no_of_units_left = no_of_units_left - 1 where unit_id = 1234; -- atomic operation

Sanjay(T43): insert into order (order_id, buyer, inventory_id) values (1001, 'Sanjay', 'ps5_2123'); **

Rahul(T42): update invoice set user = 'Rahul' where inventory_id = 'ps5_2123';

Rahul(T42): commit

Thus, we have two orders corresponding to the same inventory_id = ps5_2123 (**: provided that T43 also commits). This issue is called write skew (read skew meant non-repeatable reads; write skew can be thought of as its write-side counterpart). These anomalies arise only under very specific concurrency interleavings, and they are hard to reproduce in production environments. Note that the lost-update problem is just a special case of write skew.

Another point to note is that we are using snapshot isolation here. Under that isolation level, had the two writers been writing to the same object, one of them would have been blocked and eventually aborted. But they are writing to two different objects (in fact, they are creating two new objects based on a premise). Hence, snapshot isolation does not solve this problem.

What actually occurred is that both transactions (T42 and T43) concurrently read no_of_units_left for unit_id = 1234 as 1. They both acted on a premise (that the number of units left for unit_id = 1234 was greater than 0) and moved forward based on it. But as soon as one of the writes happened (once T42 wrote its order and decremented the count), the premise no longer held for the other transaction.

How to Solve Write-Skews

We can use explicit locking to solve write skew to some extent (for example, we may take an explicit lock on the rows for unit_id = 1234 when they are read). But it is not always feasible. Consider the example of claiming a username. In that case, we need to write to the database on the condition that no user exists with the submitted name. That is, we first perform a read query (say, for claiming the username 'ishan'):

select count(*) from user where username = 'ishan';

Based on this result, we proceed if count(*) is 0 and abort if it is 1. Here, we cannot prevent two concurrent users from taking the username 'ishan' by explicitly locking the object, since the object does not exist yet and there is nothing to lock.
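For the inventory example, by contrast, the rows our premise depends on do exist, so explicit locking can work. A minimal sketch of that transaction (PostgreSQL-style FOR UPDATE; a hypothetical compressed version of the steps above):

begin;
-- Lock the row the premise depends on. A concurrent transaction blocks at
-- this point until we commit; depending on the isolation level, it then
-- either sees the updated count (read committed) or aborts with a
-- serialization error (repeatable read in PostgreSQL).
select no_of_units_left from inventory_count
where unit_id = 1234
for update;

-- Proceed only if no_of_units_left > 0:
update inventory_count set no_of_units_left = no_of_units_left - 1 where unit_id = 1234;
-- ... insert the order and invoice rows ...
commit;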

We can materialize the conflict (so to speak) in order to avoid this problem. For example, we can first create a row corresponding to the username being claimed and then lock it, as in the sketch below. However, it is not a very clean approach; instead, we should opt for the Serializable isolation level.
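A minimal sketch of what that could look like (the claimed_usernames table is hypothetical; making the username the primary key means the inserted row itself acts as the lock, so one of two concurrent claimers fails with a duplicate-key error):

create table claimed_usernames (
    username varchar(64) primary key
);

begin;
-- The insert materializes the conflict: if two transactions try to claim
-- 'ishan' concurrently, the primary-key constraint aborts one of them.
insert into claimed_usernames (username) values ('ishan');
-- ... create the actual user record ...
commit;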

What is a Phantom?

In the write-skew problem, we noticed a pattern. We perform a read query, and this read acts as a premise (a pre-condition) for a subsequent write; if the premise fails, we abort. Suppose two concurrent readers (Rahul and Sanjay) query the database and see that units are left. Both find the premise suitable for writing, so they both go on to write. But as soon as one write occurs (say Rahul's transaction takes the last unit), the premise under which the other transaction is proceeding no longer holds. Had the two transactions run serially, the second one would not have been able to proceed. This is called the phantom effect: a write in one transaction changes the result of a search query in another transaction.
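A sketch of the effect on our example (using a table name orders here, rather than the order table above, purely to avoid the SQL reserved word):

-- T42 (Rahul): the search query that forms the premise
select count(*) from orders where inventory_id = 'ps5_2123';  -- returns 0

-- T43 (Sanjay): same search, same premise
select count(*) from orders where inventory_id = 'ps5_2123';  -- returns 0

-- T42 acts on its premise and commits:
insert into orders (order_id, buyer, inventory_id)
values (1000, 'Rahul', 'ps5_2123');
commit;

-- T43's premise is now stale: re-running its search would return 1.
-- The row inserted by T42 is a "phantom" from T43's point of view.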

Conclusion

In this post, we read about how the weaker transaction isolation levels are unable to solve all the concurrency issues that may occur during a transaction's life cycle. We saw two very popular such issues: lost updates and write skew. We also saw that explicit locking may help with these issues, but it bears a significant performance cost. Thus, our entire discussion ended with a pointer toward Serializable isolation. In the next post, we will read about Serializable isolation and how databases may implement it.

For a deeper understanding of the subject matter, refer to the book mentioned in the bibliography. Note that this post is part of the series on Transactions:

  1. Database Transaction and Isolation Levels
  2. Read Committed Isolation Level
  3. Snapshot Isolation Level
  4. Issues with Weaker Isolation Levels
  5. Serializable Isolation

Bibliography

Kleppmann, M., 2017. Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. O’Reilly Media, Inc.
