Before we begin with lightweight transactions, let's discuss what the problem is. Those of you who are familiar with Scylla know that data modification statements in Scylla do not return a result set, so you don't know whether there was a row or there wasn't a row that was updated. The reason for this is that Scylla is built for high write availability: high throughput of writes and high availability of the cluster. You can write at any node; any node will accept your write and will try to perform it quickly.

Here I tried to draw the write path of a mutation in Scylla. It goes to the coordinator and then to the replicas. This is not a very fair picture, because I don't assume a shard-aware driver, and with one the write most often goes straight to the first replica, but let's assume it goes to a replica. A write on a replica doesn't really read anything, and that's the important part. It usually just appends the write to the memtable and to the commitlog, which is very quick because the commitlog is batched, so you flush lots of writes to disk at once. This is how we achieve high write throughput.

The underlying data structure for that is the log-structured merge-tree, which supports this write scenario. Log-structured merge-trees are not good for reads: you can assume that they are a hundred times slower for reads than they are for writes. There are some special cases when they can be faster than that, but in the general case reads can be a hundred times slower or even more. So this is why writes are fast and why writes do not return a result set.

The other reason why writes do not return a result set is client-side timestamps. The eventual consistency model assumes that you can assign the timestamp on the client, and the history of mutations is built eventually: you merge all of the mutations from all of the replicas, since a concurrent update can change the same record on another replica. You can even retroactively change the history by adding a new mutation with an old timestamp. So even if you got the record back, you wouldn't know what to do with it; it might be obsolete or might become obsolete.

This is great, but it is not what you always want. To give you an example, here is an SQL user trying to use CQL and expecting that this statement is going to update the record about John Doe in the table. Instead it accidentally inserts the record: John Doe wasn't hired before, and now he is hired, with this join_date. This is not something you always need. Sometimes you need a scalable database, which Scylla is, but with a traditional consistency model: I want to apply my change to the latest version of the data, and I want to make sure that whoever comes after me actually sees my write before a new update is applied.

You can also see that the WHERE clause in Scylla is actually used to provide values. So that clause is taken, and a new clause, the IF clause, is introduced to convey the intent to the database: hey, I don't want to insert the record if it's not there. This is why lightweight transactions are sometimes called conditional updates or conditional statements. You can see that in this case the statement is not applied, and you get what you want from the database.

What else can you do with lightweight transactions? The conditional clause can be quite rich. You can use conditions with all data manipulation statements: INSERT, UPDATE and DELETE. There are some shortcuts, IF EXISTS and IF NOT EXISTS, if you just want to check for the record. You can use expressions, the IN predicate, less than and greater than, so it's very similar to the WHERE clause.

Here I created a few examples. We are going to discuss that lightweight transactions are more expensive than ordinary, eventually consistent statements, and in these examples I tried to come up with good patterns for lightweight transactions. You don't use them for all of your data; you use them for the critical pieces of your data where you do need strong consistency. In this case it is bookings: if the booking is already made, you don't want to make it twice.

Another set of use cases is introduced by lightweight transactional batches, that is, conditional batches. What are conditional batches? If a batch has at least one conditional statement, the entire batch is transactional. The entire batch is applied atomically, all or nothing, and the entire batch has a consistent read view of the data. What does that mean? If you have multiple conditional statements in the batch, all of these statements have the same view of the data, and it's guaranteed to be the latest view. By the way, stop me or speed me up if I'm saying something trivial, so I can skip it, but I really thought it was important to look at the basics first.

So a conditional batch has the latest view of the data, and essentially conditional batches are very similar to classical transactions in traditional databases. The only difference is that if you have multiple conditions in the batch, there is no ELSE branch: if any of the conditions is not true, the batch does nothing. You can branch your logic in a classical database; you cannot do it with a batch. Here I created an example where you can actually do something useful with a batch. I have a static cell, n_abandoned, and I have a partition holding all of the tasks that are associated with the project. In this example I atomically update the static cell and delete all of the abandoned tasks in the project. This is a case when you might want to use a conditional batch to make multiple changes atomically. No questions so far?
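The all-or-nothing behaviour of a conditional batch can be sketched with a toy in-memory model. This is only an illustration of the semantics described above, not how Scylla implements it: the partition is a plain dict, and the names apply_conditional_batch, always, reset_counter and delete_abandoned are hypothetical, as is the exact shape of the n_abandoned example.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

# Toy model: a partition is a plain dict. All names here are hypothetical.
Partition = Dict[str, Any]
Condition = Callable[[Partition], bool]
Mutation = Callable[[Partition], None]


def apply_conditional_batch(
    partition: Partition, statements: List[Tuple[Condition, Mutation]]
) -> bool:
    """Apply every mutation in the batch, or none of them.

    All conditions are evaluated against one snapshot of the partition,
    so they share the same view of the data. There is no ELSE branch:
    if any condition is false, nothing is changed.
    """
    snapshot = copy.deepcopy(partition)
    if not all(condition(snapshot) for condition, _ in statements):
        return False
    for _, mutate in statements:
        mutate(partition)
    return True


def always(_: Partition) -> bool:
    """Stands in for a statement in the batch that has no IF clause."""
    return True


def reset_counter(p: Partition) -> None:
    p["n_abandoned"] = 0


def delete_abandoned(p: Partition) -> None:
    p["tasks"] = {k: v for k, v in p["tasks"].items() if v != "abandoned"}


project = {
    "n_abandoned": 2,  # static cell shared by the whole partition
    "tasks": {"t1": "abandoned", "t2": "open", "t3": "abandoned"},
}

batch = [
    (lambda p: p["n_abandoned"] == 2, reset_counter),
    (always, delete_abandoned),
]

first = apply_conditional_batch(project, batch)   # condition holds, batch applied
second = apply_conditional_batch(project, batch)  # counter is now 0, nothing happens
```

Running the same batch a second time changes nothing, because the condition on n_abandoned no longer holds and the unconditional delete is skipped along with it.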
- You mentioned that in a batch statement the conditions have a single view of the data.
- Yes.
- The latest view.
- Yes.
- But what is the latest view, since we were talking about eventual consistency?
- I'll get to that, thank you for the question.

So we have been talking about consistency and traditional consistency. There are many consistency levels in Scylla: there is QUORUM, ALL, ANY, and you might ask: are we adding new consistency levels, or are we using the existing levels? Lightweight transactions live in a bit of a world of their own, and they add their own consistency statement. This is a grammar example, the part of CQL for setting the default consistency. There is SERIAL and LOCAL_SERIAL, and this is independent from the other consistency levels.

What does this consistency mean? Let me address the question. When you execute the condition, you actually read data. The order of execution is: you look up the rows, you check the conditions, and if the conditions are true, you apply the updates. So when you check the conditions, you read some version of the data. In order to make sure that the version you read is the latest version, lightweight transactions do not allow you to assign a timestamp to your writes. They select the timestamp for you, and this is how we ensure that the latest view is used when checking conditions.

SERIAL and LOCAL_SERIAL set whether it's the latest view from the entire cluster or from the data center. Using LOCAL_SERIAL is a bit of an advanced usage, I'd say. You cannot have a partially rotten egg, so by default you only have SERIAL consistency, but in some cases, if you know what you're doing, you can use LOCAL_SERIAL, and you can tweak the standard consistency setting to improve performance and reduce latency. If we have time, we'll get to that.
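The contrast between an ordinary write with a client-side timestamp and the read, check, apply order of a conditional write can be sketched as a toy model. Everything here is an assumption for illustration: ToyRow and its methods are hypothetical names, a single cell stands in for a row, and choosing "last timestamp plus one" is just one simple way to pick a timestamp newer than what was read; the talk does not say how Scylla actually chooses it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: str
    timestamp: int


class ToyRow:
    """Single-cell toy row; not Scylla's data model."""

    def __init__(self) -> None:
        self.cell: Optional[Cell] = None

    def plain_write(self, value: str, timestamp: int) -> None:
        # Ordinary write: no read of the old value is reported back.
        # The cell with the higher timestamp wins, whatever the arrival order.
        # (Ties are ignored in this sketch.)
        if self.cell is None or timestamp > self.cell.timestamp:
            self.cell = Cell(value, timestamp)

    def conditional_write(self, expected: Optional[str], value: str) -> bool:
        # 1. Read the current version of the data.
        current = self.cell
        # 2. Check the condition against what was read.
        if (current.value if current else None) != expected:
            return False
        # 3. Apply with a timestamp chosen for the caller (assumption:
        #    one more than the timestamp that was read).
        next_timestamp = current.timestamp + 1 if current else 1
        self.cell = Cell(value, next_timestamp)
        return True


row = ToyRow()
row.plain_write("new", timestamp=100)
row.plain_write("late arrival", timestamp=50)  # older timestamp loses

ok = row.conditional_write(expected="new", value="newer")
stale = row.conditional_write(expected="new", value="other")  # condition is now false
```

The caller of conditional_write never supplies a timestamp, which mirrors the restriction described above.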
I've been saying that IF is very similar to WHERE. I also said that conditions are checked after the row is looked up. This is an important difference between the IF clause and the WHERE clause: the WHERE clause can use a secondary index and can filter records at the storage level, while the IF clause is applied afterwards. It's like a predicate: if the predicate is true we continue, and if it's not true we halt.

How else are conditions different? They apply to the row that was looked up. We can also use expressions, although for now not all expressions are available. This has to do with some reconciliation we need to do with the features we added recently; eventually the expression grammar, the expression power, will be pretty much the same. Some of the functions are not allowed, like the token function, which doesn't make a lot of sense in conditions.

What can't you do with lightweight transactions? By the way, all of these features and restrictions so far are pretty similar to Cassandra. I'm going to talk about the differences, and some of the limitations we inherited in order to be compatible with Cassandra. You can't use the counter data type; it doesn't make sense. Your lightweight transactions cannot span multiple partitions: several different partitions may reside on different nodes, and we don't do cross-node transactions yet; maybe we'll get to it sometime. You cannot supply a custom timestamp; that would upset the entire logic of lightweight transactions. You can't supply the UNLOGGED clause; it's ignored, so lightweight transactions are always logged, they're always written to the commit log.

So how are we different? I would like to conclude with the few differences that we have with Cassandra. There aren't that many; we tried to preserve compatibility where it made sense. There is one case which I would like to highlight: Scylla always provides a result set. What does that mean?
Let's look at this example. Here you can see the result set of the batch statement. The result set consists of the execution state, whether the mutation is applied or not, and also the value of the old record that we used to check the conditions. Cassandra, for some reason, does not return the old record if the condition is applied. This makes life quite messy on the client side, because you cannot use prepared statements with lightweight transactions. We decided that if we always return the result set, we are going to be compatible with most cases and make life easier for clients and drivers. So this is one inconsistency that we have. Maybe we'll get feedback that this was not a good idea and we'll fix it or introduce a switch, but so far we thought we're just going to do it better.
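The shape of such a result set can be sketched with a small in-memory model. This is a hedged illustration only: update_if is a hypothetical helper, the table is a plain dict, and the "[applied]" column name and the choice to echo back only the columns named in the condition are assumptions of this sketch rather than a statement of what the server returns.

```python
from typing import Dict, Optional

Row = Dict[str, object]
Table = Dict[str, Row]


def update_if(table: Table, key: str, expected: Row, new_values: Row) -> Row:
    """Toy conditional update that always returns a result row.

    The result holds the execution state plus the old values of the
    columns used in the condition, whether or not the update applied.
    """
    old: Optional[Row] = table.get(key)
    applied = old is not None and all(
        old.get(column) == value for column, value in expected.items()
    )
    if applied:
        table[key] = {**old, **new_values}
    result: Row = {"[applied]": applied}
    for column in expected:
        # Old value that the condition was checked against (None if no row).
        result[column] = old.get(column) if old is not None else None
    return result


employees: Table = {"jane": {"join_date": "2018-05-01"}}

hit = update_if(
    employees, "jane", {"join_date": "2018-05-01"}, {"join_date": "2019-01-01"}
)
miss = update_if(
    employees, "john", {"join_date": "2018-05-01"}, {"join_date": "2019-01-01"}
)
```

Because both calls return a row with the same columns, client code can handle the applied and not-applied cases uniformly, which is the convenience argued for above. The missing row for "john" is not inserted.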