Scaling Databases – Web Development


I want to talk for a moment about how to scale databases. Now, like databases themselves, this is a very broad topic, but there’s a couple concepts I’d like to introduce because they’re going to affect us in this class. There’s two reasons you might need to scale a database. One is too much load, as in a database that’s just doing too much work. You’ve got some website that is getting millions of requests a day–thousands of requests a second– and you’ve got this one machine that’s got your data on it and it just can’t keep up with all the work. So what you might do in this case is take your database– which are often represented as little cynlinders– and replicate it to other databases. So every time we insert a piece of data to this database, we send it over to the other guys. So all of our database writes go into this one master database. In turn, all of that data gets replicated to these slave databases. So now we have 3 databases with all of our data on it, and so if you’ve got a site like Reddit that’s getting thousands and thousands of requests a second, instead of sending all of the database reads– all the lookups–to this database, we can send it to these databases. And that alleviates a lot of this load off of this master database. Now, there are much more complicated schemes where you might have a multi-master system or a bunch of databases that are all working together, but this is a fairly common setup and a fairly common approach to spreading around the read. Because generally in most systems you have a lot more reads than you do writes. So if your master can keep up with all of the writes, your slaves can handle all of the reads. There are some downsides to this. One is that it doesn’t increase the speed of writes. We’re still bottlenecked by this master receiving all this data, who in turn has to send it to all these slaves. So if we can still fit all our writes on this one machine, we’re in good shape and we can spread the reads over as many machines as we need. Some of these slaves can even replicate the other machines. Another downside is this notion of replication lag. And this occurs when you write a piece of data to the master but the read hits one of these slaves before the data has propagated, and that’s called replication lag. You can sometimes get some funny behavior. And the reason I bring it up now is because in the database we’ll be using in this class, you can occasionally see symptoms of this type of behavior. It’s not necessarily owing to a master/slave setup, but this is the general concept. What happens if you have too much data for one machine? The replication handles the case where you have too much load. Basically, you’ve got to do too many reads, so you can replicate your database–scale this up. What if you just have too much data, your master database–or any one of these databases–can’t even hold of your data, it doesn’t fit in memory or even in disk, there’s just too much? One of the things you can do is you can shard the database. This is a fairly simple approach, where instead of having one master you might have a couple that are all the same size, despite my drawing. So let’s say we’re storing our links database. One approach to shardding might be to store links 1-100 here– actually, more like 1-a billion here– and 101-200 here and 201-300 here, and basically you store some of your links here and some of your links here and some of your links here in this database. Obviously, 1-100 probably isn’t the correct approach. You could probably use a hashing approach as if we had a hashtable and these were all cells in our hashtable and you hash on some particular key. In this case, I’m referring to link ids–the id of whatever you’re storing– to figure out which database you want to store it in. This is cool because now if we triple our write load from this scenario over here, we have 3 database machines to handle it. Likewise, we already have 3 machines to handle the read load. And of course you can replicate these machines as well. A lot of systems both shard and replicate your machines if you really want to get fancy. But there are some downsides to shardding as well. One of the downsides is the queries get much more complex. What if I said that this helps the case of find me the link whose id is 150? We just say, okay, which does 150 go to? You know what hash is to this machine, and we do our read and that’s that. What if I said get me all the links sorted by hotness, for example. Well, we don’t know where that link may reside, especially if we’re using a hashing algorithm that’s more sophisticated than 1-100. That’s called a range query, and a range query might have to hit all of your machines. Then we’ve lost one of our advantages to using this. Now, you could replicate these to spread that around, but you have to hit all these machines and then merge the results probably and do the sort again and memory. Some queries become very, very complex. Another downside is that joins become difficult or impossible. That makes sense. If we have this massive database that has multiple tables on it, we can do joints. But if we start shardding our database up– we have one database that doesn’t even fit on one machine. So if we can’t fit one database on our machine, how are we going to fit multiple databases? This notion that you have all of your tables in the same place to do a join– to do those fancy join SQL queries– becomes a lot more difficult. Now, there’s a lot of research going on of building systems that can overcome these downsides, but if you take the naive approach to replicating and charting, these are some of the things you’ll have to deal with. Generally the benefits outweigh the downsides, because if you design your data in such a way that you don’t need to do joins or you don’t need to do complex queries, you’re in very good shape with both of these. Or maybe you have multiple setups– one setup for handling your general load and another setup for handling your complex queries. That’s okay, and that’s actually very common. Now, I wanted to introduce these concepts because the database we’ll be using in this class– Google the App Engine Datastore– actually has some of these limitations. There are no joins allowed, even though it does provide a SQL interface. And a lot of more complex queries that you could do in a general SQL database, you can’t do in Google Datastore, but that’s okay. You get this really nice benefit of having a database that is shardded and replicated to wazoo and back. It’s actually very fast and reliable. You’re probably not going to have to worry about systems crashing. It’s something that replicating a database can help with and shardding a database can help with. Okay, so I just wanted to introduce those concepts. Let’s have a quick quiz on it before we move along. Which is an appropriate technique for increasing the read speed from a database? Get a faster machine. Replicate the database. Store less data. Or press the turbo button.

Danny Hutson

Leave a Reply

Your email address will not be published. Required fields are marked *