Redis Presharding (antirez.com)
111 points by mattyb on Feb 25, 2011 | 28 comments


The one thing that puts me off Redis sharding is that it breaks cross-key operations, like sunion and sinter. I get a lot of use out of those.

I wonder if it would make sense for Redis to support cross-key operations where you can tell it which other Redis server to retrieve a key from?

    # Connect to redis server 1 (redis-py style)
    import redis
    r = redis.Redis(host='server1.dev', port=6379)
    # Tell it to perform an intersection with a key on server 2
    # (sinterremote is the proposed command, not an existing one)
    r.sinterremote('my-key', 'server2.dev:6379/remote-key')
    # Same, but store the result on server 1
    r.sinterremotestore('local-storage-key', 'my-key', 'server2.dev:6379/remote-key')


Hi Simon, this is a good question. I think these ops would ultimately be too slow to be useful; either way, you need a massive amount of data transfer to compare things across different instances.

But Redis Cluster supports the MIGRATE command, so what we plan is to provide the client with a very cool way to do this kind of thing.

Basically MIGRATE is able to transfer a key from one instance to another (or to copy it). The performance is pretty impressive. So the idea is that you can migrate two keys from two different instances into a "temporary" instance that you use just for computation, and then perform whatever you want in that instance.

This way we'll provide all the tools needed for people who understand what they are doing. But we will not provide direct support for something that, in the general case, can't be super fast and horizontally scalable.


It sounds like I'll just need to make sure I keep the sets I'm going to intersect with each other on the same node. They're usually sets of integers, so that shouldn't be too much of a problem for the kinds of things I use Redis for.


Cool that it works for you, but for a use case where you have many distributed sets and you need to intersect cross-node, the simplest thing you could do is transfer the smallest set to the other node via MIGRATE under a temporary key, intersect, then delete it. At least we know there is an exit strategy :)
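
In redis-py that exit strategy might look something like this (a minimal sketch: the host names, key names, and MIGRATE db/timeout arguments are made up, and the SUNIONSTORE copy is just one assumed way to keep the source set intact, since MIGRATE deletes the key it moves):

    import redis

    node_a = redis.Redis(host='node-a.dev', port=6379)
    node_b = redis.Redis(host='node-b.dev', port=6379)

    def cross_node_sinter(key_a, key_b, tmp_key='tmp:xfer'):
        # Ship the smaller set, so less data crosses the network.
        if node_a.scard(key_a) <= node_b.scard(key_b):
            src, dst, dst_host, small, local = node_a, node_b, 'node-b.dev', key_a, key_b
        else:
            src, dst, dst_host, small, local = node_b, node_a, 'node-a.dev', key_b, key_a
        src.sunionstore(tmp_key, [small])   # copy the set under a temporary name
        # MIGRATE host port key destination-db timeout
        src.execute_command('MIGRATE', dst_host, 6379, tmp_key, 0, 1000)
        try:
            return dst.sinter(tmp_key, local)
        finally:
            dst.delete(tmp_key)             # clean up the temporary key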


that will work (as well as can be hoped for).

the process can be thought of as a map-reduce. the first pass would be to figure out where all your SETs are located and how big they are. then intersect all co-located SETs. then (in parallel) take the smaller SETs and MIGRATE them towards the larger SETs, intersect those, and repeat (respecting the new sizes). a pyramid of SET intersections can be built this way until the final two SETs are intersected. a process will need to coordinate all of this, but luckily intersections are idempotent, so if one parallel job finishes quicker it doesn't need to wait. (a serialized sketch of the coordinator follows below.)
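
a rough sketch of that coordinator (my illustration only, not an existing framework; it assumes redis-py clients, port 6379 everywhere, and runs the pyramid one step at a time instead of in parallel):

    import redis

    def pyramid_intersect(nodes):
        # nodes: list of (redis.Redis client, host, [set keys held on that node])
        survivors = []
        # pass 1: intersect all co-located SETs into one temp key per node
        for i, (client, host, keys) in enumerate(nodes):
            tmp = 'tmp:isect:%d' % i        # distinct names so MIGRATEs can't collide
            client.sinterstore(tmp, keys)
            survivors.append([client, host, tmp, client.scard(tmp)])
        # pass 2: ship the smallest intermediate SET to the node holding the
        # largest one, intersect there, and repeat until a single SET remains
        while len(survivors) > 1:
            survivors.sort(key=lambda s: s[3])   # sort by cardinality
            small = survivors.pop(0)
            big = survivors[-1]
            small[0].execute_command('MIGRATE', big[1], 6379, small[2], 0, 1000)
            big[0].sinterstore(big[2], [big[2], small[2]])
            big[0].delete(small[2])
            big[3] = big[0].scard(big[2])        # respect the new size
        return survivors[0][1], survivors[0][2]  # (host, key) of the final result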

W/ redis speed, the overhead (for the coordinator) of getting the sizes of the different SETs in all these steps should be minimal compared to the time taken to MIGRATE the data (meaning: for free). also w/ redis' speed the MIGRATE will be very fast, and the cross-node join (or cross-node intersection) bottlenecks on network I/O. so if a good framework for this redis map-reduce is created, it will be a pretty optimal setup, and it won't bloat the server w/ tasks that cannot be done directly in the server (cross-node intersections are done at something more like a proxy level, since they require intermediate results).

this problem is a hard one. data analytics stores optimise for it by storing data redundantly w/ a "pre-joined" colocation strategy, which works for star schemas and a limited number of tables, but doesn't make sense w/ 1000s of SETs. so this is a really good solution and classic redis anti-bloat.


We have the same need at Disqus, and I talked about our solution here: http://news.ycombinator.com/item?id=2247132


So if I understand correctly, this is just a neat trick using existing Redis features (and relying on its low memory footprint) to easily implement sharding today?

But in an ideal world (Redis cluster) you would run a single node per machine, and Redis would automatically handle sharding, replication etc?


More or less, yes. Actually, with Redis Cluster you probably want one instance per core anyway, so that you use all your CPU power. But Redis Cluster will handle all the sharding: adding new nodes, removing nodes, selecting the right node, and so forth.

The point of the article is that with what is available now in stable releases, you can set up a scheme that may work for you.


Process migration (moving an instance) is not easy. At the simplest level, you still want some kind of naming/binding scheme so that the migration is transparent at the application layer (well, DNS, maybe). Besides, with your proposal in "Moving Instance", there are still race conditions. For example, when you run "slaveof no one" on the new instance, which instance (old or new) will handle the incoming requests? It seems to me that either way you will end up in an inconsistent state (or handle some requests incorrectly). Well, I'd need to look up the underlying mechanism of "slaveof" to be confident about that, but it is worth pointing out.

I'd like to be able to use Redis Cluster as soon as it is available, so here is my question: how can I help? Recently I wanted to start writing some scripts for rehashing, but there are some problems; namely, I cannot just "get/set" everything (e.g., I cannot "get/set" a list in some serialized format). Do you have a recommended way of doing that?


Hello liuliu, there is no race:

when all the clients are pointed to write to the new instance, it is still a slave, but since Redis slaves can accept writes, you are already OK. Then you promote the instance to master just to disconnect it from the real master without issues.
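
A sketch of that hand-off with redis-py (the host names are illustrative, and polling master_link_status is one assumed way to detect that the initial sync has finished):

    import time
    import redis

    old_host, new_host = 'old-host.dev', 'new-host.dev'
    new = redis.Redis(host=new_host, port=6379)

    # 1. make the new instance a replica of the old one
    new.slaveof(old_host, 6379)

    # 2. wait for the initial synchronization to complete
    while new.info().get('master_link_status') != 'up':
        time.sleep(0.1)

    # 3. repoint all the clients at new_host (config push, DNS flip, ...);
    #    writes they send to the slave are accepted, so nothing is lost

    # 4. promote the new instance, detaching it from the old master
    new.slaveof()   # with no arguments redis-py sends SLAVEOF NO ONE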

Redis Cluster is something we (the Redis core team, that is, Pieter and I) can't get help with for now, but we will need help soon with testing.


Some client A might write to the master, another client B might write to the slave, and other clients will see these updates in a different order depending on which instance they are connected to. Can that happen?


Count me in, I'm also eagerly waiting for the cluster feature.

Have you considered relying on Zookeeper for the gossip heavy-lifting?

Yes, it's Java (ugh) and would add an external dependency (ugh), but it's battle-tested and could perhaps save you from reinventing such a complex wheel?


Considering they won't use a library for the networking, I doubt they are going to add a Java dependency for config management.


It would not be a java dependency but rather a component dependency (you talk to zookeeper over the network).

Zookeeper is also not just config management, it's state management. It helps to solve some of the really nasty problems of distributed state, e.g. achieving quorum and dealing with split brain situations.

However, it's been a while since I read up on the planned redis-cluster design, so I'm not sure how they are tackling these issues.


We expand and shrink clusters in membase under heavy load with no issues. The worst thing that happens is that it sometimes takes longer. There is no inconsistency.

It doesn't look like redis is doing exactly what we're doing, but since our approach has been working well for a long time now, with plenty of documentation, I'm hoping the knowledge we've shared about it can help others learn how to do similar things.


To be honest, when I saw the membase design what I thought was: hey, this is from the Redis Cluster design (which predates the membase announcement by a lot).

But then I realized that it's pretty common to invent the same things. You should try doing the same, especially since the idea was made public by Redis before. Check for yourself in the git history.


The blog post is from June, but the code I'm talking about predates redis by about three years.


"But then I realized that it's pretty common to invent the same things. You should try doing the same".

That said, I really don't care at all. I want to build stuff that works, have fun, and understand new things. This is why I build Redis.


"I'm hoping the knowledge we've shared about it can help others learn how to do similar things." "The blog post is from June, but the code I'm talking about predates redis by about three years."

@dippsy, what exactly are you insinuating here? It sounds like you're suggesting #Redis somehow borrowed/learned from concepts in your unreleased code?? It's actually quite rude.


No, he was replying to @antirez request of "Check yourself in the git history".

I don't think anybody is insinuating anything here. The same solution to a common problem has been found independently; this happens all the time (see, e.g., the Erlang process model and the Actor model).

Down the line, this isn't even important. Can we move along now and build some awesome databases?


yes please.


See also: How we've been doing this in memcached and membase

http://dustin.github.com/2010/06/29/memcached-vbuckets.html


vbuckets are exactly one of the ideas proposed in one of the old design documents for Redis Cluster. And indeed the final Redis Cluster uses this concept; in our land they are called "hash slots".

I think the concept is much more interesting than consistent hashing, as it is a discrete way to see the problem: divide the key space into N pieces and assign these pieces to different nodes. Resharding this way is easy: just move one hash slot from one instance to another. For moving the data around, Redis uses the MIGRATE command.
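
A toy illustration of the idea (my sketch, not the actual Redis Cluster code: the slot count, hash function, and node names are all placeholders):

    import zlib

    N_SLOTS = 4096   # fixed up front; each slot is assigned to exactly one node

    def hash_slot(key):
        # any stable hash of the key works for the sketch; crc32 stands in
        # for whatever hash the final design uses
        return zlib.crc32(key.encode()) % N_SLOTS

    # the slot table maps every slot to the node that currently owns it
    slot_table = {s: 'node-%d.dev' % (s % 3) for s in range(N_SLOTS)}

    def node_for(key):
        return slot_table[hash_slot(key)]

    # resharding = reassigning one slot: MIGRATE its keys, then flip the entry
    slot_table[42] = 'node-3.dev'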


The other subtlety is in being able to prevent the server from servicing requests when it shouldn't be (to avoid consistency issues).

What you don't want is a client that isn't aware of the new configuration, or one reading and writing before the new node is ready. That's where vbucket states allow us to move things atomically.

It's not obvious from your doc how one can actually accomplish this with zero downtime while maintaining consistency.


How about just using consistent hashing? http://en.wikipedia.org/wiki/Consistent_hashing

In this scheme, you can add and remove servers without moving all your data.

There is apparently a memcached plugin to do this: http://www.mikeperham.com/2009/01/14/consistent-hashing-in-m...
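
A tiny consistent-hash ring in Python, for illustration (the node names and the replica count are arbitrary):

    import bisect
    import hashlib

    class HashRing:
        # minimal consistent-hash ring; virtual replicas smooth the distribution
        def __init__(self, nodes, replicas=100):
            self.ring = sorted(
                (self._hash('%s:%d' % (node, i)), node)
                for node in nodes for i in range(replicas)
            )
            self.points = [h for h, _ in self.ring]

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def node_for(self, key):
            # first ring point clockwise from the key's hash, wrapping around
            i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(['server1.dev', 'server2.dev', 'server3.dev'])
    print(ring.node_for('my-key'))   # only ~1/N of keys move if a node is added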


Consistent hashing is a common baseline, but it's not the hardest problem in this.

You still have to move some data when you shuffle nodes, and the devil is in the details there: when/how to move chunks without interrupting access or violating integrity, how to ensure n copies (if redundancy is required), how to prevent hot spots. And of course all the aggregation/routing needed to allow complex queries that span nodes.

That's the reason why there are very few out-of-the-box sharding solutions. It's just very tough to get right, and one size won't fit all.


The idea of presharding is exactly to avoid consistent hashing. With memcached, consistent hashing works because you are only interested in relocating the smallest number of keys when adding nodes; you don't need to rehash, since memcached is a cache: eventually the new values will be created at the right position. With Redis the idea does not work well, because you actually need to move data between instances. With presharding this is not required.
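
Client-side presharding is then just fixed modulo hashing over a deliberately large, never-changing number of instances (a sketch; the host, port range, and shard count are illustrative):

    import zlib
    import redis

    # 32 instances from day one, initially all on one box on consecutive ports
    SHARDS = [('shard-host.dev', 6379 + i) for i in range(32)]
    pool = [redis.Redis(host=h, port=p) for h, p in SHARDS]

    def shard_for(key):
        # plain modulo hashing is fine: the shard count never changes, so keys
        # never move between shards -- only whole instances move between
        # machines (via replication, as the article describes)
        return pool[zlib.crc32(key.encode()) % len(SHARDS)]

    shard_for('my-key').sadd('my-key', 42)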


Personally, I would prefer to hash my fields like this:

1. Hash each field in the key in some fashion

2. Concatenate the hashes, for example implode(".", $hashes)

3. Now you've got a point in the space of all possible concatenated hashes. On each client, have a partition like this for each key:

["", "abf34.9f8c7.92833", ...]

4. Go through the partition and find the last point that is smaller than your point. That point maps to your shard.

5. Monitor your machines. If one of your shards becomes too full, or is seeing too much traffic, SPLIT IT. You do this simply by replicating the data to one or more new shards (as the article describes) and introducing one or more points in the partition right after the point mapping to that shard. Point them at the new shards. And update all the clients! (A sketch of the lookup follows below.)
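
In Python the lookup might look like this (a sketch of the steps above; the hash choice, the truncation to 5 hex characters, and the shard names are arbitrary):

    import bisect
    import hashlib

    # sorted partition points; "" is the minimum so every point finds a match,
    # and points[i] maps to shards[i]
    points = ['', 'abf34.9f8c7.92833']
    shards = ['shard-0.dev', 'shard-1.dev']

    def point_for(fields):
        # steps 1-2: hash each field, then concatenate the hashes with "."
        return '.'.join(hashlib.md5(f.encode()).hexdigest()[:5] for f in fields)

    def shard_for(fields):
        # step 4: the last partition point <= our point owns the key
        return shards[bisect.bisect_right(points, point_for(fields)) - 1]

    # step 5: splitting a hot shard = replicating its data to a new shard,
    # inserting a new point right after the old one, and updating every client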

That's all Greg



