Couch Cluster
A “Couch Cluster” is composed of multiple “partitions”, and each partition is composed of multiple replicated DB instances. We call each replica a “virtual node”: a DB instance hosted inside a "physical node", which is a CouchDB process running on a machine. A virtual node can migrate across physical nodes for various reasons, such as …
- when a physical node crashes
- when more physical nodes are provisioned
- when the workload of the physical nodes is unbalanced
- when there is a need to reduce latency by migrating a virtual node closer to the client
Proxy
The "Couch Cluster" is frontend by a "Proxy", which intercept all the client's call and forward it to the corresponding "virtual node". In doing so, the proxy has a "configuration DB" which store the topology and knows how the virtual nodes are distributed across physical nodes. The configuration DB will be updated at more DBs are created or destroyed. Changes of the configuration DB will be replicated among the proxies so each of them will eventually share the same picture of the cluster topology.
The diagram shows a single DB split into 2 partitions (the blue and orange partitions). Each partition has 3 replicas, one of which is the primary and the other two secondaries.
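To make the rest of the walkthrough concrete, here is one way the configuration DB might record this topology. The post does not prescribe a schema, so the field names below are hypothetical; the replica-to-machine assignment matches the example used in the steps that follow.

```python
# Hypothetical topology document stored in the proxy's configuration DB for "dbname".
# Field names are illustrative only; the post does not define an exact schema.
topology_doc = {
    "_id": "dbname",
    "partitions": {
        "dbname_p1": {
            "primary": "v1-1",
            "replicas": {"v1-1": "M1", "v1-2": "M2", "v1-3": "M3"},
        },
        "dbname_p2": {
            "primary": "v2-1",
            "replicas": {"v2-1": "M3", "v2-2": "M1", "v2-3": "M2"},
        },
    },
}
```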
Create DB
- Client calls the Proxy with URL=http://proxy/dbname; HTTP_Command = PUT /dbname
- The Proxy needs to determine the number of partitions and the number of replicas. Let's say we have 2 partitions and each partition has 3 copies, so there will be 6 virtual nodes: v1-1, v1-2, v1-3, v2-1, v2-2, v2-3.
- The Proxy also needs to determine which virtual node is the primary of each partition. Let's say v1-1 and v2-1 are primaries and the rest are secondaries.
- The Proxy then needs to determine which physical node hosts each of these virtual nodes, say M1 (v1-1, v2-2), M2 (v1-2, v2-3), M3 (v1-3, v2-1).
- The Proxy records its decision in the configuration DB.
- The Proxy calls M1 with URL=http://M1/dbname_p1; HTTP_Command = PUT /dbname_p1, and then calls M1 again with URL=http://M1/dbname_p2; HTTP_Command = PUT /dbname_p2.
- The Proxy repeats the previous step for M2 and M3 (see the sketch below).
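A minimal sketch of this fan-out, assuming the hypothetical topology_doc shown earlier and the Python requests library; create_db is an illustrative helper name, not an API defined in the post.

```python
import requests

def create_db(topology_doc):
    """Fan out PUT requests so every physical node creates the partition DBs it hosts.
    Assumes the hypothetical topology_doc structure sketched earlier."""
    for partition, info in topology_doc["partitions"].items():
        for vnode, machine in info["replicas"].items():
            # e.g. PUT http://M1/dbname_p1 for the replica of partition 1 hosted on M1
            resp = requests.put("http://%s/%s" % (machine, partition))
            resp.raise_for_status()
```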
List all DBs
- Client calls the Proxy with URL=http://proxy/_all_dbs; HTTP_Command = GET /_all_dbs
- The Proxy looks up the configuration DB to determine all the DBs.
- The Proxy returns the list to the client.
Get DB info
- Client calls the Proxy with URL=http://proxy/dbname; HTTP_Command = GET /dbname
- The Proxy looks up all the DB's partitions in the configuration DB. For each partition, it locates the virtual node that hosts the primary copy (v1-1, v2-1). It also identifies the physical nodes that host these virtual nodes (M1, M3).
- For each physical node, say M1, the Proxy calls it with URL=http://M1/dbname_p1; HTTP_Command = GET /dbname_p1
- The Proxy does the same for M3.
- The Proxy combines the results from M1 and M3 and forwards them to the client (see the sketch below).
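A sketch of how the proxy might merge the per-partition responses. CouchDB's GET /db returns fields such as doc_count; summing doc_count across partitions is an assumption about the desired semantics, and other fields (e.g. update_seq) do not combine as simply.

```python
import requests

def get_db_info(topology_doc):
    """Query the primary replica of each partition and merge the results.
    Only doc_count is combined here; how to merge other fields is left open."""
    combined = {"db_name": topology_doc["_id"], "doc_count": 0}
    for partition, info in topology_doc["partitions"].items():
        machine = info["replicas"][info["primary"]]
        part_info = requests.get("http://%s/%s" % (machine, partition)).json()
        combined["doc_count"] += part_info.get("doc_count", 0)
    return combined
```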
Delete DB
- Client calls the Proxy with URL=http://proxy/dbname; HTTP_Command = DELETE /dbname
- The Proxy looks up which machines host the clustered DB and finds M1, M2, M3.
- The Proxy calls M1 with URL=http://M1/dbname_p1; HTTP_Command = DELETE /dbname_p1, and then calls M1 again with URL=http://M1/dbname_p2; HTTP_Command = DELETE /dbname_p2.
- The Proxy does the same for M2 and M3.
Get all documents of a DB
- Client calls the Proxy with URL=http://proxy/dbname/_all_docs; HTTP_Command = GET /dbname/_all_docs
- The Proxy looks up all the DB's partitions in the configuration DB. For each partition, it randomly picks a virtual node that hosts a copy (e.g. v1-1, v2-3). It also identifies the physical nodes that host these virtual nodes (M1, M2).
- The Proxy calls M1 with URL=http://M1/dbname_p1/_all_docs; HTTP_Command = GET /dbname_p1/_all_docs.
- The Proxy does the same for M2, using dbname_p2.
- The Proxy combines the results from M1 and M2 and forwards them to the client (see the sketch below).
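A sketch of this scatter-gather, again assuming the hypothetical topology_doc and requests. Re-sorting the merged rows by document id restores the global ordering a client would expect from a single _all_docs call.

```python
import random
import requests

def all_docs(topology_doc):
    """GET _all_docs from one randomly chosen replica per partition and merge the rows."""
    rows, total = [], 0
    for partition, info in topology_doc["partitions"].items():
        vnode = random.choice(list(info["replicas"]))       # any copy will do for a read
        machine = info["replicas"][vnode]
        result = requests.get("http://%s/%s/_all_docs" % (machine, partition)).json()
        rows.extend(result["rows"])
        total += result["total_rows"]
    rows.sort(key=lambda row: row["id"])                    # restore global id ordering
    return {"total_rows": total, "rows": rows}
```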
Create / Update a document
- Client calls the Proxy with URL=http://proxy/dbname/docid; HTTP_Command = PUT /dbname/docid
- The Proxy invokes "select_partition(docid)" to determine the partition, then looks up the primary copy of that partition (e.g. v1-1). It also identifies the physical node (e.g. M1) that hosts this virtual node.
- The Proxy calls M1 with URL=http://M1/dbname_p1/docid; HTTP_Command = PUT /dbname_p1/docid (see the sketch below).
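The post names "select_partition(docid)" but does not define it; a simple hash-mod scheme is one possibility (a consistent-hashing ring would be another). The routing helper below is a sketch under that assumption.

```python
import hashlib
import requests

def select_partition(docid, num_partitions=2):
    """Map a document id to a partition index; hash-mod is one possible (assumed) scheme."""
    digest = hashlib.md5(docid.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def put_document(topology_doc, docid, doc):
    """Route a write to the primary replica of the document's partition."""
    partition = "%s_p%d" % (topology_doc["_id"], select_partition(docid) + 1)
    info = topology_doc["partitions"][partition]
    machine = info["replicas"][info["primary"]]             # writes always go to the primary
    return requests.put("http://%s/%s/%s" % (machine, partition, docid), json=doc).json()
```

Reads (GET a document) follow the same routing, except that any randomly chosen replica of the partition can serve the request.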
Get a document
- Client calls the Proxy with URL=http://proxy/dbname/docid; HTTP_Command = GET /dbname/docid
- The Proxy invokes "select_partition(docid)" to determine the partition, then randomly picks a copy of that partition (e.g. v1-3). It also identifies the physical node (e.g. M3) that hosts this virtual node.
- The Proxy calls M3 with URL=http://M3/dbname_p1/docid; HTTP_Command = GET /dbname_p1/docid
Delete a document
- Client calls the Proxy with URL=http://proxy/dbname/docid?rev=1234; HTTP_Command = DELETE /dbname/docid?rev=1234
- The Proxy invokes "select_partition(docid)" to determine the partition, then looks up the primary copy of that partition (e.g. v1-1). It also identifies the physical node (e.g. M1) that hosts this virtual node.
- The Proxy calls M1 with URL=http://M1/dbname_p1/docid?rev=1234; HTTP_Command = DELETE /dbname_p1/docid?rev=1234
Create a View design doc
- Client calls the Proxy with URL=http://proxy/dbname/_design/viewid; HTTP_Command = PUT /dbname/_design/viewid
- The Proxy determines all the virtual nodes of this DB and identifies all the physical nodes (e.g. M1, M2, M3) that host them.
- The Proxy calls M1 with URL=http://M1/dbname_p1/_design/viewid; HTTP_Command = PUT /dbname_p1/_design/viewid, and then calls M1 again with URL=http://M1/dbname_p2/_design/viewid; HTTP_Command = PUT /dbname_p2/_design/viewid.
- The Proxy does the same for M2 and M3 (see the sketch below).
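Unlike a regular document, the design document goes to every replica of every partition, since each virtual node must be able to build the view over its own data. A sketch, with the same assumed topology structure:

```python
import requests

def put_design_doc(topology_doc, viewid, design_doc):
    """Push the design document to every virtual node so each replica can index locally."""
    for partition, info in topology_doc["partitions"].items():
        for vnode, machine in info["replicas"].items():
            url = "http://%s/%s/_design/%s" % (machine, partition, viewid)
            requests.put(url, json=design_doc).raise_for_status()
```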
Query a View
- Client calls the Proxy with URL=http://proxy/dbname/_view/viewid/attrid; HTTP_Command = GET /dbname/_view/viewid/attrid
- The Proxy determines all the partitions of "dbname", and for each partition it randomly picks a copy (e.g. v1-1, v2-1). It also identifies the physical nodes (e.g. M1, M3) that host these virtual nodes.
- The Proxy calls M1 with URL=http://M1/dbname_p1/_view/viewid/attrid; HTTP_Command = GET /dbname_p1/_view/viewid/attrid
- The Proxy does the same for M3, using dbname_p2.
- The Proxy combines the results from M1 and M3. If "attrid" is a map-only view, the Proxy simply concatenates the results together. But if "attrid" has a reduce function defined, the Proxy invokes the view engine's reduce() function with rereduce = true. The Proxy then returns the combined result to the client (see the sketch below).
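A sketch of the merge step. The rereduce_fn argument is a hypothetical stand-in for the view engine's reduce(keys, values, rereduce) call, e.g. a sum reduce would be rereduce_fn=lambda keys, values, rereduce: sum(values); the URL layout follows the simplified paths used in this post rather than CouchDB's full _design/_view path.

```python
import random
import requests

def query_view(topology_doc, viewid, attrid, rereduce_fn=None):
    """Query one replica per partition and merge the partial results."""
    all_rows = []
    for partition, info in topology_doc["partitions"].items():
        vnode = random.choice(list(info["replicas"]))
        machine = info["replicas"][vnode]
        url = "http://%s/%s/_view/%s/%s" % (machine, partition, viewid, attrid)
        all_rows.extend(requests.get(url).json()["rows"])
    if rereduce_fn is None:                                  # map-only view: concatenate and sort
        all_rows.sort(key=lambda row: row["key"])
        return {"rows": all_rows}
    # reduced view: combine the partial reductions from each partition with rereduce = true
    value = rereduce_fn(None, [row["value"] for row in all_rows], True)
    return {"rows": [{"key": None, "value": value}]}
```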
Replication within the Cluster
- Periodically, the proxies replicate the changes of the configuration DB among themselves. This ensures that all the proxies share the same picture of the topology.
- Periodically, the Proxy picks a DB, picks one of its partitions, and replicates the changes from the primary to all the secondaries. This keeps all the copies of each partition of the DB in sync (see the sketch below).
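The primary-to-secondary sync can lean on CouchDB's own replicator by POSTing to /_replicate on the physical nodes; the scheduling policy (which DB and partition to pick, and how often) is left open, as in the description above.

```python
import requests

def sync_partition(topology_doc, partition):
    """Replicate one partition from its primary replica to each secondary,
    using CouchDB's POST /_replicate API on the secondary's physical node."""
    info = topology_doc["partitions"][partition]
    primary_machine = info["replicas"][info["primary"]]
    source = "http://%s/%s" % (primary_machine, partition)
    for vnode, machine in info["replicas"].items():
        if vnode == info["primary"]:
            continue                                         # skip the primary itself
        body = {"source": source, "target": "http://%s/%s" % (machine, partition)}
        requests.post("http://%s/_replicate" % machine, json=body).raise_for_status()
```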
Client data sync
Let's say the client also has a local DB, which is replicated from the cluster. This is important for occasionally-connected scenarios, where the client may be disconnected from the cluster for a period of time and work against the local DB in the meantime. Later, when the client connects back to the cluster, the data in the local DB and in the cluster needs to be synchronized.
To replicate changes from the local DB to the cluster ...
- Client starts a replicator and sends POST /_replicate with {source: "http://localhost/localdb", target: "http://proxy/dbname"}
- The replicator, which has remembered the last seq_num of the source from the previous replication, extracts all the changes of the localDB since then.
- The replicator pushes these changes to the Proxy.
- The Proxy examines the list of changes. For each changed document, it calls "select_partition(docid)" to determine the partition, then looks up the primary copy of that partition and the physical node that hosts this virtual node.
- The Proxy pushes the changed document to that physical node. In other words, the primary copy in the cluster receives the changes from the localDB first; these changes are replicated to the secondary copies at a later time.
- When complete, the replicator updates the seq_num for the next replication (see the sketch below).
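On the client side, the replicator can read the local changes with CouchDB's GET /db/_changes?since=N feed and push each changed document through the proxy, which routes it by select_partition as described above. The URLs and checkpoint handling below are illustrative assumptions.

```python
import requests

def push_local_changes(local_url, proxy_url, last_seq):
    """Read local changes since the last checkpoint and push them through the proxy.
    Conflict/rev handling is omitted in this sketch."""
    changes = requests.get("%s/_changes" % local_url,
                           params={"since": last_seq, "include_docs": "true"}).json()
    for change in changes["results"]:
        doc = change["doc"]
        # The proxy routes this PUT to the primary replica of the right partition.
        requests.put("%s/%s" % (proxy_url, doc["_id"]), json=doc).raise_for_status()
    return changes["last_seq"]          # new checkpoint for the next replication
```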
To replicate changes from the cluster to the local DB ...
- Client starts the replicator, which has remembered the last "seq_num" array of the cluster. The seq_num array contains the seq_num of each virtual node of the cluster; it is an opaque data structure that the replicator does not interpret.
- The replicator sends a request to the Proxy to extract the latest changes, along with the seq_num array.
- The Proxy first looks up the primary of each partition, then extracts the changes from each of them using the corresponding seq_num from the seq_num array.
- The Proxy consolidates all the changes from each partition's primary copy and sends them back to the replicator, along with the updated seq_num array.
- The replicator applies these changes to the localDB and then updates the seq_num array for the next replication (sketched below).
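The pull direction can be sketched the same way. Here the seq_num array is represented as a dict keyed by virtual node, which the replicator stores but never interprets; the proxy-side changes endpoint and its request/response shape are assumptions about this design, not a standard CouchDB API.

```python
import requests

def pull_cluster_changes(proxy_url, local_url, seq_num_array):
    """Ask the proxy for changes since the per-virtual-node checkpoints, apply them to the
    local DB, and return the updated (opaque) seq_num array. The proxy endpoint shown here
    is an assumed extension of this design."""
    resp = requests.post("%s/_changes" % proxy_url, json={"seq_nums": seq_num_array}).json()
    for change in resp["results"]:
        doc = change["doc"]
        # Conflict/rev handling is omitted in this sketch.
        requests.put("%s/%s" % (local_url, doc["_id"]), json=doc).raise_for_status()
    return resp["seq_nums"]             # updated array, stored opaquely for next time
```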
6 comments:
Ricky this is flipping amazing!!!!!! You are the man!!!!
Thank you for your amazing outline of an excellent couchdb cluster.
It was great to meet you at QCon!
This outline is solid - thanks for putting it out there. I'll be curious to see if partitioning data like this makes more sense than just running huge multi-master clusters (where each master has the full data set). I have a feeling datacenter trends are moving toward wide cpu/io, big-disk servers.
Either way, I think you've tackled sharding here for sure.
If creating/updating a document results in writing the doc to only the primary virtual node in its partition, a read that follows may be directed to a secondary virtual node and produce a stale version. The periodic sync-up does not guarantee me anything; it's just a best-effort attempt to synchronize. The basic premise, "read what was written", may be broken.
You should google the term "eventual consistency".
Not sure if eventual consistency works for me, but it is all couchdb actually promises.
In a single db instance mode, couchdb guarantees that if I write a document and then read it right back I get exactly what I wrote.
When I have 2 db instances set up for replication, the same rule stands as long as for a given document I direct writes and reads to the same instance.
However, the scheme proposed above silently breaks even that minimal guarantee.
To get "read what was written" you just need to implement stick sessions of some kind. Facebook does this by setting a cookie when users write. If your cookie is younger than N seconds, all your reads are directed to the write-master. Yes, it's a leak in the abstraction, but then again, so is time itself. :)