- It calls for a more relaxed data consistency model
- It provides primitive querying and searching capability
Here I would like to take a stab at the second issue.
So what is the problem?
Many of the NoSQL DBs today are based on the DHT (Distributed Hash Table) model, which provides hashtable access semantics. To access or modify any object, the client is required to supply the primary key of the object, and the DB will then look up the object using an equality match against the supplied key.
For example, if we use a DHT to implement a customer DB, we can choose the customer id as the key. We can then get/set/operate on any customer object if we know its id:
- cust_data = DHT.get(cust_id)
- DHT.set(cust_id, modified_cust_data)
- DHT.execute(cust_id, func(cust) {cust.add_credit(200)})
The problem arises when we want to look up objects by attributes other than the primary key. Using our customer DB example, we may want to ...
- Lookup customers within a zip code
- Lookup customers whose income is above 200K
- Lookup customer using keywords "chief executives"
It seems to me that the responsibility of building an indexing and query mechanism lands on the NoSQL user. Therefore, I want to explore some possible techniques to handle this.
Companion SQL-DB
A very straightforward approach to provide querying capability is to augment the NoSQL DB with an RDBMS or TextDB (for keyword search), e.g. we add the metadata of the object into an RDBMS so we can query the metadata using standard SQL queries.
Of course, this requires the RDBMS to be large enough to store the searchable attributes of every object. Since we only store the attributes required for search, rather than the whole object, in the RDBMS, this turns out to be a very practical and common approach.
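As a rough illustration, here is a minimal sketch of the companion-DB approach in Ruby-like pseudocode. The sql connection, the customer_meta table and its columns are assumptions made up for this example; the DHT calls follow the get/set API shown earlier.

# Save: the full object goes into the DHT, while only the searchable
# attributes (plus the primary key) go into the companion RDBMS.
def save_customer(cust)
  DHT.set(cust.id, cust)
  sql.execute("INSERT INTO customer_meta (id, zipcode, income) VALUES (?, ?, ?)",
              cust.id, cust.zipcode, cust.income)
end

# Query: use SQL on the metadata to find the matching ids,
# then resolve each id back to the full object via the DHT.
def find_customers_by_zipcode(zipcode)
  ids = sql.query("SELECT id FROM customer_meta WHERE zipcode = ?", zipcode)
  ids.map { |id| DHT.get(id) }
end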
Scatter/Gather Local Search
Some NoSQL DBs provide an indexing and query processing mechanism within each local DB. In this case, the query processor can broadcast the query to every node in the DHT, where a local search is conducted, and the results are sent back to the query processor, which aggregates them into a single response.
Notice that the search is happening in parallel across all nodes in the DHT.
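A minimal sketch of scatter/gather, assuming a hypothetical DHT.nodes call that enumerates the member nodes and a local_search method exposed by each node's local query processor:

# Broadcast the query to every node, run the local searches in
# parallel, then aggregate the partial results into one response.
def scatter_gather(query)
  partials = DHT.nodes.map do |node|
    Thread.new { node.local_search(query) }
  end
  partials.map(&:value).flatten   # Thread#value waits for each node's result
end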
Distributed B-Tree
B+Tree is a common indexing structure used in RDBMSs. A distributed version of the B+Tree can also be used in a DHT environment. The basic idea is to hash the searchable attribute to locate the root node of the B+Tree. The "value" of the root node contains the ids of its child nodes, so the client can then issue another DHT lookup call to find a child node. Continuing this process, the client eventually navigates down to the leaf node, where the object id matching the search criteria is found. The client then issues one more DHT lookup to extract the actual object.
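A sketch of that navigation loop, assuming each B+Tree node is itself stored in the DHT under its node id, that the root's id is derived by hashing the searchable attribute's name (hash_of is an assumed helper), and hypothetical leaf?, child_id_for and find methods on the fetched nodes:

# Walk from the root of the distributed B+Tree down to the leaf,
# issuing one DHT lookup per level, then fetch the matched object.
def btree_lookup(attribute, search_key)
  node = DHT.get(hash_of(attribute))              # root node of the index
  until node.leaf?
    node = DHT.get(node.child_id_for(search_key)) # descend one level
  end
  object_id = node.find(search_key)               # leaf stores the object id
  DHT.get(object_id)                              # final lookup for the object itself
end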
Caution is needed when a B+Tree node is updated due to a split or merge caused by object creation and deletion; this should ideally be done in an atomic fashion. This paper from Microsoft, HP and Toronto U describes a distributed transaction protocol to provide the required atomicity. Distributed transactions are expensive, but their use here is justified because B+Tree updates rarely involve more than a single machine.
Prefix Hash Table (distributed Trie)
A Trie is an alternative data structure in which every path from the root spells out a prefix of the key; every node in the Trie covers all the data whose key is prefixed by it. Berkeley and Intel Research have a paper describing this mechanism.
1. Lookup a key
To locate a particular key, we start with its one-digit prefix and do a DHT lookup to see if we get a leaf node. If so, we search within this leaf node, as we know the key must be contained inside. If it is not a leaf node, we extend the prefix with an extra digit and repeat the whole process.
# Locate the key next to the input key
def locate(key)
  leaf = locate_leaf_node(key)
  return leaf.locate(key)
end

# Locate the leaf node containing the input key by trying
# successively longer prefixes of the key
def locate_leaf_node(key)
  for i in 1..key.length
    node = DHT.lookup(key.prefix(i))   # was key.prefix(n): use the loop variable
    return node if node.is_leaf?
  end
  raise "no leaf node found for key"
end
2. Range Query
Performing a range query can be done by first locating the leaf node that contains the start key and then walking in ascending order until we exceed the end key. Note that we can walk across leaf nodes by following the leaf node chain.
def range(startkey, endkey)
  result = Array.new
  leaf = locate_leaf_node(startkey)
  while leaf != nil
    result.concat(leaf.range(startkey, endkey))
    if leaf.largestkey < endkey
      leaf = leaf.nextleaf   # keep walking the leaf node chain
    else
      leaf = nil             # we have passed the end key, stop
    end
  end
  return result
end
To speed up the search, we can use a parallel search mechanism. Instead of walking from the start key sequentially, we can find the common prefix of the start key and the end key (since we know all the results lie under its subtree) and perform a parallel search over the leaf nodes of this subtree, as sketched below.
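A sketch of the parallel variant, reusing the PHT helpers above plus a hypothetical DHT.leaves_under(prefix) call that enumerates the leaf nodes in the subtree rooted at a given prefix:

# All results for [startkey, endkey] live under the subtree rooted at
# the common prefix, so each of its leaves can be searched in parallel.
def parallel_range(startkey, endkey)
  prefix = common_prefix(startkey, endkey)
  DHT.leaves_under(prefix).map do |leaf|
    Thread.new { leaf.range(startkey, endkey) }
  end.map(&:value).flatten
end

def common_prefix(a, b)
  i = 0
  i += 1 while i < a.length && i < b.length && a[i] == b[i]
  a[0, i]
end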
3. Insert and Delete keys
To insert a new key, we first identify the leaf node that should contain it. If the leaf node has available capacity (fewer than B keys), we simply add the key there. Otherwise, we split the leaf node into two branches and redistribute its existing keys to the newly created child nodes.
To delete a key, we similarly identify the leaf node that contains it and remove it there. This may leave the parent node with fewer than B + 1 keys, in which case some child nodes may need to be merged.
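A sketch of the insert path, assuming a leaf capacity of B keys and a hypothetical split! method that turns a full leaf into an internal node with two child leaves and redistributes its keys by the next prefix digit (the delete/merge path is symmetric):

# Insert into the PHT: add to the leaf if it still has room,
# otherwise split the leaf and retry so the key lands in a child.
def insert(key, value)
  leaf = locate_leaf_node(key)
  if leaf.size < B
    leaf.put(key, value)
  else
    leaf.split!         # leaf becomes an internal node with two children
    insert(key, value)  # retry; the key now belongs to one of the new leaves
  end
end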
Combining Multiple Search Criteria
When we have multiple criteria in a search, each criterion may use a different index that resides on a different set of machines in the DHT. Multiple criteria can be combined using boolean operators such as OR / AND. Performing an OR operation is very straightforward, because we just union the results of the individual index searches, each performed separately. Performing an AND operation is trickier, because we need to deal with the situation where each individual criterion has a large number of matches but their intersection is small. The challenge is: how can we efficiently perform an intersection between two potentially very large sets?
One naive implementation is to send all matched object ids of each criterion to a server that performs the set intersection. If each data set is large, this approach consumes a large amount of bandwidth for sending all the candidate object ids across the network.
A number of techniques are described in this paper.
1. Bloom Filter
Instead of sending the whole set of matched object ids, we can send a much more compact representation called a "Bloom filter", which can be used to test set membership. The test gives zero false negatives but has a controllable probability p of false positives.
To minimize bandwidth, we typically pick the machine with the larger set as the sending machine and perform the intersection on the receiving machine, which holds the smaller set.
Notice that false positives can actually be eliminated completely by sending the matched results of set2 back to the set1 machine, which double-checks membership against set1. In most cases, though, 100% precision is not needed and a small probability of false positives is acceptable.
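A sketch of the Bloom-filter intersection, assuming a hypothetical BloomFilter class with add and include? methods (no false negatives, tunable false-positive rate p). The filter is built on the machine holding the larger set and shipped to the machine holding the smaller set:

# Sending side: summarize the larger set as a compact Bloom filter.
def build_filter(larger_set, bits, num_hashes)
  filter = BloomFilter.new(bits, num_hashes)
  larger_set.each { |id| filter.add(id) }
  filter
end

# Receiving side: keep only the ids that might be in the larger set.
# The survivors may contain false positives; an optional second round
# sends them back to the sender for an exact membership check.
def probable_intersection(smaller_set, filter)
  smaller_set.select { |id| filter.include?(id) }
end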
2. Caching
It is possible that certain search criteria are very popular and get issued over and over again. The corresponding Bloom filters of these hot spots can be cached on the receiving machine. Since a Bloom filter has a small footprint, we can cache the Bloom filters of many popular search criteria.
3. Incremental fetch
In case the client doesn't need the full set of matched results, we can stream the data back to the client in a cursor mode. Basically, on the sending side, set1 is sorted and broken into smaller chunks, with a Bloom filter computed and attached to each chunk. On the receiving side, every element of set2 is checked against each chunk's Bloom filter.
Notice that we save computation on the sending side (we compute a Bloom filter per chunk rather than for the whole of set1) at the cost of doing more work on the receiving side (since we need to re-check the whole of set2 against each chunk of set1). The assumption is that the client only needs a small subset of all the matched data.
An optimization we can do is to mark the key range covered by each chunk of set1 and have the receiving side skip the set2 objects that fall outside that range.
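A sketch of the incremental fetch, again assuming the hypothetical BloomFilter class: set1 is sorted and cut into chunks, each chunk ships with its own small filter plus its key range, and the receiver only tests the part of set2 that falls inside that range:

# Sending side: yield one (low, high, filter) triple per chunk of set1,
# so chunks can be streamed to the receiver on demand.
def each_chunk_filter(set1, chunk_size, bits, num_hashes)
  set1.sort.each_slice(chunk_size) do |chunk|
    filter = BloomFilter.new(bits, num_hashes)
    chunk.each { |id| filter.add(id) }
    yield chunk.first, chunk.last, filter
  end
end

# Receiving side: skip set2 ids outside the chunk's range, test the rest.
def match_chunk(set2, low, high, filter)
  set2.select { |id| id >= low && id <= high && filter.include?(id) }
end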
9 comments:
Interesting post. But I don't think it's useful to lump all NOSQL options into a DHT data model. There's a bunch of implementations out there with data models that are much more powerful than key-value pairs, such as the column family / BigTable guys (Hypertable, HBase, Cassandra), document stores (Couch, Mongo, Riak) and graph databases (Neo4j, AllegroGraph, VertexDB).
When it comes to NOSQL data models, I actually think the trend is going in the opposite direction: people realize they don't really need the extreme Google-size scalability of the pure key-value stores (very few do) and go for a NOSQL option with a data model that is more powerful and therefore can represent their data sets more naturally and express their queries more easily. I wrote about this tradeoff in scaling to size vs scaling to complexity.
In the Neo4j project we've observed three things on this topic:
1. When you write an application, you almost exclusively have static queries. You know at development time that in this module, you'll be handed a customer id from the session and you wanna fetch all shopping carts for that customer. When the query is known ahead of time like this (i.e. a "static query"), the preferred way to express it is through an API: it's native to your programming language, you have compiler support, IDE/tool support, typesafety if you choose to etc. None of which is true of embedded strings of some query language.
2. Dynamic queries in applications are almost always of a "Google search text field" type. Now, this is the domain of full text search engines, and Lucene, Solr, Sphinx, etc. are your friends here.
3. Finally, where you do want dynamic queries is for reporting and for runtime introspection of your data for operations purposes (i.e. when the shit has hit the fan and you want to live introspect what's going on).
I hope this article discusses use case #3. Too many still assume a dynamic query language should be used for #1 & #2, which I think is simply an artifact of how we've used SQL all these years.
In Neo4j, we use a native Java/Ruby/Python/Scala/Clojure API for use case #1, have nice integration with for example Lucene for #2 and use SPARQL (an SQL-like query language for graphs, standardized by w3c) for use case #3. Works out really well.
-EE
@Ricky
Definitely great post!
In my opinion, scatter/gather (AKA map/reduce) and locally cached queries (with possibly stale data) are probably the way to go when implementing distributed queries capabilities.
I'm still pretty skeptical about the efficiency of distributed data structures (such as btrees), in particular because of the burden in maintaining them.
@Emil
Interesting insights.
However, I somewhat disagree with your first point regarding "static queries".
What you say is probably true when talking about Neo4j and other products providing rich query APIs (such as, guess what, RDBMS products).
However, when talking about more general NoSQL products, I think the best approach would be to de-normalize your data and provide a "fast" access path to your query: in your example, it would mean storing the related shopping cart data inside your customer "document".
My two euro cents,
Cheers,
Sergio B.
Ricky,
Awesome entry. Just to let you know that Hazelcast is an in-memory DHT and it has SQL-like query and indexing. Here is how indexing and query work: http://code.google.com/docreader/#p=hazelcast&s=hazelcast&t=MapQuery
Good post. You may want to see this paper on the GT.M NoSQL database: http://gradvs1.mgateway.com/download/gtm_4_python.pdf particularly chapter 3 where the standard relational SQL techniques are compared with the equivalent techniques you'd use in GT.M
Surely the lack of support for complex query is dependent on the choice of storage.
The storage / access analogies in NoSQL frameworks indicated in your article range from simple name-value pairs to the virtual "tables" in big-table.
Looking at the GQL query syntax all conditions are limited to facilitating data being retrieved out via a single forward only "table scan".
Surely it would be difficult to model and query complex data without denormalising, duplicating data. Maybe something like big-table is much better suited to read-only operation, not throughput.
Certainly would be wanting to use as much of memcache for session or volatile shared data with app-engine.
If you want both to model and query greater complexity perhaps an extension to name value pairs is to go multidimensional.
There are very well established implementations for these, for example the "multi value" community http://en.wikipedia.org/wiki/MultiValue.
GT.M is also a well established open source, schemaless, NoSQL storage engine which has a new "name-value pair" view provided by M/DB. But under the bonnet it is multidimensional.
This would give you the ability to have tables-within tables-within tables, ie: pre-joined. Querying can be done efficiently by what I just call "walking the storage".
Another side-effect of multidimensional is that the storage can be both the storage and the index.
SHARDING
One thing that is really important to any NoSQL query support is how do you deal with sharding.
If you look at SimpleDB on Amazon, a shard can only occupy 10GB.
If you look at BigTable, either your records are in one entity "shard" or in another.
Cache can scale a logical shard transparently across 4 Terabyte segments. So you are limited to the amount of storage you can physically attach to a node.
GT.M will support a shard in excess of 1 Terabyte.
PARALLELISM
If your data exceeds your shard size, then query support needs to provide parallelism across the nodes hosting the shards.
For example parallel querying across shards of the same logical "name-value table" / "entity".
CONSISTENCY
In your article you assert that relaxed data consistency is necessary.
In Google big table the data is consistent and supports high availability via data redundancy, but it has relatively slow insert speed.
In CouchDB consistency is eventual (?)
In Hadoop the consumer node copies data in the processing cycle. Facilitating redundancy on demand.
In GT.M a shard can support both shard atomicity and availability via replication.
In Cache a shard can support consistency across many nodes dependent on proximity and bandwidth (ECP)
So I think it is difficult to generalise on a strategy for query. But would indicate that some NoSQL platforms already facilitate very efficient means of achieving complex query. The difficulty is introducing parallelism across shards and collating results.
Maybe it would be useful to also consider that in a large distributed dataset the complex query may take a long time, so you may want to support an asynchronous query pattern, in addition to recoverability, should an operation fail on a small number of shard nodes. Maybe you want queries to partially pause as you undertake some maintenance of one of the data nodes?
Hope this helps
@ricky - Great discussion of the NOSQL scene. We're trying to reconcile a "best of all worlds" approach that brings the "consumability" of SQL to the speed and flexibility of NOSQL.
@emil - I have to disagree strongly on the "static queries" assumption. In fact, if anything, I would argue the polar opposite. Applications need to become much more dynamic and adaptable, without the traditional code/compile/deploy cycle. I am a fan of neo4j's concepts, but really would like to see some activity around APIs for simplifying the querying process. Perhaps a few examples using SPARQL would be helpful?
Bloom filters are not a panacea. To have a reasonably small false positive rate, the size of a filter (in bits) must be at least 8-10x the number of keys. That gives us approx 1Gb of filter data for every 1 billion keys. In many situations it does not make sense unless your key subset is comparable in size to the whole key set or the average key length is pretty large.
Cassandra actually allows you to do range scan queries in some cases, which is extremely helpful. In practice, with range scans, and by manually populating "secondary indexes" you can avoid the need for these kinds of techniques.
Additionally the model given by Cassandra (and some others) can be used to implement various types of many:many relationship without needing to do key range scans - essentially add the foreign keys as "columns" of a single key.
Please check out Bangdb (http://www.iqlect.com), the new key value store, which seems to be very fast in terms of IOPS for both read and write. Bangdb will be in many flavors, for ex; embedded, network, elastic cache/imdg. Being crash proof, with many configuration parameters, it can be tuned to operate in the most suitable fashion for a given requirement.