They should. NoSQL is fast, resilient, and often cheaper than conventional databases. Plus, it’s the backbone of many Web 2.0 sites.
NoSQL systems are still in an early phase, but already they’ve played a key role in the development of the massive social networking sites Twitter and Facebook, as well as the social game Farmville. Their speed and reliability suit a Web world, an appeal that will only grow with the mobile Web.
They don’t pretend to do everything a relational database does, and a mere 5% of companies are using or piloting them, our survey of 755 business technology pros finds. Heck, 44% of survey respondents hadn’t even heard of NoSQL. But we think it’s inevitable that NoSQL will play a role alongside conventional database systems in many enterprises.
In many cases, NoSQL systems are bringing data management techniques into the enterprise that originated to meet the needs of Web businesses such as Google, eBay, and Amazon, which were generating data at rates that choked relational databases.
Not since Edgar Codd, a database pioneer at IBM, issued his “12 rules” of relational databases in the 1980s has the market been in such ferment. Startups are popping up to commercialize the open source NoSQL projects, including Cassandra, MongoDB, and CouchDB. Suppliers of the tools surrounding conventional database systems are getting in on the act as well. For example, Quest Software’s Toad, typically used to automate Oracle, IBM DB2, and Microsoft SQL Server administrative tasks, is now available for Cassandra, an open source project that began at Facebook. Embarcadero Technologies is extending its relational tools to support NoSQL systems. While enterprise use is tiny today, 22% of the IT pros we surveyed say they’re interested but need to learn more.
Once Eric Evans, while a systems developer at Rackspace, used the term NoSQL to describe large, clustered but nonrelational data management systems, this trend had the provocative name it needed to take on “movement” proportions. Projects that use NoSQL approaches include Membase Server, Scalaris, LightCloud, and RavenDB. Pincaster applies the NoSQL approach to a data system for applications using geographical data.
Make no mistake, NoSQL systems aren’t as good as relational databases at many things. Relational systems, with their strictly defined properties, can start a multistep transaction and guarantee the integrity of its data upon completion–no mean feat in a world of rapidly changing data. Relational databases pass the ACID test–atomicity, consistency, isolation, and durability of transactions. NoSQL can’t.
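To make the atomicity part of the ACID test concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The accounts table, the names in it, and the transfer helper are assumptions made up for illustration. The point is only that a multistep transaction either completes entirely or leaves the data untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds in one transaction; True on commit, False on rollback."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account fails on its second step, and the credit made in the first step is rolled back with it, so no money appears or disappears.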
But VoltDB, the latest startup from Michael Stonebraker, points to where these two worlds may meet. VoltDB is a relational database that handles millions of transactions a second and still meets the ACID test by distributing both the database and the data across a server cluster. That’s similar to how the NoSQL Cassandra, Redis, and MongoDB distribute themselves to manage high volumes of unstructured data.
What Defines A NoSQL System
NoSQL is a bit of a misnomer. NoSQL systems can use the SQL data access language. But what NoSQL conveys is that these systems are doing something very different; they’re avoiding going to disks to answer complex queries. Frequently, they have presorted the data and loaded it into the RAM of several servers in a cluster. All the memory across a server cluster is treated as a pool. That structure allows nearly instantaneous access to large amounts of data. Queries are directed by MapReduce-style guidance systems to the data and an associated processor. Governing overhead is kept to a minimum.
NoSQL systems are designed for clusters, so they tolerate hardware failures by keeping backup copies of the data on separate physical servers. When a failure occurs, a query is switched to a backup. In contrast, relational systems scale best when they’re put on a larger and more expensive server. Hardware failure means total system failure, although mission-critical systems may keep a mirrored copy running nearby.
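The failover behavior described above can be sketched in a few lines. This is a toy model, not the mechanism of any particular product: the Node and ReplicatedStore classes, the two-copy default, and the CRC-based placement are all assumptions chosen for illustration.

```python
import zlib


class Node:
    """One server in the cluster, holding its share of the data in memory."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


class ReplicatedStore:
    """Writes each key to several nodes; reads fall back to a backup copy."""

    def __init__(self, nodes, copies=2):
        self.nodes = nodes
        self.copies = copies

    def _replicas(self, key):
        # Deterministic placement: hash the key, then take consecutive nodes.
        start = zlib.crc32(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.copies)]

    def put(self, key, value):
        for node in self._replicas(key):
            node.data[key] = value

    def get(self, key):
        for node in self._replicas(key):
            try:
                return node.get(key)
            except ConnectionError:
                continue  # this copy is unreachable, try the next one
        raise ConnectionError("all replicas are down")
```

If the first node holding a key is marked as down, a read is answered from the second copy and the caller never notices the failure.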
NoSQL systems perform write, delete, and update functions on unstructured data, much as relational systems work on structured tables. They sort, merge, and rearrange data chronologically, or any way Boolean rules allow. They scale out easily on standard x86 hardware. The data in these systems can be accessed by many languages–Erlang, anyone?–and is much more accessible to application developers than data in relational systems.
Often, NoSQL systems are being built by Web site developers or cloud service providers, not by database architects and administrators. They also relax the rules on consistency: a request for information may arrive before a recent change has reached every copy of the data, so it may see the old value, but the update takes place “eventually.”
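A small sketch shows what “eventually” means in practice. The EventuallyConsistentStore class and its explicit propagate step are assumptions for illustration; real systems propagate updates in the background rather than on demand.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes land on a primary copy and reach the replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # updates not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_replica(self, key):
        # May return a stale value (or None) until propagate() has run.
        return self.replica.get(key)

    def propagate(self):
        # Apply queued updates in the order they were written.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value
```

A read from the replica immediately after a write can miss the new value. Once the queued updates have been applied, every copy agrees.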
What NoSQL lacks in precision it makes up for in speed. Companies facing impatient Web customers have a fleeting moment to exchange information, and NoSQL systems are meant to sift many terabytes of stored data to provide useful information almost instantly. Cassandra came out of Facebook in 2008, after Facebook had outgrown MySQL.
Companies Behind NoSQL
These properties make CouchDB ideal for mobile devices synchronized to a central server, says Damien Katz, the lead developer of CouchDB and CEO of CouchOne, a 14-person startup that sells products based on CouchDB. It has a small footprint and uses a peer-to-peer system of replication–the remote and central systems frequently check with each other to see they’re in sync. Either may update the other.
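The idea that either side may update the other can be sketched with simple revision counters. This is a deliberately simplified model and not CouchDB’s API or its replication protocol, which is more involved; the Peer class and the pull and sync helpers are hypothetical, and edits made on both sides to the same revision are simply left alone here rather than resolved.

```python
class Peer:
    """A device or server holding documents as doc_id -> (revision, body)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def save(self, doc_id, body):
        # Each local edit bumps the document's revision number by one.
        rev = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (rev, body)


def pull(target, source):
    """Copy into target every document for which source has a newer revision."""
    for doc_id, (rev, body) in source.docs.items():
        if rev > target.docs.get(doc_id, (0, None))[0]:
            target.docs[doc_id] = (rev, body)


def sync(a, b):
    """Two-way replication: each peer pulls the other's newer documents."""
    pull(a, b)
    pull(b, a)
```

After a sync, a note edited on the phone shows up on the server and a price entered on the server shows up on the phone, with neither side acting as the sole master.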
Membase Server is a recent NoSQL addition from NorthScale, managing key values that tell a system where to look for a particular type of data. Membase Server uses the Memcached data management system, an open source system for scaling applications, and it converts server RAM in a cluster into a shared pool, managing both data and business logic in RAM. Membase is one of the systems behind Zynga’s blockbuster Farmville game built for Facebook users.
Zynga uses relational databases for its monetary transactions and virtual asset purchasing, says NorthScale co-founder James Phillips. It uses a NoSQL system to deal with millions of online game players simultaneously.
Home Depot is evaluating Membase for use in product presentations on its Web site, says Phillips, and JPMorgan Chase is evaluating it for its multitenant private cloud environment. “Many people want to use Membase as a place to put lots of data without first having to build a schema,” Phillips says. With Membase’s new TAP interface, users may store data in the NoSQL system, then selectively extract, transform, and load it into a relational system. TAP is a replication, querying, and indexing interface, and it’s also used to rebalance Membase across nodes when adding servers.
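The store-first, load-later pattern can be sketched as follows. This does not use Membase or its TAP interface: the kv_store dictionary stands in for a schema-less store, the products table and the load_products helper are hypothetical, and sqlite3 plays the part of the relational system.

```python
import json
import sqlite3

# Hypothetical schema-less store: keys map to JSON documents of any shape.
kv_store = {
    "product:1": json.dumps({"name": "hammer", "price": 12.5, "tags": ["tools"]}),
    "product:2": json.dumps({"name": "drill", "price": 89.0}),
    "session:9": json.dumps({"user": "guest", "cart": ["product:1"]}),
}


def load_products(store, conn):
    """Extract product documents, keep two fields, load them into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(id TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    rows = []
    for key, raw in store.items():
        if not key.startswith("product:"):  # extract: select product records
            continue
        doc = json.loads(raw)
        rows.append((key, doc["name"], doc["price"]))  # transform: fixed columns
    with conn:  # load: insert all rows in one transaction
        conn.executemany("INSERT OR REPLACE INTO products VALUES (?, ?, ?)", rows)
    return len(rows)
```

The documents were written without any schema, and a schema is imposed only on the subset that is selected for the relational side.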
Like Zynga, many enterprises will find a use for a NoSQL system to handle massive amounts of new digital data, while a SQL system executes well-defined financial transactions. “A lot of NoSQL proponents act as if it’s the only technology choice, and I think that’s wrong,” says Phillips.
MongoDB is an open source NoSQL document database used by, among others, location-based social network Foursquare. Foursquare had 100 million “check-ins” in July.
MongoDB’s latest version added auto-sharding, the ability to partition unstructured data across a large server cluster, then route queries to the right partition or shard using a key, which is stored on a central database server. Partitioning relational systems requires embedding knowledge of the partitions in the database application, and if the partitions change, the application must be changed as well. Systems like MongoDB handle that outside the application and without human intervention.
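Routing by key can be sketched as a small lookup layer that sits between the application and the shards. This is not MongoDB’s API; the ShardRouter class, the use of plain dictionaries as shards, and the choice of range partitioning on the key are assumptions for illustration.

```python
import bisect


class ShardRouter:
    """Maps each key to a shard by range, so callers never name a shard."""

    def __init__(self, boundaries, shards):
        # boundaries[i] is the smallest key owned by shards[i + 1].
        if len(shards) != len(boundaries) + 1:
            raise ValueError("need exactly one more shard than boundaries")
        self.boundaries = boundaries
        self.shards = shards

    def shard_for(self, key):
        return self.shards[bisect.bisect_right(self.boundaries, key)]

    def insert(self, key, doc):
        self.shard_for(key)[key] = doc

    def find(self, key):
        return self.shard_for(key).get(key)
```

The application calls insert and find with a key and nothing else. If the boundaries change, only the router’s table changes, which is the contrast with partitioning logic embedded in the application.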
MongoDB can store many types of data, and developers may use the data without knowing a lot about how to access it or the nature of the data itself, says Dwight Merriman, CEO of 10Gen, which sells support to MongoDB users. At Foursquare, MongoDB lets developers use geographic positioning coordinates as just another form of data.
Cassandra is something of a star among NoSQL systems, a proven key-value store system originally developed at Facebook after Facebook’s implementation of MySQL hit its limits. Facebook donated the Cassandra code as an open source project in 2008, and early this year it became a full-fledged Apache project. Cassandra borrowed its approach to column sets from Google’s BigTable, a predecessor nonrelational system.
In key-value store systems, semistructured data is stored as an object in a column, with each column having its own key reference. Like BigTable, Cassandra keeps related information close together by grouping related columns into “families.” This structure simplifies some mapping issues when spreading information over a server cluster and speeds lookups.
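The column-family layout can be pictured as nested maps. The sketch below is not Cassandra’s API or storage format; the ColumnFamilyStore class and the family, row, and column names are hypothetical and show only how related columns end up grouped under one family.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy layout: family -> row key -> column name -> value."""

    def __init__(self):
        self.families = defaultdict(lambda: defaultdict(dict))

    def insert(self, family, row_key, column, value):
        self.families[family][row_key][column] = value

    def get_row(self, family, row_key):
        # Return a copy of all the columns stored for this row in one family.
        return dict(self.families[family][row_key])


store = ColumnFamilyStore()
store.insert("profile", "user42", "name", "Ada")
store.insert("profile", "user42", "city", "London")
store.insert("activity", "user42", "last_login", "2010-08-01")
```

Reading the profile family for user42 returns the name and city columns together, while the last_login column lives in a separate family and is fetched only when asked for.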
Cassandra is used by other social networking sites, including Twitter, Reddit, Cloudkick, and Digg. All need to access large amounts of information quickly and serve them to end users. Updates occur with less urgency. The system has various failover techniques to avoid any single point of failure due to a hardware stoppage. Jonathan Ellis, a former Rackspace software architect, is the lead of the Apache project and CTO of a new firm, Riptano, established in April to provide commercial support for Cassandra.
Technically, Cassandra, MongoDB, and other NoSQL systems lack the ability to do two big things relational systems can do: perform joins of data from different tables, forming a new view of related data; and process complex transactions with constant data integrity. Financially, the companies behind these systems are small and inexperienced. There will be a lot of startup competition, and many failures, that early adopters will have to live through.
Yet NoSQL’s emergence testifies to a need the relational model isn’t meeting. The Web is generating data so fast that systems must swap in and out hundreds of terabytes at a time. Only distributed NoSQL systems look ready for the job.