I, Cringely - The Survival of the Nerdiest with Robert X. Cringely

The Pulpit


Weekly Column

Data Debasement: Cloud computing will change the way we look at databases.

By Robert X. Cringely

Last week I was in Boston to moderate a panel at the MIT Technology Review’s Emerging Technologies Conference — one of those tech shindigs so expensive I can only attend as hired help. My panel was on parallel computing and it produced this column and another I’ll file early next week. This week is about databases and next week is about threads. Isn’t this a grand time to be a nerd?

Thanks in part to Larry Ellison’s hard work and rapacious libido, databases are to be found everywhere. They lie at the bottom of most web applications and in nearly every bit of business software. If your web site uses dynamic content, you need a database. If you run SAP or any ERP or CRM application, you need a database. We’re all using databases all the time, whether we actually have one installed on our personal computers or not.

But that’s about to change.

We’re entering the age of cloud computing, remember? And clouds, it turns out, don’t like databases, at least not as they have traditionally been used.

This fact came out in my EmTech panel and all the experts onstage with me nodded sagely as my mind reeled. No database?

No database.

Parallel computing used to mean scientific computing, where hundreds or thousands of processors were thrown at technical problems to solve them faster than Moore's Law alone would have allowed. The rest of us relied on rising clock rates for our performance fix, but scientists — scientists with money — couldn't wait, so they hit on the idea of using multiple CPUs: divide a problem into tasks, run them in parallel, then glue the results back into a final answer. Parallel computing wasn't easy, but sometimes that was the whole point: doing it simply because it was so difficult. Which is probably why parallel computing remained a small industry until quite recently.

What changed was that Moore's Law put an end to the clock rate war: chips were simply getting too hot. While faster and faster chips delivered mostly linear performance increases, along with falling costs and power consumption per operation, the core temperature inside each microprocessor chip was going up at a cubic rate. Back in 2004 Intel released a chart showing that any clock speed over 5 GHz was likely to melt silicon, and that staying on the Moore's Law curve would, by 2010, make internal processor temperatures similar to those on the surface of the Sun!

For those, including me, who think that's pretty darned hot, I'll point out that one of my astronomer readers immediately had to mention that the Sun's chromosphere is actually much hotter than the surface. Forgive him, he means well.

Faced with this absolute thermal performance barrier, Intel and AMD and all the other processor companies had to give up incessant clock speed increases and get us to buy new stuff by putting more than one CPU core in each processor package. Now chips with two and four processor cores are common, and Intel hints darkly that we'll eventually see hundreds of cores per chip, which brings us right back into the 1970s and '80s and the world of parallel computing, where all those principles that seemed to have no real application are becoming very applicable, indeed.

And that’s exactly where databases start to screw up.

Bob Lozano, chief visionary, evangelist, father-of-eight (same woman) at Appistry came up with the first database example I’d heard and it was eye opening. Appistry (I’ve written about them before — it’s in the links) specializes in distributing what would normally be mainframe applications across tens, hundreds, or even thousands of commodity computers that act as one. If a government agency wanted to do meter-scale analysis of very high-resolution images from spy satellites, it might use Appistry, I’d imagine (just a guess, of course).

The database example Bob used at MIT was of an unnamed customer that is a credit card transaction processor. Their core application — the processing of credit card transactions — was happening on IBM Z-series mainframes (the biggest of big iron) and the client wanted to port the whole mess to commodity PCs.

Google did it, why not them?

But the first time they tried it at Appistry, it didn’t work.

“We hired a technical team that had done similar applications before so they started with replicating the mainframe architecture on the commodity computers, database and all,” Lozano recalled. “But when they finished it wasn’t appreciably faster than the mainframe it replaced. Even worse, it wouldn’t scale. So we fired the team and started over.”

The problem, it turned out, was in the database.

The original mainframe application worked by first receiving transactions, writing them to the database, reading them back from the database, then doing the actual processing before writing the results to the database again. That's read-write-read-process-write.

The second time through, the Appistry team tossed the database — at least for its duties as a processing platform — instead keeping the transaction, in fact ALL transactions, in memory at the same time. This turned the work flow into read-process-write (eventually). The database became more of an archive, and suddenly a dozen commodity PCs could do the work of one Z-series mainframe, saving a lot of power and money along the way.
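The two work flows can be sketched in a few lines of Python. This is a hypothetical toy, not Appistry's code: `Txn`, `process`, and `FakeDB` are all stand-ins, with `process` playing the part of the real credit card logic.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class Txn:
    id: int
    amount: float

def process(txn: Txn) -> Txn:
    # Stand-in for the real work, e.g. tacking on a 2% fee
    return Txn(txn.id, round(txn.amount * 1.02, 2))

class FakeDB:
    """Stand-in for the mainframe database."""
    def __init__(self):
        self.rows = {}
    def write(self, txn):
        self.rows[txn.id] = txn
    def read(self, txn_id):
        return self.rows[txn_id]

def mainframe_style(txn: Txn, db: FakeDB) -> Txn:
    """read-write-read-process-write: every step round-trips the database."""
    db.write(txn)             # persist the raw transaction
    stored = db.read(txn.id)  # read it right back
    result = process(stored)  # the actual work
    db.write(result)          # persist the result
    return result

def appistry_style(txn: Txn, archive: Queue) -> Txn:
    """read-process-write (eventually): the transaction stays in memory;
    a background writer drains the queue to the database as an archive."""
    result = process(txn)
    archive.put(result)
    return result
```

Both functions produce the same answer; the difference is that the second one touches the database zero times on the critical path, which is where the dozen-PCs-beat-a-mainframe speedup comes from.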

If this sounds like a risky way to do business — not writing the data to disk until things slow up enough — remember that’s the way Google runs its search engine and why it is so darned fast. Google has THE ENTIRE INTERNET IN MEMORY AT ONCE. If the application slows down they just add more hardware.

This is good news for cloud computing and bad news for mainframes, because systems like Appistry and its competitors (there are several) are going to eventually bury the mainframe by putting the “cloud” into cloud computing. Suddenly a Storage Area Network with a relatively weak database controller is good enough for archiving while the parallel or even massively parallel cloud does the real work.

Later this month Microsoft will announce a cloud version of Windows Server and it will be very interesting to see how it handles database integration and dis-integration.

The database problem is much more than just slow reads and writes. Relational databases also create false dependencies between pieces of data. Dependencies of any kind break parallelism, and therefore make an application hostile to commodity platforms. That is, if one chunk of data (A) is dependent on another chunk of data (B), then no work can be done on A until all work on B is complete. If the dependency is real, like when A and B are both withdrawals from the same bank account, then there are hacks we can try like one I will describe in my next column, but most programmers just choose to have a cup of coffee and wait for B to finish.

But if these are withdrawals from different bank accounts, or maybe even different banks, then no true dependency exists. Unfortunately, if all of this is stored in a single relational database, we unintentionally create a false dependency, since that database can only handle a fairly limited number of operations concurrently — we've created a bottleneck that will choke the application.
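The distinction between real and false dependencies can be made concrete. In this hypothetical sketch, only same-account withdrawals are serialized (the real dependency, via a per-account lock); withdrawals against different accounts run freely in parallel, instead of all queuing behind one database:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

balances = {"alice": 100, "bob": 100}
# One lock per account, NOT one lock (or one database) for everything:
locks = {account: Lock() for account in balances}

def withdraw(account: str, amount: int) -> bool:
    with locks[account]:  # real dependency: same-account ordering
        if balances[account] >= amount:
            balances[account] -= amount
            return True
        return False

# Withdrawals against different accounts share no data, so running
# them on a thread pool introduces no false dependency.
with ThreadPoolExecutor() as pool:
    list(pool.map(lambda args: withdraw(*args),
                  [("alice", 30), ("bob", 30), ("alice", 30), ("bob", 30)]))
```

Routing every withdrawal through a single shared lock would model the single-database bottleneck: correct, but effectively single-threaded.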

While the database guys are busy figuring out how to add more and more concurrency internally, take a few steps back and picture a large set of commodity boxes all executing a single data-munching app: no matter how sophisticated the database gets, it is still effectively a single thread to that app.

A traditional response is to pour dollars into the data tier: buy faster, more concurrent SANs, better interconnects, and bigger database servers. That works up to a point and pleases Sun and IBM no end.

Somewhat more helpful, though, are data grid products like Oracle's Coherence, IBM's eXtreme Scale, or Appistry's Fabric Accessible Memory (FAM). For many applications these can keep each chunk of data in memory in the middle tier for more of its lifetime — ideally on the same boxes where the big data-munching app resides. But this still doesn't completely solve the problem, because there remain limits on how far the relational database behind it all can scale.

Here’s how Google attacks this problem, which goes beyond simply keeping the entire Internet in memory. The problem, of course, is that you can’t keep the entire Internet in memory in every server because then you’d need more memory chips than even Google can afford to buy.

To scale the Google search service, they recognized that many large problems don't intrinsically require doing actions one at a time. But Google first had to free itself of the false dependencies. So they coined the term MapReduce and created both a set of operations and a way to store the data for those operations natively, all while preserving the natural independence inherent in each problem, building the whole mess atop the remarkable Google File System, which I'll cover some other day.
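The shape of MapReduce can be shown in miniature. This toy Python version is obviously not Google's implementation, but it has the three phases: a map step that emits key/value pairs independently per input, a shuffle that groups values by key, and a reduce per key — and because the map calls and the per-key reduces share no state, each phase can be spread across thousands of machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(doc: str):
    """Emit (word, 1) pairs; each document is mapped independently."""
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce each key's values independently (here: sum the counts)."""
    return {k: reduce(lambda a, b: a + b, vs) for k, vs in groups.items()}

docs = ["the cloud", "the database", "the cloud again"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
```

The word counts come out the same whether the maps run on one box or a thousand — that independence is exactly the freedom from false dependencies described above.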

Google led the way but many other companies have followed suit, opening doors to a wide range of new ways of thinking about large-scale data manipulation. Suddenly there are different ways to store the data, new ways to write applications, and new places (thousands of cheap boxes) to run such applications.

What this does for Larry Ellison and his libido is a great question, because it looks like he’s bought up most of the traditional database-centric software industry just in time for it to be declared obsolete.

Sorry Larry.

Comments from the Tribe

Status: [CLOSED] read all comments (48)

Think of the overhead and the layers of cruft needed to make a web app with a data store back-end...

How about HTML->[language]->database? Hardly 'cruft', especially since the database can guarantee integrity so different languages and front-ends can talk to it without risking corruption.

We never question this assumption... but, I think the ugly duckling is clearly SQL - we need to throw that away, rather than extending it

Here I find myself in agreement, but for completely different reasons. SQL is a *bad* implementation of relational logic, and better languages have been designed. The key is not the language but the logic. No matter how you implement it, if you want to be able to make set-based constraints and relationships, you need something 'relational'.

Replace it with a legit modern programming language [be it python/arc/lisp/ocaml/ruby/F#...] which has succinct, readable ways to express operations on data. Have an event mechanism built into the language - you won't need to have 12 flavours of stored procedure dialects just to run some code when data changes, or new data arrives.

There is a better way.

And here we have the complete face-palm moment. "Succinct... readable..."; how much more succinct can you get than "SELECT a FROM b WHERE condition"? (OK, some of the non-SQL relational languages let you leave out the "SELECT"...). Simply put, again: it's NOT about the language, it's about whether you need to enforce certain logic on your data. If you did that with Ruby, you would have... a Relational system--albeit an insanely slower one than the heavily-researched and optimized ones you have nowadays.
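Morris's point is easy to test. Here is the query he cites next to a rough Python equivalent (the table `b` and its columns are made up for illustration); neither wins much on brevity, which is exactly his argument that the language matters less than the relational logic behind it:

```python
# Hypothetical table: SELECT a FROM b WHERE cond
b = [
    {"a": 1, "cond": True},
    {"a": 2, "cond": False},
    {"a": 3, "cond": True},
]

# The same selection as a Python list comprehension
result = [row["a"] for row in b if row["cond"]]
```

What the comprehension lacks is everything around the query: the database's ability to enforce constraints on `b` no matter which language or front-end touches it.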

Rick Morris | Oct 10, 2008 | 3:09PM

The biggest fallacy I see being discussed here is what C.J. Date and colleagues refer to as the logical/physical confusion.

The "Relational Model" is a term originally meant to describe a theory of working with data wherein the user or developer can access and control data simply by making declarative logical statements. There is nothing in that theory that requires a certain programming model to the underlying engine, or even whether the data need be written to disk. Those are questions of implementation. There is no instruction as to whether threads are allowed. In fact, the 'transaction' is not really a part of the relational model. Those are all just issues of what mechanisms the database engine programmers use to achieve the logical goals of the model. Ditto for the choice of language.

The only intent of the relational model was to provide a set of axioms to create a data management engine which a) provides a consistent logical method of dealing with data, and b) can guarantee whatever level of data integrity is desired by the user (example: don't allow an invoice to include items that are not in stock, etc...). How the DBMS accomplishes this goal is up to the developer. Ergo there is no need to foresee the death of the relational database, but rather to look for the interesting methods used to meet new challenges.
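Morris's in-stock example can be sketched to show what "declarative" means here. In this hypothetical fragment the rule is stated once as a predicate; how (or where, or on what hardware) an engine enforces it is the implementation detail his comment says the relational model deliberately leaves open:

```python
stock = {"widget": 5, "gadget": 0}

def in_stock(invoice: dict) -> bool:
    """Declarative rule: every invoiced item must be in stock."""
    return all(stock.get(item, 0) >= qty for item, qty in invoice.items())

def add_invoice(invoices: list, invoice: dict) -> None:
    # The engine may enforce the rule however it likes;
    # the user only ever states the rule.
    if not in_stock(invoice):
        raise ValueError("integrity violation: item not in stock")
    invoices.append(invoice)

invoices = []
add_invoice(invoices, {"widget": 2})      # accepted
try:
    add_invoice(invoices, {"gadget": 1})  # rejected: out of stock
except ValueError:
    pass
```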

Rick Morris | Oct 10, 2008 | 3:45PM

Bob, thanks for the explanation. I think you need to be a bit more explicit in that 3rd to last paragraph.

Dave | Oct 10, 2008 | 5:29PM