I, Cringely - The Survival of the Nerdiest with Robert X. Cringely

The Pulpit


Weekly Column

Data Debasement: Cloud computing will change the way we look at databases.

By Robert X. Cringely
bob@cringely.com


Last week I was in Boston to moderate a panel at the MIT Technology Review's Emerging Technologies Conference -- one of those tech shindigs so expensive I can only attend as hired help. My panel was on parallel computing and it produced this column and another I'll file early next week. This week is about databases and next week is about threads. Isn't this a grand time to be a nerd?

Thanks in part to Larry Ellison's hard work and rapacious libido, databases are to be found everywhere. They lie at the bottom of most web applications and in nearly every bit of business software. If your web site uses dynamic content, you need a database. If you run SAP or any ERP or CRM application, you need a database. We're all using databases all the time, whether we actually have one installed on our personal computers or not.

But that's about to change.

We're entering the age of cloud computing, remember? And clouds, it turns out, don't like databases, at least not as they have traditionally been used.

This fact came out in my EmTech panel and all the experts onstage with me nodded sagely as my mind reeled. No database?

No database.

Parallel computing used to mean scientific computing, where hundreds or thousands of processors were thrown at technical problems in order to solve them faster than Moore's Law might otherwise have allowed. The rest of us were relying on rising clock rates for our performance fix, but scientists -- scientists with money -- couldn't wait so they came up with the idea of using multiple CPUs to solve problems that were divided into tasks done in parallel then glued back together into a final result. Parallel computing wasn't easy, but sometimes that was the whole point -- to do it simply because it was so difficult. Which is probably why parallel computing remained a small industry until quite recently.
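If you've never seen that divide-and-glue pattern up close, here's a minimal sketch in Python (my own, purely illustrative -- simulate_chunk() is a made-up stand-in for whatever scientific kernel is being run): split the problem, hand the pieces to several CPUs, then combine the partial results.

```python
from multiprocessing import Pool

def simulate_chunk(chunk):
    """Stand-in for a scientific kernel run on one slice of the problem."""
    return sum(x * x for x in chunk)

def split(data, parts):
    """Divide the full problem into roughly equal, independent pieces."""
    size = max(1, len(data) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    problem = list(range(1_000_000))
    with Pool(processes=4) as pool:               # four workers, four CPUs
        partials = pool.map(simulate_chunk, split(problem, 4))
    print(sum(partials))                          # glue the results back together
```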

What changed was that Moore's Law put an end to the clock-rate war, because chips were simply getting too hot. While faster and faster chips had delivered mostly linear performance increases along with decreases in cost and power consumption, the core temperature inside each microprocessor was climbing at a cubic rate. Back in 2004 Intel released a chart showing that any clock speed over 5 GHz was likely to melt silicon and that Moore's Law would, by 2010, make internal processor temperatures similar to those on the surface of the Sun!
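The column doesn't show the arithmetic behind that chart, but the standard first-order model is worth a line. Dynamic power in a CMOS chip grows with switched capacitance, the square of supply voltage, and clock frequency, and since voltage generally has to rise to sustain higher clocks, pushing frequency pushes heat far faster than it pushes performance:

```latex
% First-order CMOS dynamic-power model (general background, not taken from Intel's chart):
%   alpha = activity factor, C = switched capacitance, V = supply voltage, f = clock frequency
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f
```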

For those, including me, who think the surface of the Sun is pretty darned hot, I'll point out that one of my astronomer readers immediately had to mention that the Sun's chromosphere is actually much hotter than the surface. Forgive him, he means well.

Faced with this absolute thermal performance barrier, Intel and AMD and all the other processor companies had to give up incessant clock speed increases and get us to buy new stuff by putting more than one CPU core in each processor package. Now chips with two and four processor cores are common and Intel hints darkly that we'll eventually see hundreds of cores per chip, which brings us right back to the 1970s and '80s and the world of parallel computing, where all those principles that seemed to have no real application are becoming very applicable, indeed.

And that's exactly where databases start to screw up.

Bob Lozano, chief visionary, evangelist, and father-of-eight (same woman) at Appistry, came up with the first database example I'd heard, and it was eye-opening. Appistry (I've written about them before -- it's in the links) specializes in distributing what would normally be mainframe applications across tens, hundreds, or even thousands of commodity computers that act as one. If a government agency wanted to do meter-scale analysis of very high-resolution images from spy satellites, it might use Appistry, I'd imagine (just a guess, of course).

The database example Bob used at MIT was of an unnamed customer that is a credit card transaction processor. Their core application -- the processing of credit card transactions -- was happening on IBM Z-series mainframes (the biggest of big iron) and the client wanted to port the whole mess to commodity PCs.

Google did it, why not them?

But the first time they tried it at Appistry, it didn't work.

"We hired a technical team that had done similar applications before so they started with replicating the mainframe architecture on the commodity computers, database and all," Lozano recalled. "But when they finished it wasn't appreciably faster than the mainframe it replaced. Even worse, it wouldn't scale. So we fired the team and started over."

The problem, it turned out, was in the database.

The way the original mainframe application functioned was by first receiving transactions, writing them to the database, reading them back out of the database, then doing the actual processing before writing the results back to the database. That's read-write-read-process-write.

The second time through, the Appistry team tossed the database, at least for its duties as a processing platform, instead keeping the transaction -- in fact ALL transactions -- in memory at the same time. This turned the work flow into read-process-write (eventually). The database became more of an archive, and suddenly a dozen commodity PCs could do the work of one Z-series mainframe, saving a lot of power and money along the way.
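Here is a toy sketch of the difference, in Python. It is my own illustration, not Appistry's code; process(), new_db(), and the table layout are made-up stand-ins. The point is simply how many database round-trips each work flow makes.

```python
import sqlite3

def process(txn):
    """Hypothetical stand-in for the real credit card authorization logic."""
    return {**txn, "approved": txn["amount"] < 500}

def new_db():
    """An in-memory SQLite database playing the role of the big-iron database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE txns (id INTEGER, amount REAL)")
    db.execute("CREATE TABLE results (id INTEGER, approved INTEGER)")
    return db

def mainframe_style(transactions, db):
    """Old flow: every transaction round-trips through the database before any
    work gets done -- receive, write, read, process, write."""
    db.executemany("INSERT INTO txns VALUES (?, ?)",
                   [(t["id"], t["amount"]) for t in transactions])
    rows = db.execute("SELECT id, amount FROM txns").fetchall()
    results = [process({"id": i, "amount": a}) for i, a in rows]
    db.executemany("INSERT INTO results VALUES (?, ?)",
                   [(r["id"], int(r["approved"])) for r in results])
    return results

def cloud_style(transactions, db):
    """New flow: keep the live transactions in memory, do the work there, and
    touch the database only at the end, as an archive -- read, process, write."""
    results = [process(t) for t in transactions]          # all work stays in RAM
    db.executemany("INSERT INTO results VALUES (?, ?)",   # archive when convenient
                   [(r["id"], int(r["approved"])) for r in results])
    return results

if __name__ == "__main__":
    txns = [{"id": i, "amount": 100.0 * i} for i in range(10)]
    print(cloud_style(txns, new_db()))
```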

If this sounds like a risky way to do business -- not writing the data to disk until things slow up enough -- remember that's the way Google runs its search engine and why it is so darned fast. Google has THE ENTIRE INTERNET IN MEMORY AT ONCE. If the application slows down they just add more hardware.

This is good news for cloud computing and bad news for mainframes, because systems like Appistry and its competitors (there are several) are going to eventually bury the mainframe by putting the "cloud" into cloud computing. Suddenly a Storage Area Network with a relatively weak database controller is good enough for archiving while the parallel or even massively parallel cloud does the real work.

Later this month Microsoft will announce a cloud version of Windows Server and it will be very interesting to see how it handles database integration and dis-integration.

The database problem is much more than just slow reads and writes. Relational databases also create false dependencies between pieces of data. Dependencies of any kind break parallelism, and therefore make an application hostile to commodity platforms. That is, if one chunk of data (A) is dependent on another chunk of data (B), then no work can be done on A until all work on B is complete. If the dependency is real, like when A and B are both withdrawals from the same bank account, then there are hacks we can try like one I will describe in my next column, but most programmers just choose to have a cup of coffee and wait for B to finish.

But if these are withdrawals from different bank accounts, or maybe even different banks, then no true dependencies exist. Unfortunately, if all of this has been stored in a single relational database then we unintentionally create a false dependency, since that database can only handle a fairly limited number of items concurrently -- we've created a bottleneck that will choke the application.
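A quick sketch of what removing that false dependency looks like (my own illustration, with made-up account data, not code from the column): withdrawals against the same account keep their real ordering dependency, while withdrawals against different accounts are partitioned out and handled in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def withdraw(balance, amount):
    """Apply one withdrawal to an in-memory balance, rejecting overdrafts."""
    return balance - amount if balance >= amount else balance

def process_account(account, start_balance, withdrawals):
    """The real dependency: transactions on ONE account are applied in order."""
    balance = start_balance
    for amount in withdrawals:
        balance = withdraw(balance, amount)
    return account, balance

def process_all(balances, transactions, workers=8):
    # Partition by account. A single shared database would have flattened
    # all of this into one serial queue -- the false dependency.
    per_account = defaultdict(list)
    for account, amount in transactions:
        per_account[account].append(amount)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_account, acct, balances[acct], amounts)
                   for acct, amounts in per_account.items()]
        return dict(f.result() for f in futures)

if __name__ == "__main__":
    balances = {"alice": 300.0, "bob": 150.0}
    txns = [("alice", 100.0), ("bob", 50.0), ("alice", 250.0)]
    print(process_all(balances, txns))  # alice's second withdrawal is rejected
```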

The database guys are busy figuring out how to add more and more concurrency internally, but take a few steps back and think of a large set of commodity boxes all executing a single data-munching app: no matter how sophisticated the database gets, it will still effectively be a single thread to that app.

A traditional response is to pour dollars into the data tier, buying faster, more concurrent SANs, better interconnects, and bigger database servers. That works up to a point and pleases Sun and IBM no end.

Somewhat more helpful, though, are data grid products like Oracle's Coherence, IBM's eXtreme Scale, or Appistry's Fabric Accessible Memory (FAM). For many applications these can keep each chunk of data in memory in the middle tier for more of its lifetime -- hopefully on the same boxes where the large data-munching app resides. But this still doesn't completely solve the problem, because there remain limits on how far the relational database behind it all can scale.
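The general pattern those products share looks something like the write-behind cache below. To be clear, this is a generic sketch of the idea, not the actual Coherence, eXtreme Scale, or FAM API; the class name and flush policy are inventions for illustration.

```python
import time

class WriteBehindGrid:
    """Toy middle-tier data grid: work on in-memory copies, write back lazily."""

    def __init__(self, backing_store, flush_every=5.0):
        self.cache = {}             # in-memory copies the app actually works on
        self.dirty = set()          # keys changed since the last flush
        self.store = backing_store  # the slow relational database / archive
        self.flush_every = flush_every
        self.last_flush = time.monotonic()

    def get(self, key):
        if key not in self.cache:                # fault in from the database once
            self.cache[key] = self.store.get(key)
        return self.cache[key]

    def put(self, key, value):
        self.cache[key] = value                  # the write happens in memory...
        self.dirty.add(key)
        if time.monotonic() - self.last_flush >= self.flush_every:
            self.flush()                         # ...and reaches the database later

    def flush(self):
        for key in self.dirty:                   # batched write-behind
            self.store[key] = self.cache[key]
        self.dirty.clear()
        self.last_flush = time.monotonic()

if __name__ == "__main__":
    database = {}                                # dict standing in for the database
    grid = WriteBehindGrid(database, flush_every=0.0)
    grid.put("cart:42", {"items": 3})
    print(grid.get("cart:42"), database)
```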

Here's how Google attacks this problem, which goes beyond simply keeping the entire Internet in memory. The problem, of course, is that you can't keep the entire Internet in memory in every server because then you'd need more memory chips than even Google can afford to buy.

To scale its search service, then, Google figured that many large problems don't intrinsically require doing things one at a time -- but it first had to free itself of the false dependencies. So Google coined the term MapReduce and created both a set of operations and a way to store the data for those operations natively, all while preserving the natural independence inherent in each problem, building the whole mess atop the remarkable Google File System, which I'll cover some other day.
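The canonical toy example of the MapReduce idea is a word count, sketched below in Python (my own illustration of the pattern, not Google's code or API). Each map runs on its own independent chunk of input, so the maps can be spread across as many cheap boxes as you have; the reduce step just glues the partial counts together.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(text):
    """Map: count words in one independent slice of the data."""
    return Counter(text.split())

def reduce_counts(total, partial):
    """Reduce: merge one partial count into the running total."""
    total.update(partial)
    return total

if __name__ == "__main__":
    chunks = ["the cloud ate my database",
              "the database fought back",
              "the cloud won"]
    with Pool(processes=3) as pool:
        partials = pool.map(map_chunk, chunks)        # maps run in parallel
    print(reduce(reduce_counts, partials, Counter())) # glue the results together
```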

Google led the way but many other companies have followed suit, opening doors to a wide range of new ways of thinking about large-scale data manipulation. Suddenly there are different ways to store the data, new ways to write applications, and new places (thousands of cheap boxes) to run such applications.

What this does for Larry Ellison and his libido is a great question, because it looks like he's bought up most of the traditional database-centric software industry just in time for it to be declared obsolete.

Sorry Larry.

Comments from the Tribe


Exadata.

J Peters | Oct 09, 2008 | 10:52AM


Encyclopedias used to be a shelf full of books. Along came the CD-ROM (and DVD later) and that shelf vanished. But the rest of the books stayed.

For the Googles, Amazons and EBays, "data" was too cumbersome for the old database paradigm. Vast quantities of data, lots of concurrent users, minimal consistency issues. They used and developed new technologies.

But the vast majority of existing systems work the way they are, and won't need or benefit from any revolutionary changes in that super-scale field.

Gary | Oct 10, 2008 | 5:29PM