Monday, February 2, 2009

Web 2.0 Meets Database 101

As of this moment, the popular social bookmarking website Ma.gnolia has been off the air for almost three days because of "data corruption and loss". When (if?) Ma.gnolia comes back the data may be gone forever.

This is a life-critical event for an enterprise defined by the data it stores. Founder Larry Halff's apology even seems to stutter ("to to"):

"I can't provide a certain timeline or prognosis as to to when or to what degree Ma.gnolia or your bookmarks will return"


Initially, the key words are "to what degree". As time marches on, "when" will become more important, perhaps replaced by "who cares?" as users flock to delicious.com (formerly known as del.icio.us, get it?).

According to Wired, "Ma.gnolia posted a short note on its website shortly after 9 a.m. Pacific time (January 30), saying it was down temporarily due to a database failure."

Huh?

A "database failure"? OK, it's time for a FAQ:

Q: Why should I use a database?

A: Because a database server automatically recovers after a system crash. Because a database backup is an internally consistent copy of all your data.

Q: Why should I take a backup of my database?

A: Because computer files sometimes get lost or corrupted. Because disk drives sometimes stop working. Because sometimes, whole computer systems are destroyed. Along with the buildings they are in.

Q: How do I back up my database?

A: If you're talking about SQL Anywhere, shut down the database server and copy the .DB file to another location. On another disk drive, on another computer, maybe even in another building.

Q: How do I restore my database?

A: Copy the backup .DB file back to its original location and start the SQL Anywhere database server.
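
For anyone who wants to see the last two answers as code, here is a minimal sketch in Python. It assumes the database server has already been shut down, and the function names (cold_backup, cold_restore) and file paths are made up for illustration. The checksum comparison is an extra step that is not part of the answers above; it only proves the copy matches the original, not that the database inside is valid.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source, target):
    """Copy source to target, then confirm the two files are identical."""
    source, target = Path(source), Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"copy of {source} to {target} does not match")
    return target


def cold_backup(db_file, backup_dir):
    """Back up: copy the .DB file to another location (server stopped)."""
    return copy_verified(db_file, Path(backup_dir) / Path(db_file).name)


def cold_restore(backup_file, db_file):
    """Restore: copy the backup .DB file back to its original location."""
    return copy_verified(backup_file, db_file)


if __name__ == "__main__":
    # Hypothetical demonstration using throwaway files in a temp directory.
    root = Path(tempfile.mkdtemp())
    live = root / "live" / "demo.db"
    live.parent.mkdir()
    live.write_bytes(b"pretend database pages")
    backup = cold_backup(live, root / "another_drive")
    live.write_bytes(b"corrupted")  # simulate the disaster
    cold_restore(backup, live)
    print(live.read_bytes())
```

In real life "another_drive" should be exactly that: another disk drive, another computer, maybe another building.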

Q: Why haven't you mentioned the relational model, SQL, ACID, online backup or database mirroring?

A: Because this is Database 101. Databases like SQL Anywhere have all that other stuff, but first things first: ya gotta have a plan.

Q: What was that about online backup?

A: The dbbackup utility lets you back up a running database to a disk drive anywhere in the world. It can also do a "live backup" by continuously shipping transaction log records to that other disk drive.
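
Here is a rough sketch, in Python, of how a script might put together the two kinds of dbbackup command lines described above. The -c (connection string), -y (replace existing files without prompting) and -l (live backup file) options are standard dbbackup options, but the connection string values, directory names and the helper function itself are placeholders invented for this example. Check the documentation for your version before relying on any of it.

```python
import subprocess


def dbbackup_command(connection, target_dir=None, live_log=None):
    """Build an argument list for SQL Anywhere's dbbackup utility.

    Give target_dir for an ordinary online backup into that directory,
    or live_log for a continuous "live backup" of the transaction log.
    """
    if (target_dir is None) == (live_log is None):
        raise ValueError("give exactly one of target_dir or live_log")
    args = ["dbbackup", "-c", connection]
    if live_log is not None:
        return args + ["-l", live_log]
    return args + ["-y", target_dir]


def run_backup(args):
    """Run the command; raises CalledProcessError if dbbackup fails."""
    subprocess.run(args, check=True)


if __name__ == "__main__":
    # Placeholder connection parameters, not real credentials.
    conn = "ENG=demo_server;DBN=demo;UID=backup_user;PWD=secret"
    print(dbbackup_command(conn, target_dir="/mnt/offsite/demo_backup"))
    print(dbbackup_command(conn, live_log="/mnt/offsite/demo_live.log"))
```

The sketch only prints the commands; calling run_backup() on one of them is what would actually start the backup.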

Q: What is database mirroring?

A: It's really called "High Availability", where you have two complete database servers with two matching copies of the database in two different locations. When one server fails, all the users are switched over to the other one, and all their data is safe.

But does anyone care?

It may be a moot point now, but Gnolia Systems (Ma.gnolia, get it?) had previously announced an ambitious project to redevelop the Ma.gnolia software. The Ma.gnolia 2 Charter is most interesting for what's missing, not what it contains.

For example, it mentions technical terms like Ruby on Rails, OPENID, OAUTH and a "new API method: user_find", but none of the following words appear anywhere in the 12 pages:
  • database
  • datastore
  • file
  • backup
  • recovery
  • restore
  • integrity
  • availability
  • strategy
  • infrastructure
These words do appear, but not in the context of basic data storage and protection:
  • available
  • data
  • safety
  • protect, protection
  • risk
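
The word check above is easy to repeat on any planning document. Here is a minimal sketch in Python, assuming the text has already been extracted from the PDF into a string; the function name count_terms is invented for this example, and the sample text is just the "Why M2?" excerpt quoted later in this article, not the whole charter. Matching is whole-word and case-insensitive, so "available" does not count as a hit for "availability".

```python
import re

# The ten words from the first list above.
STORAGE_TERMS = [
    "database", "datastore", "file", "backup", "recovery",
    "restore", "integrity", "availability", "strategy", "infrastructure",
]


def count_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of each term."""
    counts = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    excerpt = (
        "A major re-design is required to truly take advantage of lessons "
        "learned over 3 years. These issues range across identity, "
        "reputation, spam, privacy and contact management, cross-service "
        "presence, operational costs and the personal and organizational "
        "goals that customers bring to a social bookmarking service."
    )
    for term, count in count_terms(excerpt, STORAGE_TERMS).items():
        print(f"{term}: {count}")
```

Deciding whether a word appears "in the context of basic data storage and protection" still takes a human reader; the script only tells you where to look.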

Can you hear me now?

Yeah, sure, I'm piling on, kicking Ma.gnolia when it's down, but that's not the point.

The point is, WHO ELSE OUT THERE DOESN'T GET IT?

Larry Halff and his team are paying for their complete lack of foresight. They will remember this past weekend for the rest of their lives. But Ma.gnolia doesn't matter; what matters is YOUR DATA.

Gosh, do ya think?

So I'm gonna pile on some more, with this excerpt from the "Why M2?" section of the Ma.gnolia 2 Charter:
"A major re-design is required to truly take advantage of lessons learned over 3 years. These issues range across identity, reputation, spam, privacy and contact management, cross-service presence, operational costs and the personal and organizational goals that customers bring to a social bookmarking service."
Gosh, do ya think it's time to add something to the "lessons learned"?

31 comments:

Anonymous said...

Interesting - only a few weeks ago from slashdot http://hardware.slashdot.org/article.pl?sid=09%2F01%2F02%2F1546214&from=rss

"Journalspace.com has fallen and can't get up. The post on their site describes how their entire database was overwritten through either some inconceivable OS or application bug, or more likely a malicious act. Regardless of how the data was lost, their undoing appears to have been that they treated drive mirroring as a backup and have now paid the ultimate price for not having point-in-time backups of the data that was their business."

I hope the saying "bad things come in threes" isn't true in this case

Anonymous said...

I'm curious how you know that this is an issue with their database backup process (or lack of)? Couldn't it just be that their hardware failed and they're trying to get it fixed?

Breck Carter said...

Three days to fix a hardware problem? Could be, but I don't think so.

The Wired article says "The service lost both its primary store of user data, as well as its backup."

All of which is absolutely unforgivable... under no circumstances should any hardware or software failure cause an outage of this duration.

Not for a real company.

Breck Carter said...

It's not looking good: "Simultaneously, working on tools to help members recover bookmarks from other sources on the web, if necessary."

http://twitter.com/magnolia/

3 days is a long time to go without sleep; the chances of more mistakes rise exponentially.

Breck Carter said...

It's *really* not looking good, the website is off the air... although ping still works:

Pinging ma.gnolia.com [216.93.189.58] with 32 bytes of data:

Reply from 216.93.189.58: bytes=32 time=80ms TTL=47
...
Ping statistics for 216.93.189.58:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 80ms, Maximum = 81ms, Average = 80ms

philipp said...

it doesn't matter if the webserver isn't responding for some time. what does matter is if they're able to recover my bookmarks.
i can live with lost tags, lost ratings, lost thumbnails and lost site-copies, but i do need my bookmarks.

it would be great if the guys over at ma.gnolia could post some more status updates (e.g. "bookmarks are safe, still working on the tags" or sth.)

Breck Carter said...

philipp: See http://twitter.com/magnolia/

They just posted this a few minutes ago: "Still working on data store. In meantime, FriendFeed recovery is available here: http://tr.im/ed0u"

IMO twitter is fine but they REALLY need to put that on their (still absent) home page... they can host a single-page website anywhere in the world :)

Breck Carter said...

Wired is excited about Ma.gnolia's clever approach to getting users to recover their own data from the cloud.

http://blog.wired.com/business/2009/02/magnolia-using.html

I guess we're all Thermians.

http://us.imdb.com/character/ch0008369/

Shirley said...

Greeting.

I think Microsoft Office SharePoint Server 2007 is something you definitely want to look at. We specialise in this.

There is more information on this at http://www.nsynergy.com or please mail to info@nsynergy.com.

Breck Carter said...

Shirley: I am sure SharePoint is a fine product but this article is about backups, database backups in particular, and that's not what SharePoint is all about... nor, apparently, is that what nSynergy is about. E.g., this Google search yields no hits:

database backup site:nsynergy.com

To be specific, SharePoint is something that *needs to be backed up*, not something that helps you *do a backup*:

http://technet.microsoft.com/en-us/library/cc262412.aspx

Anonymous said...

OK, one thing that doesn't seem to be said here. Don't rely on live replication as a backup mechanism. Live replication of a corrupted database gives you.... two corrupted databases. Always have some OFFLINE backup that can be corrupted real time.

Anonymous said...

Correction

Replace "backup that can be corrupted real time"
with "backup that cannot be corrupted real time"

Breck Carter said...

Anonymous: Yes, it should be said but it wasn't. I did mention "mirroring" in passing and shouldn't have done that without your exhortation. One might quibble that it isn't necessary in a "Database 101" article but that isn't true today: with cheap computers and cheap disk drives and cheap networks, high availability / mirroring / live replication is really affordable (and really really easy with SQL Anywhere, but that's neither here nor there)... and so many organizations DO rely on it for backup purposes. It is sooooo tempting.

Also missing is "validate the backup"... no quibbling there, that SHOULD be mentioned even in a Database 101 article. I use the "Not Enough Coffee" defense :)

The next article (The Tao Of Backup) does provide a pointer to a *thorough* discussion.

Breck Carter said...

Sooooo... it looks like Ma.gnolia is a two-person operation, has never been more than 4 people. That's interesting, but irrelevant when it comes to backup IMO.

http://getsatisfaction.com/magnolia/topics/bookmark_recovery_tips

Breck Carter said...

Correction: Apparently it's a one-person operation.

Anonymous said...

Breck, where did you pick up on that? I recall hearing somewhere that Gnolia Systems had a bigger team than that.

Anonymous said...

oooh ok I just saw Todd's comment. NVM

Sean said...

The Ma.gnolia team is quite small. Mostly two guys who have had various bits of help along the way.

Anonymous said...

I'm a one person team re:backups and everything is backed up overnight every day! and, the backups are tested regularly to be sure we can get them back.

And this is not on a machine in the same locale (or, heaven forbid, backup A to B, B to A, C to D, D to C, though even that would have been better than what we're seeing from gnolia)

Not bright at all (what does this say about cloud computing, btw?)

Anonymous said...

I might be able to see how this could happen. When Larry started Ma.gnolia, backing up might have been a simple mysqldump at midnight to another server in the colo.

When traffic and data increased, the dump might have taken longer and longer until the point at which the mysqldump no longer made sense, but the "next step" up might have been cost prohibitive... so they started to use hacks to work around their problem and it so happened the hacks failed.

Breck, what do you think?

Anonymous said...

Larry just posted some info on the situation...

As I expected... apparently they have 500GB to back up. It's not trivial to back up that kind of data on a live system. Sure, experienced sys admins probably know how to do that, but for a two-man team, the amount of data probably got ahead of them.

http://getsatisfaction.com/magnolia/topics/ma_gnolia_data_recovery_status

brista said...

Well, I for one did not know that this site was run completely by some guy in his mom's garage. So even if it comes back (which I doubt), I won't be using it any longer.

Breck Carter said...

Anonymous asks what I think about "When Larry started Ma.gnolia, backing up might have been a simple mysqldump at midnight to another server in the colo."

My article was aimed more at system developers who, for many different reasons, might be risking the same disaster that has befallen Ma.gnolia. There are two big points here, "backup" being the one getting all the attention.

The other question is "Did Ma.gnolia use a database at all?" As far as I am concerned certain configurations of MySQL do not qualify as databases at all in the enterprise sense of the word... and 500GB makes Ma.gnolia's data store larger than the vast majority of enterprise data stores (for every eBay there are 1,000 "ordinary" enterprises).

Other people have suggested to me that Ma.gnolia may have used SimpleDB on Amazon EC2 due to "the simplicity of their application". Now, those "other people" are folks I respect tremendously, BUT just because a user interface appears to be simple does not mean the underlying data store is simple as well.

So, in spite of me slagging Ma.gnolia in the article, I cannot say "this is exactly what they did wrong", only that "they did something wrong". Police have stopped using the term "accident" to describe automobile collisions because when two cars smack into each other it's (almost) always someone's fault. From what little we know about the Ma.gnolia situation, the disaster was the direct result of decisions made (or not made) by those responsible (ok, by Larry).

- continued -

Breck Carter said...

What we do know now is that the data store was 500GB. That's a reason a database *should* be used, that a backup strategy *should* be carefully considered, not the other way around. Surely, somewhere in Cloud City, Larry could have found someone to advise him. (But wait, that's pure speculation; we really don't know the details.)

What I can say with certainty is that "500GB" makes the Ma.gnolia 2 Charter ( http://ma.gnolia.org/docs/M2_Charter.pdf ) even MORE amazing: it mentions Ruby on Rails but not any of these terms:

* database
* datastore
* file
* backup
* recovery
* restore
* integrity
* availability
* strategy
* infrastructure

- continued -

Breck Carter said...

One might ask, "Why is Breck so interested in picking at the entrails of Ma.gnolia?" Because I was originally trained as an Engineer, as in the professional (bridge building etc) sense of the word. Engineers study engineering mistakes, that's part of what they do. Think NTSB, they do the same thing with airplane incidents, and believe me, NTSB reports have a *wide* following. The study of mistakes is one of the reasons airplanes and bridges are so safe today. Mistakes continue to be made, so the study must continue; think I35W in Minneapolis.

On a more personal note, Larry says he is "currently working with a data recovery company in hopes that they can recover a working version of the database". I have been there, done that, spent the enormous sums required to get prompt service, and (re-)learned a valuable lesson. Personally, I have a hard time reading The Tao Of Backup because it creeps me out; not the cheesy Tao thing but the examples that are so real, so very possible. So I wish Larry well, and hope that someday he will share the details, but I will understand if he does not.

Anonymous said...

One thing that always bothers me about these discussions, the word "backup".

I think the term pollutes our thinking. Any time I have been tasked to write a backup plan I politely tell folks I don't write backup plans, I write restoration plans. Without restoration backup is irrelevant. I have seen many "backup" plans that dutifully capture all the data while having a restoration process that requires vast amounts of labor and time. How many times have we seen requirements that say backup must happen every day, we can't lose more than X hours of data, but are totally silent on any SLA around restore. I write restoration plans.

Breck Carter said...

It's been over a week since Ma.gnolia disappeared, and the "who cares?" moment is fast approaching... if it isn't already in the past.

"Thanks to Larry's recommendation of Diigo, I've started using it and wish that I'd found it earlier. ... I don't have any reason to care whether Magnolia lives or dies any more beyond recovering my bookmarks that disappeared with Magnolia." GetSatisfaction

Breck Carter said...

Another week has passed... and for another view of the situation see Backup Policies Can Really Save Businesses.

Breck Carter said...

Oops, I should have linked to the original article here: Backup Policies Can Really Save Businesses.

At the bottom of *that* page is a particularly poignant link: "Social bookmark this page" :)

Breck Carter said...

Posted on Ma.gnolia.com: Update (Friday, February 13, 2009, 7:00 PM PST): The data recovery folks let me know that they're still work, but I should hear more from them by Tuesday.

ehsanul said...

As a non-magnolia user, your post seems overly critical, though obviously the points you make are perfectly valid. Maybe it seems to me that way because I just watched the following video, and the guy seemed pretty alright, and had simply been clueless about how to properly back up his database (which is his fault of course, but I still think he's alright):

http://factoryjoe.com/blog/2009/02/16/what-really-happened-at-magnolia-and-lessons-learned/