Thursday, September 1, 2011

The Thursday Quote - Kinshuman Kinshumann

"The (Windows Error Reporting) service employs approximately 60 servers provisioned to process well over 100 million error reports per day. "
Debugging in the (Very) Large: Ten Years of Implementation and Experience, by Kinshuman Kinshumann, Kirk Glerum, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt, from Communications of the ACM, July 2011

What Kinshuman is talking about are those "E.T. call home" moments when Windows calls the Mother Ship to report that something has gone wrong. No, there's nobody sitting at Microsoft waiting to handle your error report, but you knew that already.

What you probably didn't know, and I sure didn't know, is that Microsoft does do something with those error reports, and it does get a lot of them. Not 100 million a day (that's the provisioned capacity), but a lot nonetheless. Here are two excerpts from the article:

WER is the first system to provide automatic error diagnosis, the first to use progressive data collection to reduce overheads, and the first to automatically direct users to available fixes based on automated error diagnosis. WER remains unique in four aspects:

1. WER is the largest automated error-reporting system in existence. Approximately one billion computers run WER client code: every Windows system since Windows XP.

2. WER automates the collection of additional client-side data for hard-to-debug problems. When initial error reports provide insufficient data to debug a problem, programmers can request that WER collect more data in future error reports including: broader memory dumps, environment data, log files, and program settings.

3. WER automatically directs users to solutions for corrected errors. For example, 47% of kernel crash reports result in a direction to an appropriate software update or work around.

4. WER is general purpose. It is used for operating systems and applications, by Microsoft and non-Microsoft programmers. WER collects error reports for crashes, non-fatal assertion failures, hangs, setup failures, abnormal executions, and hardware failures.

. . .

WER collected its first million error reports within 8 months of its deployment in 1999. Since then, WER has collected billions more. The WER service employs approximately 60 servers provisioned to process well over 100 million error reports per day. From January 2003 to January 2009, the number of error reports processed by WER grew by a factor of 30.

The WER service is over provisioned to accommodate globally correlated events. For example, in February 2007, users of Windows Vista were attacked by the Renos Malware. If installed on a client, Renos caused the Windows GUI shell, explorer.exe, to crash when it tried to draw the desktop. A user's experience of a Renos infection was a continuous loop in which the shell started, crashed, and restarted. While a Renos-infected system was useless to a user, the system booted far enough to allow reporting the error to WER—on computers where automatic error reporting was enabled—and to receive updates from Windows Update (WU).

As Figure 2 shows, the number of error reports from systems infected with Renos rapidly climbed from 0 to almost 1.2 million per day. On February 27, shown in black in the graph, Microsoft released a Windows Defender signature for the Renos infection via WU. Within 3 days enough systems had received the new signature to drop reports to under 100,000 per day. Reports for the original Renos variant became insignificant by the end of March. The number of computers reporting errors was relatively small: a single computer (somehow) reported 27,000 errors, but stopped after being automatically updated.

Here in our own little corner of the computing world called "SQL Anywhere" we see what appears to be a similar error reporting service...

... I wonder what happens to those error reports?

Surely not this!

Next week: Rudy Rucker

1 comment:

Jeff Albion said...

I can personally vouch that there are SQL Anywhere developers assigned to reviewing the automatic submissions sent to Sybase by 'dbsupport'. :)

The dbsupport tool is perhaps not (yet!) "WER" quality (e.g. you submit a bug, the server quickly analyzes it, and then it refers you to a document that may or may not help you out). However, we have already had some success at correlating common stack traces in the submissions with particular engineering change requests. It's our hope that (eventually) this database will help us quickly identify the crash issues that database servers are encountering, as the reports are being sent in.
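
To make the idea of correlating common stack traces a bit more concrete, here is a minimal sketch of one way it could work: hash the top few frames of each trace into a short signature, then group reports that share it. This is not how dbsupport or WER actually does it. The function names, the frame names, the report ids, and the choice of three frames are all made up for illustration.

```python
from collections import defaultdict
import hashlib


def signature(frames, depth=3):
    """Hash the top `depth` frame names into a short bucket key."""
    top = "|".join(frames[:depth])
    return hashlib.sha1(top.encode("utf-8")).hexdigest()[:12]


def bucket_reports(reports, depth=3):
    """Group report ids by the signature of their stack trace."""
    buckets = defaultdict(list)
    for report_id, frames in reports:
        buckets[signature(frames, depth)].append(report_id)
    return dict(buckets)


# Hypothetical crash reports: (report id, stack frames, innermost first).
reports = [
    ("r1", ["db_fetch", "exec_stmt", "worker_main", "thread_start"]),
    ("r2", ["db_fetch", "exec_stmt", "worker_main", "pool_start"]),
    ("r3", ["parse_expr", "parse_stmt", "worker_main", "thread_start"]),
]

buckets = bucket_reports(reports)
for key, ids in buckets.items():
    print(key, ids)
```

Here r1 and r2 land in the same bucket because their top three frames match, even though the traces differ further down. A bucket like that is the sort of thing that could then be linked by hand to a single change request.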