Thursday, July 7, 2011

The Thursday Quote - Michael Stonebraker

"... any programmer seriously interested in performance runs transactions using a stored-procedure interface, rather than SQL commands over ODBC/JDBC."
- 10 Rules for Scalable Performance in ‘Simple Operation’ Datastores, by Michael Stonebraker and Rick Cattell, Communications of the ACM, June 2011

That quote is taken slightly out of context from a long but very readable and fascinating article that I'm reading for the second time (I said the article is readable and fascinating, I didn't say it was shallow).

Here's what it says to me: Since many application developers working with frameworks have lost (or never had) the ability to write serious SQL, and therefore couldn't code a stored procedure to save their lives, they are doomed if they develop a serious interest in database performance.

Here are some snippets to show the quote in context (the bolding is mine):

The relational model of data was proposed in 1970 by Ted Codd as the best solution for the DBMS problems of the day — business data processing. Early relational systems included System R and Ingres, and almost all commercial relational DBMS (RDBMS) implementations today trace their roots to these two systems. As such, unless you squint, the dominant commercial vendors — Oracle, IBM, and Microsoft — as well as the major open source systems — MySQL and PostgreSQL — all look about the same today; we term these systems general-purpose traditional row stores, or GPTRS, sharing the following features:
  • Disk-oriented storage;

  • Tables stored row-by-row on disk, hence, a row store;

  • B-trees as the indexing mechanism;

  • Dynamic locking as the concurrency-control mechanism;

  • A write-ahead log, or WAL, for crash recovery;

  • SQL as the access language; and

  • A “row-oriented” query optimizer and executor, pioneered in System R.
. . .
Rule 2. High-level languages are good and need not hurt performance. Work in a SQL transaction can include the following components:
  • Overhead resulting from the optimizer choosing an inferior execution plan;

  • Overhead of communicating with the DBMS;

  • Overhead inherent in coding in a high-level language;

  • Overhead for services (such as concurrency control, crash recovery, and data integrity); and

  • Truly useful work to be performed, no matter what
. . .
The second source of overhead is communicating with the DBMS. For security reasons, RDBMSs insist on the application being run in a separate address space, using ODBC or JDBC for DBMS interaction. The overhead of these communication protocols is high; running a SQL transaction requires several back-and-forth messages over TCP/IP. Consequently, any programmer seriously interested in performance runs transactions using a stored-procedure interface, rather than SQL commands over ODBC/JDBC. In the case of stored procedures, a transaction is a single over-and-back message. The DBMS further reduces communication overhead by batching multiple transactions in one call. The communication cost is a function of the interface selected, can be minimized, and has nothing to do with the language level of the interaction.
. . .
The quote is still slightly out of context because the article isn't just talking about conversational SQL versus stored procedures, but the authors do repeat the point in a later section...

. . .
A conventional disk-based DBMS clearly spends the overwhelming majority of its cycles on overhead activity. To go a lot faster, the DBMS must avoid all the overhead components discussed here; for example, a main memory DBMS with conventional multithreading, locking, and recovery is only marginally faster than its diskbased counterpart. A NoSQL or other database engine will not dramatically outperform a GPTRS implementation, unless all these overhead components are addressed or the GPTRS solution has not been properly architected (by, say, using conversational SQL rather than a compiled stored procedure interface).
. . .

No comments: