Re: Suspending SELECTs

mark@xxxxxxxxxxxxxx wrote:
On Wed, Jan 18, 2006 at 09:57:50AM +0100, Alessandro Baretta wrote:

If there is to be a change to PostgreSQL to optimize for this case, I
suggest it involve the caching of query plans, executor plans, query
results (materialized views?), LIMIT and OFFSET. If we had all of
this, you would have exactly what you want, while benefitting many
more people than just you. No ugly 'persistent state cursors' or
'import/export cursor state' implementation. People would automatically
benefit, without changing their applications.

Actually, many of the features you mention (caching executor plans--that is, imposing the use of prepared queries--caching query results, and materializing views) I have already implemented in my "middleware". Somehow, apparently, my intuition on the problem of determining what ought to be delegated to the DBMS and what to the "middleware" is the opposite of that of most people on this list. As I mentioned earlier, I democratically accept the position of the majority--and monarchically I accept Tom's. And, scientifically, I have taken responsibility for proving myself wrong: I have stated my assumptions, formulated the hypothesis, designed an experiment capable of disproving it, and collected the data. Here are the steps and the results of the experiment.
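
A minimal sketch of what "imposing the use of prepared queries" amounts to at the SQL level, assuming a hypothetical documents table, looks like this; the plan is built once by PREPARE and only execution is repeated:

    PREPARE doc_page (text, int, int) AS
        SELECT * FROM documents WHERE title LIKE $1 ORDER BY id LIMIT $2 OFFSET $3;
    EXECUTE doc_page('%linux%', 100, 0);    -- first page: cached plan, only executed
    EXECUTE doc_page('%linux%', 100, 100);  -- second page: same plan reused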

Assumptions: Google defines the "best current practices" in web applications.

Hypothesis: The "best practice" for returning large data sets is to devise an algorithm (say, persistent cursors, for example) allowing subdivision of the recordset into pages of a fixed maximum size, in such a way that sequentially fetching pages of records requires the system to compute each record only once.
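
As an illustration of the proposed kind of algorithm, a cursor-based pagination sketch in PostgreSQL (table, column and cursor names are hypothetical) lets each page be fetched without recomputing the previous ones, as long as the cursor stays open:

    BEGIN;
    DECLARE results CURSOR FOR
        SELECT * FROM documents WHERE title LIKE '%linux%' ORDER BY id;
    FETCH FORWARD 100 FROM results;  -- page 1: rows 1-100
    FETCH FORWARD 100 FROM results;  -- page 2: resumes at row 101, page 1 is not recomputed
    CLOSE results;
    COMMIT;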

Experiment: Given the stated assumption, record the time taken by Google to retrieve a sequence of pages of results for the same query string. Repeat the experiment with different settings. Note that Google actually displays on the results page the time taken to process the request.

Results: I'll omit the numerical data, which anyone can easily obtain in a few minutes by repeating the experiment. I used several query strings containing very common words ("linux debian", "linux kernel", "linux tux"), each yielding millions of results. I set Google to retrieve 100 results per page, then ran the queries and paged through the data sets. The obvious result is that execution time is a monotonically increasing function of the page number. This clearly indicates that Google does not use any algorithm of the proposed kind, but rather an OFFSET/LIMIT strategy, thus disproving the hypothesis.
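
For comparison, the OFFSET/LIMIT strategy restates the whole query for every page (same hypothetical table as above); the server still has to produce and discard the first OFFSET rows before returning the requested page, which is consistent with the latency growing with the page number:

    -- page 1
    SELECT * FROM documents WHERE title LIKE '%linux%' ORDER BY id LIMIT 100 OFFSET 0;
    -- page 10: the first 900 rows are computed and discarded before the 100
    -- requested rows are returned, so the cost grows with the offset
    SELECT * FROM documents WHERE title LIKE '%linux%' ORDER BY id LIMIT 100 OFFSET 900;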

It must also be noted that Google refuses to return more than 1000 results per query, which indicates that the strategy they adopted quite apparently cannot scale indefinitely: on a query returning a potentially flooding dataset, a user paging through the data would experience a slowdown linear in the number of pages already fetched, and the per-page DBMS workload would likewise grow linearly with the number of fetched pages.

I do not like this approach, but the fact that Google came up with no better solution is a clear indication that Tom et al. are more than correct.

Alex

--
*********************************************************************
http://www.barettadeit.com/
Baretta DE&IT
A division of Baretta SRL

tel. +39 02 370 111 55
fax. +39 02 370 111 54

Our technology:

The Application System/Xcaml (AS/Xcaml)
<http://www.asxcaml.org/>

The FreerP Project
<http://www.freerp.org/>

