mark@xxxxxxxxxxxxxx wrote:
> On Wed, Jan 18, 2006 at 09:57:50AM +0100, Alessandro Baretta wrote:
> If there is to be a change to PostgreSQL to optimize for this case, I
> suggest it involve the caching of query plans, executor plans, query
> results (materialized views?), LIMIT and OFFSET. If we had all of
> this, you would have exactly what you want, while benefiting many
> more people than just you. No ugly 'persistent state cursors' or
> 'import/export cursor state' implementation. People would automatically
> benefit, without changing their applications.
Actually, many of the features you mention (caching executor plans, that is,
imposing the use of prepared queries; caching query results; and materializing
views) I have already implemented in my "middleware". Somehow, apparently, my
intuition about what ought to be delegated to the DBMS and what to the
"middleware" is the opposite of that of most people on this list. As I
mentioned earlier, I democratically accept the position of the majority, and
monarchically I accept Tom's. And, scientifically, I have taken responsibility
for proving myself wrong: I have stated my assumptions, formulated a
hypothesis, designed an experiment capable of disproving it, and collected the
data. Here are the steps and the results of the experiment.
Assumption: Google defines the "best current practice" in web applications.
Hypothesis: The "best practice" for returning large data sets is to devise an
algorithm (persistent cursors, for example) that subdivides the record set into
pages of a fixed maximum size, in such a way that sequentially fetching pages
of records requires the system to compute each record only once.
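In PostgreSQL terms, the scheme the hypothesis describes would look roughly
like the following: a server-side (named) cursor held open inside a
transaction, so that each record is computed once and pages are simply fetched
in sequence. This is only a minimal sketch in Python with psycopg2 (the DSN,
the "documents" table and the query are invented), and of course the hard part
in a web application is keeping such a cursor alive across stateless HTTP
requests.

    import psycopg2

    PAGE_SIZE = 100

    conn = psycopg2.connect("dbname=test")   # hypothetical DSN
    # Named cursors must live inside a transaction (autocommit off, the default).
    cur = conn.cursor(name="page_cursor")    # server-side cursor

    # The backend produces rows lazily from this single execution, so no row
    # is ever computed twice, however many pages we end up fetching.
    cur.execute("SELECT id, title FROM documents ORDER BY id")   # invented query

    while True:
        page = cur.fetchmany(PAGE_SIZE)      # fetch the next page of rows
        if not page:
            break
        print(len(page), "rows on this page")

    cur.close()
    conn.close()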
Experiment: Given the stated assumption, record the time taken by Google to
retrieve a sequence of pages of results for the same query string. Repeat the
experiment with different settings. Notice that Google actually displays on the
results page the time taken to process the request.
Results: I'll omit the numerical data, which anyone can obtain in a few minutes
by repeating the experiment. I used several query strings containing very
common words ("linux debian", "linux kernel", "linux tux"), each yielding
millions of results. I set Google to retrieve 100 results per page, ran the
query, and paged through the data set. The obvious result is that execution
time is a monotonically growing function of the page number. This clearly
indicates that Google does not use any algorithm of the proposed kind, but
rather an OFFSET/LIMIT strategy, thus disproving the hypothesis.
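For comparison, the OFFSET/LIMIT strategy the timings point to is,
schematically, the following. Again just a sketch with psycopg2 and an invented
"documents" table: every page is an independent, stateless query, so to serve
page n the server has to produce and discard the first n * 100 rows, which is
exactly the kind of per-page growth I measured.

    import psycopg2

    PAGE_SIZE = 100

    conn = psycopg2.connect("dbname=test")   # hypothetical DSN

    def fetch_page(page_number):
        # A fresh, stateless query per page, as in a web application: the
        # backend recomputes and throws away page_number * PAGE_SIZE rows
        # before returning the PAGE_SIZE rows we actually want.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, title FROM documents "
                "ORDER BY id LIMIT %s OFFSET %s",
                (PAGE_SIZE, page_number * PAGE_SIZE),
            )
            return cur.fetchall()

    for n in range(10):
        rows = fetch_page(n)   # page 9 discards 900 rows to return 100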
It must also be noted that Google refuses to return more than 1000 results per
query, which indicates that the strategy they adopted quite clearly cannot
scale indefinitely: on a query returning a potentially flooding data set, a
user paging through the results would experience a slowdown linear in the
number of pages already fetched, and the DBMS workload would likewise grow
linearly with the number of fetched pages (at 100 results per page, serving
page 10 means computing and discarding 900 rows in order to deliver 100).
I do not like this approach, but the fact that Google came up with no better
solution is a clear indication that Tom et al. are more than correct.
Alex
--
*********************************************************************
http://www.barettadeit.com/
Baretta DE&IT
A division of Baretta SRL
tel. +39 02 370 111 55
fax. +39 02 370 111 54
Our technology:
The Application System/Xcaml (AS/Xcaml)
<http://www.asxcaml.org/>
The FreerP Project
<http://www.freerp.org/>