So I spent a bit of quality time with oprofile this morning, and found once again that there's no substitute for having actual data before theorizing.

Test case software: current Git HEAD (plus one code change explained below), compiled with --enable-debug to support oprofile, cassert off; no other special configure options. Running on current Fedora 13 (gcc 4.4.4 in particular). All postgresql.conf options are out-of-the-box.

Test case hardware: recently purchased mid-grade desktop, dual Xeon E5503 processors (Nehalem cores, 2GHz), 4GB DDR3-800 RAM, no-name SATA disk.

Test query: "select count(*) from t", where t has 4 nonnull integer columns and 81920000 rows, occupying 3459MB. I chose that size specifically to fit into available RAM, so that on repeated executions no physical I/O will occur.

On this setup I find that "select count(*)" runs in about 7.5sec when the data is fully cached in RAM, for a scanning speed of 460MB/sec. This is well in excess of what the machine's disk hardware can do: bonnie++ rates the machine's disk read speed at 152MB/sec. So in theory PG should be able to completely saturate the disk when processing a table bigger than RAM. In reality, the test case run time if I've just flushed cache is about 28sec, working out to a scan rate of 123MB/sec. I expect that if I'd bothered to tune the kernel readahead parameters as outlined earlier in this thread, I could get to 150MB/sec.

Now of course this disk setup is far from industrial strength, but the processor isn't what you'd put in a serious database server either (in particular, its available memory bandwidth is well behind the curve). Also, the table is pretty narrow (only 16 payload bytes per row); since the CPU cost is mostly per-row rather than per-byte, any wider test table would show pretty much linear scaling of achievable scan rate versus table width. So I don't see much support here at all for the notion that we scan slower than available disk bandwidth.
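As a quick sanity check on the arithmetic above, here is a small Python sketch; every number in it is copied from the text, nothing is newly measured:

```python
# Scan-rate arithmetic for the test case described above.
# All inputs are taken directly from the text.
TABLE_MB = 3459          # size of table t
CACHED_SECS = 7.5        # runtime with the table fully cached in RAM
COLD_SECS = 28           # runtime right after flushing the OS cache

cached_rate = TABLE_MB / CACHED_SECS   # ~461 MB/s, quoted as 460MB/sec
cold_rate = TABLE_MB / COLD_SECS       # ~123.5 MB/s, quoted as 123MB/sec

print(round(cached_rate, 1), round(cold_rate, 1))
```

Both results line up with the quoted 460MB/sec and 123MB/sec figures, and the cached rate is indeed about 3x the bonnie++ disk figure of 152MB/sec.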
Further details from poking at it with oprofile: in the fully-cached case the CPU time is about 80% Postgres and 20% kernel. That kernel time is of course all to do with moving pages from kernel disk buffers into Postgres shared memory. Although I've not bothered to increase shared_buffers from the default 32MB, it wouldn't matter on this benchmark unless I were able to make shared_buffers hold the entire table ... and even then I'd only save 20%.

oprofile further shows that (with stock Postgres sources) the userspace runtime breaks down like this:

samples  %        symbol name
141267   13.0810  heapgettup_pagemode
 85947    7.9585  advance_aggregates
 83031    7.6885  ExecProject
 78975    7.3129  advance_transition_function
 75060    6.9504  heapgetpage
 73540    6.8096  ExecClearTuple
 69355    6.4221  ExecProcNode
 59288    5.4899  heap_getnext
 57745    5.3470  ExecScan
 55618    5.1501  HeapTupleSatisfiesMVCC
 47057    4.3574  MemoryContextReset
 41904    3.8802  ExecStoreTuple
 37146    3.4396  SeqNext
 32206    2.9822  ExecAgg
 22135    2.0496  XidInMVCCSnapshot
 21142    1.9577  int8inc
 19280    1.7853  AllocSetReset
 18211    1.6863  hash_search_with_hash_value
 16285    1.5079  TransactionIdPrecedes

I also looked at the source-line-level breakdown, though that's too bulky to post here. The most interesting fact here is that tuple visibility testing (MVCC) overhead is simply nonexistent: it'd be in heapgetpage() if it were being done, which it isn't because all the pages of the table have the PageIsAllVisible bit set. In a previous run where those bits weren't set but the per-tuple hint bits were, visibility testing still only ate a percent or two of the runtime. So the theory some people have espoused in this thread, that visibility testing is the bottleneck, doesn't hold water either. If you go back and look at previous pgsql-hackers discussions about that, what people have been worried about is not the CPU cost of visibility testing but the need for indexscan queries to visit the heap for no other purpose than to check the visibility flags.
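Slicing a flat profile like the one above is easy to script. Here is a plain-Python sketch (the percentages are copied from the top of the table; the grouping helper is mine, not part of oprofile) that totals arbitrary groups of symbols:

```python
# Sum selected symbols out of the oprofile flat profile shown above.
# Percent-of-runtime figures copied verbatim from the profile.
profile = {
    "heapgettup_pagemode": 13.0810,
    "advance_aggregates": 7.9585,
    "ExecProject": 7.6885,
    "advance_transition_function": 7.3129,
    "heapgetpage": 6.9504,
    "ExecClearTuple": 6.8096,
}

def share(symbols):
    """Total percentage of runtime attributed to the given symbols."""
    return sum(profile[s] for s in symbols)

# ExecProject + ExecClearTuple together account for ~14.5% of the runtime.
print(round(share(["ExecProject", "ExecClearTuple"]), 2))
```

That ExecProject/ExecClearTuple pair is the "nearly 15%" of aggregate-argument evaluation overhead that the patch below goes after.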
In a seqscan it's not going to matter.

I looked a bit more closely at the heapgettup_pagemode timing. The lines shown by opannotate as more than 0.1 percent of the runtime are:

 22545  2.2074 :{ /* heapgettup_pagemode total: 153737 15.0528 */
  5685  0.5566 :    bool        backward = ScanDirectionIsBackward(dir);
  5789  0.5668 :    if (!scan->rs_inited)
  5693  0.5574 :        lineindex = scan->rs_cindex + 1;
 11429  1.1190 :        dp = (Page) BufferGetPage(scan->rs_cbuf);
  5693  0.5574 :        linesleft = lines - lineindex;
  5766  0.5646 :    while (linesleft > 0)
  5129  0.5022 :        lineoff = scan->rs_vistuples[lineindex];
 44461  4.3533 :        tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
 11135  1.0903 :        tuple->t_len = ItemIdGetLength(lpp);
  5692  0.5573 :        if (key != NULL)
  5773  0.5653 :            HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
  5674  0.5556 :    scan->rs_cindex = lineindex;
 11406  1.1168 :}

There doesn't seem to be a whole lot of room for improvement there. Maybe we could shave a couple percent with some tenser coding (I'm wondering why HeapKeyTest is being reached, in particular, when there's no WHERE clause). But any local changes here will be marginal at best.

One thing I did find is that the time spent in ExecProject/ExecClearTuple, amounting to nearly 15% of the runtime, is just for evaluating the arguments of the aggregate ... and count(*) hasn't got any arguments. So a patch like this improves the run speed by about 15%:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a7dafeb..051e70c 100644
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** advance_aggregates(AggState *aggstate, A
*** 480,486 ****
  		TupleTableSlot *slot;
  
  		/* Evaluate the current input expressions for this aggregate */
! 		slot = ExecProject(peraggstate->evalproj, NULL);
  
  		if (peraggstate->numSortCols > 0)
  		{
--- 480,489 ----
  		TupleTableSlot *slot;
  
  		/* Evaluate the current input expressions for this aggregate */
! 		if (peraggstate->evalproj)
! 			slot = ExecProject(peraggstate->evalproj, NULL);
! 		else
! 			slot = peraggstate->evalslot;
  
  		if (peraggstate->numSortCols > 0)
  		{
*************** ExecInitAgg(Agg *node, EState *estate, i
*** 1728,1738 ****
  		peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
  		ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);
  
! 		/* Set up projection info for evaluation */
! 		peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
! 														aggstate->tmpcontext,
! 														peraggstate->evalslot,
! 														NULL);
  
  		/*
  		 * If we're doing either DISTINCT or ORDER BY, then we have a list of
--- 1731,1744 ----
  		peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
  		ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);
  
! 		/* Set up projection info for evaluation, if agg has any args */
! 		if (aggrefstate->args)
! 			peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
! 															aggstate->tmpcontext,
! 															peraggstate->evalslot,
! 															NULL);
! 		else
! 			peraggstate->evalproj = NULL;
  
  		/*
  		 * If we're doing either DISTINCT or ORDER BY, then we have a list of

bringing the oprofile results to:

samples  %        symbol name
181660   17.9017  heapgettup_pagemode
138049   13.6040  advance_transition_function
102865   10.1368  advance_aggregates
 80948    7.9770  ExecProcNode
 79943    7.8780  heap_getnext
 73384    7.2316  ExecScan
 60607    5.9725  MemoryContextReset
 53889    5.3105  ExecStoreTuple
 46666    4.5987  SeqNext
 40535    3.9945  ExecAgg
 33481    3.2994  int8inc
 32202    3.1733  heapgetpage
 26068    2.5689  AllocSetReset
 18493    1.8224  hash_search_with_hash_value
  8679    0.8553  LWLockAcquire
  6615    0.6519  ExecSeqScan
  6583    0.6487  LWLockRelease
  3928    0.3871  hash_any
  3715    0.3661  ReadBuffer_common

(Note that this, not the stock code, is what corresponds to the 7.5sec runtime I quoted above --- it's about 8.5sec without that change.) At this point what we've got is 25% of the runtime in nodeAgg.c overhead, and it's difficult to see how to get any real improvement without tackling that.
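To put a rough number on the nodeAgg.c share in the post-patch profile, the sketch below sums the relevant symbols (percentages copied from the table above; which symbols to attribute to nodeAgg.c is my own grouping, not something stated by oprofile):

```python
# Post-patch profile percentages, copied from the table above.
post_patch = {
    "heapgettup_pagemode": 17.9017,
    "advance_transition_function": 13.6040,
    "advance_aggregates": 10.1368,
    "ExecProcNode": 7.9770,
    "ExecAgg": 3.9945,
    "int8inc": 3.2994,
}

# Symbols living in nodeAgg.c (this grouping is an assumption).
nodeagg_symbols = ["advance_transition_function", "advance_aggregates", "ExecAgg"]
nodeagg_total = sum(post_patch[s] for s in nodeagg_symbols)

print(round(nodeagg_total, 1))  # ~27.7%, i.e. roughly a quarter of the runtime
```

The exact figure depends on which symbols you count (int8inc, the count() transition function itself, lives elsewhere), but however you slice it, aggregate machinery is eating on the order of a quarter of the runtime.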
Rather than apply the patch shown above, I'm tempted to think about hard-wiring COUNT(*) as a special case in nodeAgg.c, such that we don't go through advance_aggregates/advance_transition_function at all but just increment a counter directly. However, that would very clearly be optimizing COUNT(*) and nothing else. Given the opinions expressed elsewhere in this thread that heavy reliance on COUNT(*) represents bad application design, I'm not sure that such a patch would meet with general approval. Actually, the patch shown above is optimizing COUNT(*) and nothing else too, since it's hard to conceive of any other zero-argument aggregate.

Anyway, if anyone is hot to make COUNT(*) faster, that's where to look. I don't think any of the previous discussion in this thread is on-point at all, except for the parts where people suggested avoiding it.

			regards, tom lane

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance