So I spent a bit of quality time with oprofile this morning, and found once again that there's no substitute for having actual data before theorizing.

Test case software: current Git HEAD (plus one code change explained below), compiled with --enable-debug to support oprofile, cassert off; no other special configure options. Running on current Fedora 13 (gcc 4.4.4 in particular). All postgresql.conf options are out-of-the-box.

Test case hardware: recently purchased mid-grade desktop, dual Xeon E5503 processors (Nehalem cores, 2GHz), 4GB DDR3-800 RAM, no-name SATA disk.

Test query: "select count(*) from t", where t has 4 nonnull integer columns and 81920000 rows, occupying 3459MB. I chose that size specifically to fit into available RAM, so that on repeated executions no physical I/O will occur.

On this setup I find that "select count(*)" runs in about 7.5sec when the data is fully cached in RAM, for a scanning speed of 460MB/sec. This is well in excess of what the machine's disk hardware can do: bonnie++ rates the machine's disk read speed at 152MB/sec. So in theory PG should be able to completely saturate the disk when processing a table bigger than RAM. In reality, the test case run time if I've just flushed cache is about 28sec, working out to a scan rate of 123MB/sec. I expect that if I'd bothered to tune the kernel readahead parameters as outlined earlier in this thread, I could get to 150MB/sec.

Now of course this disk setup is far from industrial strength, but the processor isn't what you'd put in a serious database server either (in particular, its available memory bandwidth is well behind the curve). Also, the table is pretty narrow (only 16 payload bytes per row); since the CPU cost is mostly per-row rather than per-byte, any wider test table would show pretty much linear scaling of achievable scan rate versus table width. So I don't see much support here at all for the notion that we scan slower than available disk bandwidth.
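As a quick sanity check on the arithmetic above, here is a small Python sketch; every number in it is copied from the text, nothing is newly measured:

```python
# Scan-rate arithmetic for the test case described above.
# All inputs are taken directly from the text.
TABLE_MB = 3459          # size of table t
CACHED_SECS = 7.5        # runtime with the table fully cached in RAM
COLD_SECS = 28           # runtime right after flushing the OS cache

cached_rate = TABLE_MB / CACHED_SECS   # ~461 MB/s, quoted as 460MB/sec
cold_rate = TABLE_MB / COLD_SECS       # ~123.5 MB/s, quoted as 123MB/sec

print(round(cached_rate, 1), round(cold_rate, 1))
```

Both results line up with the quoted 460MB/sec and 123MB/sec figures, and the cached rate is indeed about 3x the bonnie++ disk figure of 152MB/sec.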
Further details from poking at it with oprofile: in the fully-cached case the CPU time is about 80% Postgres and 20% kernel. That kernel time is of course all to do with moving pages from kernel disk buffers into Postgres shared memory. Although I've not bothered to increase shared_buffers from the default 32MB, it wouldn't matter on this benchmark unless I were able to make shared_buffers hold the entire table ... and even then I'd only save 20%.

oprofile further shows that (with stock Postgres sources) the userspace runtime breaks down like this:

samples  %        symbol name
141267   13.0810  heapgettup_pagemode
 85947    7.9585  advance_aggregates
 83031    7.6885  ExecProject
 78975    7.3129  advance_transition_function
 75060    6.9504  heapgetpage
 73540    6.8096  ExecClearTuple
 69355    6.4221  ExecProcNode
 59288    5.4899  heap_getnext
 57745    5.3470  ExecScan
 55618    5.1501  HeapTupleSatisfiesMVCC
 47057    4.3574  MemoryContextReset
 41904    3.8802  ExecStoreTuple
 37146    3.4396  SeqNext
 32206    2.9822  ExecAgg
 22135    2.0496  XidInMVCCSnapshot
 21142    1.9577  int8inc
 19280    1.7853  AllocSetReset
 18211    1.6863  hash_search_with_hash_value
 16285    1.5079  TransactionIdPrecedes

I also looked at the source-line-level breakdown, though that's too bulky to post here. The most interesting fact here is that tuple visibility testing (MVCC) overhead is simply nonexistent: it'd be in heapgetpage() if it were being done, which it isn't because all the pages of the table have the PageIsAllVisible bit set. In a previous run where those bits weren't set but the per-tuple hint bits were, visibility testing still only ate a percent or two of the runtime. So the theory some people have espoused in this thread, that visibility testing is the bottleneck, doesn't hold water either. If you go back and look at previous pgsql-hackers discussions about that, what people have been worried about is not the CPU cost of visibility testing but the need for indexscan queries to visit the heap for no other purpose than to check the visibility flags.
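Slicing a flat profile like the one above is easy to script. Here is a plain-Python sketch (the percentages are copied from the top of the table; the grouping helper is mine, not part of oprofile) that totals arbitrary groups of symbols:

```python
# Sum selected symbols out of the oprofile flat profile shown above.
# Percent-of-runtime figures copied verbatim from the profile.
profile = {
    "heapgettup_pagemode": 13.0810,
    "advance_aggregates": 7.9585,
    "ExecProject": 7.6885,
    "advance_transition_function": 7.3129,
    "heapgetpage": 6.9504,
    "ExecClearTuple": 6.8096,
}

def share(symbols):
    """Total percentage of runtime attributed to the given symbols."""
    return sum(profile[s] for s in symbols)

# ExecProject + ExecClearTuple together account for ~14.5% of the runtime.
print(round(share(["ExecProject", "ExecClearTuple"]), 2))
```

That ExecProject/ExecClearTuple pair is the "nearly 15%" of aggregate-argument evaluation overhead that the patch below goes after.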
In a seqscan it's not going to matter.

I looked a bit more closely at the heapgettup_pagemode timing. The lines shown by opannotate as more than 0.1 percent of the runtime are:

 22545  2.2074 :{ /* heapgettup_pagemode total: 153737 15.0528 */
  5685  0.5566 :    bool        backward = ScanDirectionIsBackward(dir);
  5789  0.5668 :    if (!scan->rs_inited)
  5693  0.5574 :        lineindex = scan->rs_cindex + 1;
 11429  1.1190 :        dp = (Page) BufferGetPage(scan->rs_cbuf);
  5693  0.5574 :        linesleft = lines - lineindex;
  5766  0.5646 :    while (linesleft > 0)
  5129  0.5022 :        lineoff = scan->rs_vistuples[lineindex];
 44461  4.3533 :        tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
 11135  1.0903 :        tuple->t_len = ItemIdGetLength(lpp);
  5692  0.5573 :        if (key != NULL)
  5773  0.5653 :            HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
  5674  0.5556 :    scan->rs_cindex = lineindex;
 11406  1.1168 :}

There doesn't seem to be a whole lot of room for improvement there. Maybe we could shave a couple percent with some tenser coding (I'm wondering why HeapKeyTest is being reached, in particular, when there's no WHERE clause). But any local changes here will be marginal at best.

One thing I did find is that the time spent in ExecProject/ExecClearTuple, amounting to nearly 15% of the runtime, is just for evaluating the arguments of the aggregate ... and count(*) hasn't got any arguments. So a patch like this improves the run speed by about 15%:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a7dafeb..051e70c 100644
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** advance_aggregates(AggState *aggstate, A
*** 480,486 ****
  		TupleTableSlot *slot;
  
  		/* Evaluate the current input expressions for this aggregate */
! 		slot = ExecProject(peraggstate->evalproj, NULL);
  
  		if (peraggstate->numSortCols > 0)
  		{
--- 480,489 ----
  		TupleTableSlot *slot;
  
  		/* Evaluate the current input expressions for this aggregate */
! 		if (peraggstate->evalproj)
! 			slot = ExecProject(peraggstate->evalproj, NULL);
! 		else
! 			slot = peraggstate->evalslot;
  
  		if (peraggstate->numSortCols > 0)
  		{
*************** ExecInitAgg(Agg *node, EState *estate, i
*** 1728,1738 ****
  		peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
  		ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);
  
! 		/* Set up projection info for evaluation */
! 		peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
! 														aggstate->tmpcontext,
! 														peraggstate->evalslot,
! 														NULL);
  
  		/*
  		 * If we're doing either DISTINCT or ORDER BY, then we have a list of
--- 1731,1744 ----
  		peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
  		ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);
  
! 		/* Set up projection info for evaluation, if agg has any args */
! 		if (aggrefstate->args)
! 			peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
! 															aggstate->tmpcontext,
! 															peraggstate->evalslot,
! 															NULL);
! 		else
! 			peraggstate->evalproj = NULL;
  
  		/*
  		 * If we're doing either DISTINCT or ORDER BY, then we have a list of

bringing the oprofile results to:

samples  %        symbol name
181660   17.9017  heapgettup_pagemode
138049   13.6040  advance_transition_function
102865   10.1368  advance_aggregates
 80948    7.9770  ExecProcNode
 79943    7.8780  heap_getnext
 73384    7.2316  ExecScan
 60607    5.9725  MemoryContextReset
 53889    5.3105  ExecStoreTuple
 46666    4.5987  SeqNext
 40535    3.9945  ExecAgg
 33481    3.2994  int8inc
 32202    3.1733  heapgetpage
 26068    2.5689  AllocSetReset
 18493    1.8224  hash_search_with_hash_value
  8679    0.8553  LWLockAcquire
  6615    0.6519  ExecSeqScan
  6583    0.6487  LWLockRelease
  3928    0.3871  hash_any
  3715    0.3661  ReadBuffer_common

(Note that this, not the stock code, is what corresponds to the 7.5sec runtime I quoted above --- it's about 8.5sec without that change.) At this point what we've got is 25% of the runtime in nodeAgg.c overhead, and it's difficult to see how to get any real improvement without tackling that.
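To put a rough number on the nodeAgg.c share in the post-patch profile, the sketch below sums the relevant symbols (percentages copied from the table above; which symbols to attribute to nodeAgg.c is my own grouping, not something stated by oprofile):

```python
# Post-patch profile percentages, copied from the table above.
post_patch = {
    "heapgettup_pagemode": 17.9017,
    "advance_transition_function": 13.6040,
    "advance_aggregates": 10.1368,
    "ExecProcNode": 7.9770,
    "ExecAgg": 3.9945,
    "int8inc": 3.2994,
}

# Symbols living in nodeAgg.c (this grouping is an assumption).
nodeagg_symbols = ["advance_transition_function", "advance_aggregates", "ExecAgg"]
nodeagg_total = sum(post_patch[s] for s in nodeagg_symbols)

print(round(nodeagg_total, 1))  # ~27.7%, i.e. roughly a quarter of the runtime
```

The exact figure depends on which symbols you count (int8inc, the count() transition function itself, lives elsewhere), but however you slice it, aggregate machinery is eating on the order of a quarter of the runtime.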
Rather than apply the patch shown above, I'm tempted to think about hard-wiring COUNT(*) as a special case in nodeAgg.c, such that we don't go through advance_aggregates/advance_transition_function at all but just increment a counter directly. However, that would very clearly be optimizing COUNT(*) and nothing else. Given the opinions expressed elsewhere in this thread that heavy reliance on COUNT(*) represents bad application design, I'm not sure that such a patch would meet with general approval. Actually, the patch shown above is optimizing COUNT(*) and nothing else too, since it's hard to conceive of any other zero-argument aggregate.

Anyway, if anyone is hot to make COUNT(*) faster, that's where to look. I don't think any of the previous discussion in this thread is on-point at all, except for the parts where people suggested avoiding it.

			regards, tom lane

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance