I'm not sure if decreasing BLKSZ is the way to go. It would allow you to have more smaller blocks in memory, but the actual coverage of the index would remain the same; if only 33% of the index fits in memory with the 8K BLKSZ then only 33% would fit in memory with a 4k BLKSZ. I can see where you're going if the tree nodes for all 15 million key entries fit in memory as well as the most recent nodes for the logtime nodes lower down in the index; basically trying to make sure that the "right" 33% of the index is in memory. But it seems like it might be more useful to have two indexes, one on logtime and one on key. Inserts into the logtime index would be correlated with your insert order and as such be cache-friendly so that's not an issue. The index on just the key column would be at least as small as the active subset of a combined index, so performance should be at least as good as you could possibly achieve by reducing BLKSIZE. PG 8.1 is smart enough to use a bitmap index scan to combine the two indexes at query time; if that gives you adequate performance then it would be simpler than reducing BLKSIZE. -- Mark Lewis On Mon, 2006-09-25 at 17:54 -0400, Marc Morin wrote: > > Would it be possible to change the primary key to > > (logtime,key)? This could help keeping the "working window" small. > > No, the application accessing the data wants all the rows between start > and end time for a particular key value. > > > > > Secondly, the real working set is smaller, as the rows are > > all inserted at the end of the table, filling each page until > > it's full, so only the last pages are accessed. There's no > > relation between the index order, and the order of data on > > disk, unless you CLUSTER. > > I'd theorizing that my problem is in updating the index itself and not > the heap. Insert order > Refers to the order by which the applications are inserting the rows and > as such, the order by > Which the index is being updated. This in turn, is causing the b-tree > to be traverse. Problem > Is the working set of blocks at the bottom of the btree is too big for > my cache. > > > > > > Any comment on other affects or gotchas with lowering the size of > > > BLKSZ? Currently, our database is thrashing its cache of blocks we > > > we're getting only ~100 inserts/second, every insert results in a > > > evict-read-modify operation. > > > > I'm not shure that's the correct diagnosis. > > > > Do you have one transaction per insert? Every transaction > > means a forced sync to the disk, so you won't get more than > > about 100-200 commits per second, depending on your actual > > disk rotation speed. > > No, an insert consists of roughly 10,000+ rows per transaction block. > > > > > To improve concurrency of the "numer of inserters" running in > > parallel, try to tweak the config variables commit_delay and > > commit_sibling, so you get a higher overall throughput at > > cost of an increased delay per connection, and increase the > > number of inserters. Using sensible tweaking, the throughput > > should scale nearly linear with the number of backens. :-) > > I don't think this will help us here due to large transactions already. > > > > > If feasible for your application, you can also bundle several > > log entries into a single transaction. If you're CPU bound, > > you can use COPY instead of INSERT or (if you can wait for > > 8.2) the new multi-row INSERT to further improve performance, > > but I doubt that you're CPU bound. > > > > > The only way to "really" get over the sync limit is to have > > (at least) the WAL on a battery backed ram / SSD media that > > has no "spinning disk" > > physical limit, or abandon crash safety by turning fsync off. > > Again, problem is not with WAL writing, already on it's own raid1 disk > pair. The > I/O pattern we see is about 1-2% load on WAL and 100% load on the array > holding the indexes and tables. Throughput is very low, something like > 150k-200K bytes/second of real rows being deposited on the disk. > > The disks are busy seeking all over the disk platter to fetch a block, > add a single row, then seek to another spot and write back a previously > dirty buffer.... > > > > > Thanks, > > Markus. > > -- > > Markus Schaber | Logical Tracking&Tracing International AG > > Dipl. Inf. | Software Development GIS > > > > Fight against software patents in Europe! www.ffii.org > > www.nosoftwarepatents.org > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster