On Fri, Aug 7, 2009 at 9:17 PM, Josh Kupershmidt<schmiddy@xxxxxxxxx> wrote: > Just stumbled across this recent article published in the > Communications of the ACM: > > http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext > > The author shares some insights relating to difficulties processing a > 6.75 billion-row > table, a dummy table representing census-type data for everyone on earth, in > Postgres. > > I'd really like to replicate the author's experiment, but it's not clear from > the article what his table definition looks like. He claims to be using a > 16-byte record to store the several columns he needs for each row, so perhaps > he's using a user-defined type? or four integers, or who knows. Postgres's per-row overhead is 24 bytes plus a 16-bit line pointer so you're talking about 42 bytes per row. There's per-page overhead and alignment but in this case it shouldn't be much. > The author implies with his definition of "big data" that the dataset he > analyzed is "... too large to be placed in a relational database... ". From > Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when > run on 6.75 billion rows. This amount of time for the query didn't seem > surprising to me given how many rows he has to process, but in a recent post > on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres > ran the same SELECT query in 105 minutes! This would be very impressive (a > 10-fold improvement over Postgres) if true. 6.75 billion * 42 bytes is 283.5GB. Assuming you stick that on a single spindle capable of 100MB/s: You have: 283.5GB / (100MB/s) You want: min * 47.25 So something's not adding up. > One intriguing tidbit I picked up from the article: "in modern systems, as > demonstrated in the figure, random access to memory is typically slower than > sequential access to disk." In hindsight, this seems plausible (since modern > disks can sustain sequential reads at well over 100MB/sec). Sure, but the slowest PCIe bus can sustain 1GB/s and your memory bandwidth is probably at least 8GB/s. -- greg http://mit.edu/~gsstark/resume.pdf -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance