Re: PG-related ACM Article: "The Pathologies of Big Data"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Aug 7, 2009 at 9:17 PM, Josh Kupershmidt<schmiddy@xxxxxxxxx> wrote:
> Just stumbled across this recent article published in the
> Communications of the ACM:
>
> http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext
>
> The author shares some insights relating to difficulties processing a
> 6.75 billion-row
> table, a dummy table representing census-type data for everyone on earth, in
> Postgres.
>
> I'd really like to replicate the author's experiment, but it's not clear from
> the article what his table definition looks like. He claims to be using a
> 16-byte record to store the several columns he needs for each row, so perhaps
> he's using a user-defined type?

or four integers, or who knows. Postgres's per-row overhead is 24
bytes plus a 16-bit line pointer so you're talking about 42 bytes per
row. There's per-page overhead and alignment but in this case it
shouldn't be much.



> The author implies with his definition of "big data" that the dataset he
> analyzed is "... too large to be placed in a relational database... ". From
> Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
> run on 6.75 billion rows. This amount of time for the query didn't seem
> surprising to me given how many rows he has to process, but in a recent post
> on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
> ran the same SELECT query in 105 minutes! This would be very impressive (a
> 10-fold improvement over Postgres) if true.

6.75 billion * 42 bytes is 283.5GB.

Assuming you stick that on a single spindle capable of 100MB/s:

You have: 283.5GB / (100MB/s)
You want: min
        * 47.25

So something's not adding up.

> One intriguing tidbit I picked up from the article: "in modern systems, as
> demonstrated in the figure, random access to memory is typically slower than
> sequential access to disk." In hindsight, this seems plausible (since modern
> disks can sustain sequential reads at well over 100MB/sec).

Sure, but the slowest PCIe bus can sustain 1GB/s and your memory
bandwidth is probably at least 8GB/s.

-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux