On Fri, Aug 7, 2009 at 2:17 PM, Josh Kupershmidt<schmiddy@xxxxxxxxx> wrote: > Just stumbled across this recent article published in the > Communications of the ACM: > > http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext > > The author shares some insights relating to difficulties processing a > 6.75 billion-row > table, a dummy table representing census-type data for everyone on earth, in > Postgres. > > I'd really like to replicate the author's experiment, but it's not clear from > the article what his table definition looks like. He claims to be using a > 16-byte record to store the several columns he needs for each row, so perhaps > he's using a user-defined type? > > The author implies with his definition of "big data" that the dataset he > analyzed is "... too large to be placed in a relational database... ". From > Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when > run on 6.75 billion rows. This amount of time for the query didn't seem > surprising to me given how many rows he has to process, but in a recent post > on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres > ran the same SELECT query in 105 minutes! This would be very impressive (a > 10-fold improvement over Postgres) if true. Well, from the article, I got the feeling he never showed up here on the list to ask for help, and he just assumed he knew enough about postgresql to say it couldn't scale well. I just checked the archives, and his name doesn't show up. When you look at his slides, this one makes we wonder about a few points: http://deliveryimages.acm.org/10.1145/1540000/1536632/figs/f3.jpg He was using 8 15kSAS in RAID-5. Just the fact that he's using RAID-5 to test makes me wonder, but for his mostly-read workload it's useful. But on his machine he was only getting 53MB/second sequential reads? That makes no sense. I was getting 50MB/s from a 4 disk SATA RAID on older 120G hard drives years ago. SAS drives haven't been around that long really, so I can't imagine having 7 disks (1 for parity) and only getting 53/7 or 7.5MB/second from them. That's horrible. I had 9 Gig 5.25 full height drives faster than that back in the day, on eight bit scsi controllers. His memory read speed was pretty bad too at only 350MB/s. I have a 12 drive RAID-10 that can outrun his memory reads. So I tend to think his OS was setup poorly, or his hardware was broken, or something like that. > The author complained that "on larger tables [Postgres' planner] switched to > sorting by grouping columns", which he blamed for the slow query execution. I > don't personally see this plan as a problem, but maybe someone can enlighten > me. I'm sure that if he was on faster hardware it might have been quite a bit faster. I'd love to try his test on a real server with RAID-10 and lots of memory. I'm certain I could get the run time down by a couple factors. I wonder if he cranked up work_mem? I wonder if he even upped shared_buffers? > One intriguing tidbit I picked up from the article: "in modern systems, as > demonstrated in the figure, random access to memory is typically slower than > sequential access to disk." In hindsight, this seems plausible (since modern > disks can sustain sequential reads at well over 100MB/sec). This is generally always true. But his numbers are off by factors for a modern system. Pentium IIs could sequentially read in the several hundreds of megs per second from memory. Any modern piece of kit, including my laptop, can do much much better than 350Meg/second from memory. I wonder if he'd make his work available to mess with, as it seems he did a pretty poor job setting up his database server / OS for this test. At the very least I wonder if he has a colleague on this list who might point him to us so we can try to help him improve the dismal performance he seems to be getting. Or maybe he could just google "postgresql performance tuning" and take it from there... -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance