Re: Air-traffic benchmark

Lefteris <lsidir@xxxxxxxxx> · Fri, 8 Jan 2010 00:08:57 +0100

On Thu, Jan 7, 2010 at 11:57 PM, Lefteris <lsidir@xxxxxxxxx> wrote:
> Hi Greg,
>
> thank you for your help. The changes I did on the dataset was just
> removing the last comma from the CSV files as it was interpreted by pg
> as an extra column. The schema I used, the load script and queries can
> be found at:
>
> http://homepages.cwi.nl/~lsidir/postgres/
>
> (I understood that if I attach these files here, my email will not
> reach the list so I give you a link to download them).
>
> Also since you are interesting on the benchmark, you can also check
>
> http://homepages.cwi.nl/~mk/ontimeReport
>
> for a report of various experiments with MonetDB and comparison with
> previously published numbers.
>
> The schema I used for pg is slightly different from that one of
> MonetDB since the delay fields could not be parsed by pg as integers
> but only as varchar(4). Hence the extra index on DepDelay field:)

Sorry, I mean the ArrTime DepTime fields were changed, because they
apear on the data as HHMM, but these fields were not used on the
queries. The index on DepDelay was done for q3,4,5 and 7

>
> Also at http://homepages.cwi.nl/~lsidir/PostgreSQL-ontimeReport you
> can see the detailed times I got from postgres.
>
> I really appreciate your help! this is a great opportunity for me to
> get some feeling and insights on postgres since I never had the chance
> to use it in a large scale project.
>
> lefteris
>
>
> On Thu, Jan 7, 2010 at 11:21 PM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
>> Lefteris wrote:
>>>
>>> So we all agree that the problem is on the scans:)
>>>
>>> So the next question is why changing shared memory buffers will fix
>>> that? i only have one session with one connection, do I have like many
>>> reader workers or something?
>>>
>>
>> I wouldn't expect it to.  Large sequential scans like this one are optimized
>> in PostgreSQL to only use up a small portion of the shared_buffers cache.
>>  Allocating more RAM to the database won't improve the fact that you're
>> spending the whole time waiting for physical I/O to happen very much.
>>
>> What might help is increasing effective_cache_size a lot though, because
>> there you might discover the database switching to all new sorts of plans
>> for some of these queries.  But, again, that doesn't impact the situation
>> where a sequential scan is the only approach.
>>
>> I have this whole data set on my PC already and have been trying to find
>> time to get it loaded and start my own tests here, it is a quite interesting
>> set of information.  Can you tell me what you had to do in order to get it
>> running in PostgreSQL?  If you made any customizations there, I'd like to
>> get a copy of them.  Would save me some time and help me get to where I
>> could give suggestions out if I had a "pgdumpall --schema-only" dump from
>> your database for example, or however you got the schema into there, and the
>> set of PostgreSQL-compatible queries you're using.
>>
>> By the way:  if anybody else wants to join in, here's a script that
>> generates a script to download the whole data set:
>>
>> #!/usr/bin/env python
>> for y in range(1988,2010):
>>   for m in range(1,13):
>>       print "wget --limit-rate=100k
>> http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_%s_%s.zip";
>> % (y,m)
>>
>> It's 3.8GB of download that uncompresses into 46GB of CSV data, which is why
>> I put the rate limiter on there--kept it from clogging my entire Internet
>> connection.
>>
>> --
>> Greg Smith    2ndQuadrant   Baltimore, MD
>> PostgreSQL Training, Services and Support
>> greg@xxxxxxxxxxxxxxx  www.2ndQuadrant.com
>>
>>
>

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance