Re: Query Performance SQL Server vs. Postgresql

Humair Mohammed <humairm@xxxxxxxxxxx> · Sun, 21 Nov 2010 00:00:39 -0600

I am running 64-bit Windows 7 Enterprise with 8G RAM and a Dual Core 2.67 Ghz Intel CPU. Both the SQL 2008 R2 and Postgresql are installed on the same machine. The DISTINCT FROM instead of the COALESCE does not help much. I ran 2 further tests with work_mem modifications (please note memory usage is quite low 650kb, so I am not sure if the work_mem is a factor):
First, I modified the work_mem setting to 1GB (reloaded config) from the default 1MB and I see a response time of 33 seconds. Results below from EXPLAIN ANALYZE:
"Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual time=26742.343..33274.317 rows=3163 loops=1)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.055..40.710 rows=93496 loops=1)"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=25603.460..25603.460 rows=251212 loops=1)"
"        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.050..120.269 rows=251212 loops=1)"
"Total runtime: 33275.028 ms"

Second, I modified the work_mem setting to 2GB (reloaded config) and I see a response time of 38 seconds. Results below from EXPLAIN ANALYZE:

"Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual time=26574.459..38406.422 rows=3163 loops=1)"
"  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text = (pg.question)::text))"
"  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
"  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134) (actual time=0.067..37.938 rows=93496 loops=1)"
"  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual time=26426.127..26426.127 rows=251212 loops=1)"
"        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
"        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212 width=134) (actual time=0.038..115.319 rows=251212 loops=1)"
"Total runtime: 38406.927 ms"

By no means I am trying to compare the 2 products. When I noticed the slow behavior of COALESCE I tried it on SQL Server. And since they are running on the same machine my comment regarding apples to apples. It is possible that this is not an apples to apples comparison other than the fact that it is running on the same machine.

> From: pavel.stehule@xxxxxxxxx
> Date: Thu, 18 Nov 2010 07:14:24 +0100
> Subject: Re: Query Performance SQL Server vs. Postgresql
> To: humairm@xxxxxxxxxxx
> CC: pgsql-performance@xxxxxxxxxxxxxx
> 
> Hello,
> 
> there should be a problem in a statistic, they are out of reality.
> Please, try to use a DISTINCT OF operator now - maybe a statistic will
> be better. Next -  try to increase a work_mem. Hash join is
> untypically slow in your comp.
> 
> Regards
> 
> Pavel Stehule
> 
> 2010/11/17 Humair Mohammed <humairm@xxxxxxxxxxx>:
> > Yes strange indeed, I did rerun ANALYZE and VACCUM. Took 70 seconds to rerun
> > the query. Results from EXPLAIN ANALYZE below:
> > "Hash Join  (cost=16212.30..52586.43 rows=92869 width=17) (actual
> > time=43200.223..49502.874 rows=3163 loops=1)"
> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text =
> > (pg.question)::text))"
> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character varying))::text
> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496 width=134)
> > (actual time=0.009..48.200 rows=93496 loops=1)"
> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134) (actual
> > time=42919.453..42919.453 rows=251212 loops=1)"
> > "        Buckets: 1024  Batches: 64  Memory Usage: 650kB"
> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> > width=134) (actual time=0.119..173.019 rows=251212 loops=1)"
> > "Total runtime: 49503.450 ms"
> >
> >> From: pavel.stehule@xxxxxxxxx
> >> Date: Wed, 17 Nov 2010 05:47:51 +0100
> >> Subject: Re: Query Performance SQL Server vs. Postgresql
> >> To: humairm@xxxxxxxxxxx
> >> CC: pgsql-performance@xxxxxxxxxxxxxx
> >>
> >> 2010/11/17 Humair Mohammed <humairm@xxxxxxxxxxx>:
> >> >
> >> > There are no indexes on the tables either in SQL Server or Postgresql -
> >> > I am
> >> > comparing apples to apples here. I ran ANALYZE on the postgresql tables,
> >> > after that query performance times are still high 42 seconds with
> >> > COALESCE
> >> > and 35 seconds with IS DISTINCT FROM.
> >> > Here is the execution plan from Postgresql for qurey - select pb.id from
> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> >> > pg.question and coalesce(pb.response,'MISSING') <>
> >> > coalesce(pg.response,'MISSING')
> >> > Execution Time: 42 seconds
> >> > "Hash Join  (cost=16212.30..48854.24 rows=93477 width=17)"
> >> > "  Hash Cond: (((pb.id)::text = (pg.id)::text) AND ((pb.question)::text
> >> > =
> >> > (pg.question)::text))"
> >> > "  Join Filter: ((COALESCE(pb.response, 'MISSING'::character
> >> > varying))::text
> >> > <> (COALESCE(pg.response, 'MISSING'::character varying))::text)"
> >> > "  ->  Seq Scan on pivotbad pb  (cost=0.00..2804.96 rows=93496
> >> > width=134)"
> >> > "  ->  Hash  (cost=7537.12..7537.12 rows=251212 width=134)"
> >> > "        ->  Seq Scan on pivotgood pg  (cost=0.00..7537.12 rows=251212
> >> > width=134)"
> >>
> >> this is little bit strange - did you ANALYZE and VACUUM?
> >>
> >> please send result of EXPLAIN ANALYZE
> >>
> >> Pavel
> >>
> >> >
> >> > And here is the execution plan from SQL Server for query - select pb.id
> >> > from
> >> > pivotbad pb inner join pivotgood pg on pb.id = pg.id and pb.question =
> >> > pg.question and isnull(pb.response,'ISNULL')<>
> >> >  isnull(pg.response,'ISNULL')
> >> > Execution Time: < 1 second
> >> > Cost: 1%  |--Parallelism(Gather Streams)
> >> > Cost: 31%       |--Hash Match(Inner Join, HASH:([pb].[ID],
> >> > [pb].[Question])=([pg].[ID], [pg].[Question]),
> >> > RESIDUAL:([master].[dbo].[pivotbad].[ID] as
> >> > [pb].[ID]=[master].[dbo].[pivotgood].[ID] as [pg].[ID] AND
> >> > [master].[dbo].[pivotbad].[Question] as
> >> > [pb].[Question]=[master].[dbo].[pivotgood].[Question] as [pg].[Question]
> >> > AND
> >> > [Expr1006]<>[Expr1007]))
> >> >     Cost: 0%  |--Bitmap(HASH:([pb].[ID], [pb].[Question]),
> >> > DEFINE:([Bitmap1008]))
> >> >             Cost: 0%    |--Compute
> >> > Scalar(DEFINE:([Expr1006]=isnull([master].[dbo].[pivotbad].[Response] as
> >> > [pb].[Response],'ISNULL')))
> >> >             Cost:  6%   |--Parallelism(Repartition Streams, Hash
> >> > Partitioning, PARTITION COLUMNS:([pb].[ID], [pb].[Question]))
> >> >             Cost: 12%  |--Table Scan(OBJECT:([master].[dbo].[pivotbad]
> >> > AS
> >> > [pb]))
> >> >             Cost: 0% |--Compute
> >> > Scalar(DEFINE:([Expr1007]=isnull([master].[dbo].[pivotgood].[Response]
> >> > as
> >> > [pg].[Response],'ISNULL')))
> >> >                 Cost: 17% |--Parallelism(Repartition Streams, Hash
> >> > Partitioning, PARTITION COLUMNS:([pg].[ID], [pg].[Question]))
> >> >                     Cost: 33% |--Table
> >> > Scan(OBJECT:([master].[dbo].[pivotgood] AS [pg]),
> >> > WHERE:(PROBE([Bitmap1008],[master].[dbo].[pivotgood].[ID] as
> >> > [pg].[ID],[master].[dbo].[pivotgood].[Question] as [pg].[Question])))
> >> >
> >> >
> >> >
> >> >> From: pavel.stehule@xxxxxxxxx
> >> >> Date: Tue, 16 Nov 2010 08:12:03 +0100
> >> >> Subject: Re: 
> >> >> To: humairm@xxxxxxxxxxx
> >> >> CC: pgsql-performance@xxxxxxxxxxxxxx
> >> >>
> >> >> 2010/11/15 Humair Mohammed <humairm@xxxxxxxxxxx>:
> >> >> > I have 2 tables with a 200,000 rows of data 3 character/string
> >> >> > columns
> >> >> > ID,
> >> >> > Question and Response. The query below compares the data between the
> >> >> > 2
> >> >> > tables based on ID and Question and if the Response does not match
> >> >> > between
> >> >> > the left table and the right table it identifies the ID's where there
> >> >> > is
> >> >> > a
> >> >> > mismatch. Running the query in SQL Server 2008 using the ISNULL
> >> >> > function
> >> >> > take a few milliseconds. Running the same query in Postgresql takes
> >> >> > over
> >> >> > 70
> >> >> > seconds. The 2 queries are below:
> >> >> > SQL Server 2008 R2 Query
> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> >> > t1.question = t2.question and isnull(t1.response,'ISNULL') <>
> >> >> > isnull(t2.response,'ISNULL')
> >> >>
> >> >> > Postgres 9.1 Query
> >> >> > select t1.id from table1 t1 inner join table2 t2 on t1.id = t2.id and
> >> >> > t1.question = t2.question and coalesce(t1.response,'ISNULL') <>
> >> >> > coalesce(t2.response,'ISNULL')
> >> >> > What gives?
> >> >>
> >> >> I think, so must problem can be in ugly predicate
> >> >> coalesce(t1.response,'ISNULL') <>
> >> >> > coalesce(t2.response,'ISNULL')
> >> >>
> >> >> try use a IS DISTINCT OF operator
> >> >>
> >> >> ... AND t1.response IS DISTINCT t2.response
> >> >>
> >> >> Regards
> >> >>
> >> >> Pavel Stehule
> >> >>
> >> >> p.s. don't use a coalesce in WHERE clause if it is possible.
> >> >
> >