Search Postgresql Archives

Re: SQL WHERE: many sql or large IN()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This is some good stuff and I can use the explain analyze going forward.
But I can't get these VALUES queries to work.

I checked and I am on version 8.1. but I think from the docs that I should still be able to do this.

queue=> select t.* from test t, (values(4, 23,84884,1,324234)) as v where t.value = v.column1;
ERROR:  syntax error at or near ")" at character 54
LINE 1: ...ect t.* from test t, (values(4, 23,84884,1,324234)) as v whe... ^ Not sure if the carat is in the right place on this email, but it's the second ')' that's the problem.


On Apr 6, 2007, at 2:26 PM, Listmail wrote:

On Fri, 06 Apr 2007 18:45:15 +0200, Markus Schiltknecht <markus@xxxxxxxxxx> wrote:

Hi,

tom wrote:
Initially it seems that the WHERE IN (...) approach takes a turn for the worse when the list gets very large.


Since I use this a lot on webpages, I thought maybe a little benchmark is in order ?

CREATE TABLE test (id SERIAL PRIMARY KEY, value TEXT NOT NULL );
INSERT INTO test (value) SELECT * FROM generate_series( 1, 1000000 );
CREATE INDEX test_value ON test( value );
ANALYZE test;

My script runs EXPLAIN ANALYZE 10 times and keeps the fastest timing, then displays it. Then it executes the query enough times so that 1 second is elapsed (also fetching the results), then prints the real runtime.

Well, the table is clustered (since it was created from generate_series) so of course, bitmap index scan rules. Get the first 300 rows of the table using column 'value' which is an indexed TEXT column.
	(using 'id' which is an integer is a bit faster)

SELECT * FROM test WHERE value IN <300 values>
Bitmap Heap Scan on test (cost=1225.83..2317.08 rows=300 width=13) (actual time=3.736..3.807 rows=300 loops=1)
  Recheck Cond: (value = ANY ('{<300 values>}'::text[]))
-> Bitmap Index Scan on test_value (cost=0.00..1225.75 rows=300 width=0) (actual time=3.717..3.717 rows=300 loops=1)
        Index Cond: (value = ANY ('{<300 values>}'::text[]))
Explain Analyze runtime: 1 x 3.896 ms = 3.896 ms
Real runtime: 1 x 6.027 ms = 6.027 ms (timed on 257 iterations)
----------------------------------------
SELECT t.* FROM test t, ( VALUES <300 values> ) AS v WHERE t.value=v.column1 Nested Loop (cost=0.00..2447.27 rows=300 width=13) (actual time=0.035..4.724 rows=300 loops=1) -> Values Scan on "*VALUES*" (cost=0.00..3.75 rows=300 width=32) (actual time=0.002..0.219 rows=300 loops=1) -> Index Scan using test_value on test t (cost=0.00..8.13 rows=1 width=13) (actual time=0.013..0.014 rows=1 loops=300)
        Index Cond: (t.value = "*VALUES*".column1)
Explain Analyze runtime: 1 x 4.814 ms = 4.814 ms
Real runtime: 1 x 6.786 ms = 6.786 ms (timed on 208 iterations)
----------------------------------------
SELECT * FROM test WHERE value='1'
Index Scan using test_value on test (cost=0.00..8.40 rows=1 width=13) (actual time=0.014..0.015 rows=1 loops=1)
  Index Cond: (value = '1'::text)
Explain Analyze runtime: 300 x 0.032 ms = 9.600 ms
Real runtime: 300 x 0.149 ms = 44.843 ms (timed on 31251 iterations)

Now if we ask for 300 random rows out of the million in the table, which is a lot more likely situation...

SELECT * FROM test WHERE value IN <300 values>
Bitmap Heap Scan on test (cost=1225.83..2317.08 rows=300 width=13) (actual time=4.516..4.945 rows=300 loops=1)
  Recheck Cond: (value = ANY ('{<300 values>}'::text[]))
-> Bitmap Index Scan on test_value (cost=0.00..1225.75 rows=300 width=0) (actual time=4.451..4.451 rows=300 loops=1)
        Index Cond: (value = ANY ('{<300 values>}'::text[]))
Explain Analyze runtime: 1 x 5.034 ms = 5.034 ms
Real runtime: 1 x 7.278 ms = 7.278 ms (timed on 199 iterations)
----------------------------------------
SELECT t.* FROM test t, ( VALUES <300 values> ) AS v WHERE t.value=v.column1 Nested Loop (cost=0.00..2447.27 rows=300 width=13) (actual time=0.046..5.503 rows=300 loops=1) -> Values Scan on "*VALUES*" (cost=0.00..3.75 rows=300 width=32) (actual time=0.001..0.200 rows=300 loops=1) -> Index Scan using test_value on test t (cost=0.00..8.13 rows=1 width=13) (actual time=0.016..0.016 rows=1 loops=300)
        Index Cond: (t.value = "*VALUES*".column1)
Explain Analyze runtime: 1 x 5.625 ms = 5.625 ms
Real runtime: 1 x 7.572 ms = 7.572 ms (timed on 178 iterations)

Doing a query per row is a lot slower than it appears in EXPLAIN ANALYZE when you add all the overhead. In fact it sucks. That was ecpected of course. I could tell you about when I got this php/ mysql website where the previous developer's idea of a join was a select in a PHP for() loop because "joins are slow". It was hell.

I tried to query 1, 2, 10, 1000, 10k and 100k random rows (thats abuse) and : - one query per row is (obviously) the slowest unless you select only one row - IN reverts to a single index scan if it contains one value, else it always uses a bitmap scan - VALUES join uses nested loop until about 100k rows where it switches to hash join

I also tried on a real table with many columns. In that case, there are less tuples per page, and bitmap scans are more efficient than nested loops, so IN wins.
	So, IN() does not turn to crap anymore like it used to !

However, if you are asking for this question because you don't want to use temporary tables, it seems that temp tables have been upgraded so they are now usable even in a webpage (well, not ALL webpages, but that big nasty slow webpage, you know which one I'm talking about).

	
	







---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux