Re: Planner: rows=1 after "similar to" where condition.

"Joris Dobbelsteen" <Joris@xxxxxxxxxxxxxxxxxxxxx> · Tue, 4 Mar 2008 22:32:46 -0000

>-----Original Message-----
>From: pgsql-general-owner@xxxxxxxxxxxxxx 
>[mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Joris 
>Dobbelsteen
>Sent: Monday, 25 February 2008 17:08
>To: Tom Lane
>Cc: Gregory Stark; Scott Marlowe; pgsql-general@xxxxxxxxxxxxxx
>Subject: Re:  Planner: rows=1 after "similar to" 
>where condition. 
>
>>-----Original Message-----
>>From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx]
>>Sent: Monday, 25 February 2008 16:34
>>To: Joris Dobbelsteen
>>Cc: Gregory Stark; Scott Marlowe; pgsql-general@xxxxxxxxxxxxxx
>>Subject: Re:  Planner: rows=1 after "similar to" 
>>where condition. 
>>
>>"Joris Dobbelsteen" <Joris@xxxxxxxxxxxxxxxxxxxxx> writes:
>>> "Bitmap Heap Scan on log_syslog syslog  (cost=11168.32..16988.84
>>> rows=1
>>> width=221) (actual time=11145.729..30067.606 rows=212 loops=1)"
>>> "  Recheck Cond: (((program)::text = 'amavis'::text) AND 
>>> ((facility)::text = 'mail'::text))"
>>> "  Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~
>>> '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed
>>[A-Za-z0-9]+,
>>> [][0-9\\.]* <[^<>]+> -> <[^<>]+>, Message-ID: <[^<>]+>,
>>> (Resent-Message-ID: <[^<>]+>, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+,
>>> queued.as: [^ ,]+, [0-9]+ ms)$'::text))"
>>
>>It's not too surprising that you'd get a small selectivity 
>estimate for 
>>such a long regexp; the default estimate is just based on the 
>amount of 
>>fixed text in the pattern, and you've got a lot.
>>
>>If you increase the stats target for the column to 100 or 
>more then it 
>>will try actually applying the regexp to all the histogram entries.
>>That might or might not give you a better estimate.
>
>I will try that, expect result back within a few days (have it 
>collect some better sample set). Unfortunally the regex is not 
>so much for narrowing down the selection, but rather 
>guarenteeing the format of the messages.
>You seem to consider the common case differently, and I can 
>agree for most part. Unfortunally my use-case is different 
>from the expected. That said, might a less aggressive 
>selectivity estimation for long strings work better in the common case?

A new test case (I did a fresh VACUUM ANALYZE with your statistics for
text set to 100):
Arround 5288 rows out of 4.3 Million match.

"Bitmap Heap Scan on log_syslog syslog  (cost=17777.94..53522.27 rows=1
width=226) (actual time=41661.354..92719.083 rows=5288 loops=1)"
"  Recheck Cond: (((program)::text = 'amavis'::text) AND
((facility)::text = 'mail'::text))"
"  Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~
'***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+,
[][0-9\\.]* <[^<>]+> -> <[^<>]+>, Message-ID: <[^<>]+>,
(Resent-Message-ID: <[^<>]+>, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+,
queued.as: [^ ,]+, [0-9]+ ms)$'::text))"
"  ->  BitmapAnd  (cost=17777.94..17777.94 rows=15279 width=0) (actual
time=4641.009..4641.009 rows=0 loops=1)"
"        ->  Bitmap Index Scan on "IX_log_syslog_program"
(cost=0.00..2908.86 rows=113370 width=0) (actual time=2913.718..2913.718
rows=113897 loops=1)"
"              Index Cond: ((program)::text = 'amavis'::text)"
"        ->  Bitmap Index Scan on "IX_log_syslog_facility"
(cost=0.00..14868.57 rows=591426 width=0) (actual
time=1715.591..1715.591 rows=586509 loops=1)"
"              Index Cond: ((facility)::text = 'mail'::text)"
"Total runtime: 92738.389 ms"

Unfortunally, Tom, it seems the data varies to much and is not included
in the histogram. Probably the data varies too much. In this case, a
regex NOT for selection but rather for forcing the input format should
be done differently.
My construction with the regex as "substring()" construction and a
"WHERE substring() IS NOT NULL" seems to give a better estimate in these
cases. The result seems equivalent.

"Bitmap Heap Scan on log_syslog syslog  (cost=17783.78..53966.33
rows=5844 width=226) (actual time=59095.076..110913.152 rows=5295
loops=1)"
"  Recheck Cond: (((program)::text = 'amavis'::text) AND
((facility)::text = 'mail'::text))"
"  Filter: (((priority)::text = 'notice'::text) AND
("substring"((text)::text, 'amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\)
Passed \\"[A-Za-z0-9]+\\", [][0-9.]* <[^<>]+> -> <[^<>]+>, Message-ID:
<[^<>]+>, (Resent-Message-ID: <[^<>]+>, |)mail_id: [^ ,]+, Hits:
[-+0-9.,]+, queued_as: [^ ,]+, [0-9]+ ms'::text, '\\'::text) IS NOT
NULL))"
"  ->  BitmapAnd  (cost=17783.78..17783.78 rows=15279 width=0) (actual
time=4003.657..4003.657 rows=0 loops=1)"
"        ->  Bitmap Index Scan on "IX_log_syslog_program"
(cost=0.00..2908.86 rows=113370 width=0) (actual time=1652.278..1652.278
rows=113939 loops=1)"
"              Index Cond: ((program)::text = 'amavis'::text)"
"        ->  Bitmap Index Scan on "IX_log_syslog_facility"
(cost=0.00..14868.57 rows=591426 width=0) (actual
time=2339.943..2339.943 rows=586653 loops=1)"
"              Index Cond: ((facility)::text = 'mail'::text)"
"Total runtime: 110921.978 ms"

Note: few added rows in second run is due to the fact that this is a
live table that receives input continuesly.

Concluding:
Your estimator is really great and seems to give pretty good estimates!
Except for regular expressions, which seem more tricky in this regard.

A good note might be to NOT use regex for forcing a format, but rather
quite strict selection only. I believe your estimator might be a bit too
rough for long regexes.

- Joris

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org/