>-----Original Message----- >From: pgsql-general-owner@xxxxxxxxxxxxxx >[mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Joris >Dobbelsteen >Sent: Monday, 25 February 2008 17:08 >To: Tom Lane >Cc: Gregory Stark; Scott Marlowe; pgsql-general@xxxxxxxxxxxxxx >Subject: Re: Planner: rows=1 after "similar to" >where condition. > >>-----Original Message----- >>From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx] >>Sent: Monday, 25 February 2008 16:34 >>To: Joris Dobbelsteen >>Cc: Gregory Stark; Scott Marlowe; pgsql-general@xxxxxxxxxxxxxx >>Subject: Re: Planner: rows=1 after "similar to" >>where condition. >> >>"Joris Dobbelsteen" <Joris@xxxxxxxxxxxxxxxxxxxxx> writes: >>> "Bitmap Heap Scan on log_syslog syslog (cost=11168.32..16988.84 >>> rows=1 >>> width=221) (actual time=11145.729..30067.606 rows=212 loops=1)" >>> " Recheck Cond: (((program)::text = 'amavis'::text) AND >>> ((facility)::text = 'mail'::text))" >>> " Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ >>> '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed >>[A-Za-z0-9]+, >>> [][0-9\\.]* <[^<>]+> -> <[^<>]+>, Message-ID: <[^<>]+>, >>> (Resent-Message-ID: <[^<>]+>, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, >>> queued.as: [^ ,]+, [0-9]+ ms)$'::text))" >> >>It's not too surprising that you'd get a small selectivity >estimate for >>such a long regexp; the default estimate is just based on the >amount of >>fixed text in the pattern, and you've got a lot. >> >>If you increase the stats target for the column to 100 or >more then it >>will try actually applying the regexp to all the histogram entries. >>That might or might not give you a better estimate. > >I will try that, expect result back within a few days (have it >collect some better sample set). Unfortunally the regex is not >so much for narrowing down the selection, but rather >guarenteeing the format of the messages. >You seem to consider the common case differently, and I can >agree for most part. Unfortunally my use-case is different >from the expected. That said, might a less aggressive >selectivity estimation for long strings work better in the common case? A new test case (I did a fresh VACUUM ANALYZE with your statistics for text set to 100): Arround 5288 rows out of 4.3 Million match. "Bitmap Heap Scan on log_syslog syslog (cost=17777.94..53522.27 rows=1 width=226) (actual time=41661.354..92719.083 rows=5288 loops=1)" " Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text))" " Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~ '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+, [][0-9\\.]* <[^<>]+> -> <[^<>]+>, Message-ID: <[^<>]+>, (Resent-Message-ID: <[^<>]+>, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+, queued.as: [^ ,]+, [0-9]+ ms)$'::text))" " -> BitmapAnd (cost=17777.94..17777.94 rows=15279 width=0) (actual time=4641.009..4641.009 rows=0 loops=1)" " -> Bitmap Index Scan on "IX_log_syslog_program" (cost=0.00..2908.86 rows=113370 width=0) (actual time=2913.718..2913.718 rows=113897 loops=1)" " Index Cond: ((program)::text = 'amavis'::text)" " -> Bitmap Index Scan on "IX_log_syslog_facility" (cost=0.00..14868.57 rows=591426 width=0) (actual time=1715.591..1715.591 rows=586509 loops=1)" " Index Cond: ((facility)::text = 'mail'::text)" "Total runtime: 92738.389 ms" Unfortunally, Tom, it seems the data varies to much and is not included in the histogram. Probably the data varies too much. In this case, a regex NOT for selection but rather for forcing the input format should be done differently. My construction with the regex as "substring()" construction and a "WHERE substring() IS NOT NULL" seems to give a better estimate in these cases. The result seems equivalent. "Bitmap Heap Scan on log_syslog syslog (cost=17783.78..53966.33 rows=5844 width=226) (actual time=59095.076..110913.152 rows=5295 loops=1)" " Recheck Cond: (((program)::text = 'amavis'::text) AND ((facility)::text = 'mail'::text))" " Filter: (((priority)::text = 'notice'::text) AND ("substring"((text)::text, 'amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed \\"[A-Za-z0-9]+\\", [][0-9.]* <[^<>]+> -> <[^<>]+>, Message-ID: <[^<>]+>, (Resent-Message-ID: <[^<>]+>, |)mail_id: [^ ,]+, Hits: [-+0-9.,]+, queued_as: [^ ,]+, [0-9]+ ms'::text, '\\'::text) IS NOT NULL))" " -> BitmapAnd (cost=17783.78..17783.78 rows=15279 width=0) (actual time=4003.657..4003.657 rows=0 loops=1)" " -> Bitmap Index Scan on "IX_log_syslog_program" (cost=0.00..2908.86 rows=113370 width=0) (actual time=1652.278..1652.278 rows=113939 loops=1)" " Index Cond: ((program)::text = 'amavis'::text)" " -> Bitmap Index Scan on "IX_log_syslog_facility" (cost=0.00..14868.57 rows=591426 width=0) (actual time=2339.943..2339.943 rows=586653 loops=1)" " Index Cond: ((facility)::text = 'mail'::text)" "Total runtime: 110921.978 ms" Note: few added rows in second run is due to the fact that this is a live table that receives input continuesly. Concluding: Your estimator is really great and seems to give pretty good estimates! Except for regular expressions, which seem more tricky in this regard. A good note might be to NOT use regex for forcing a format, but rather quite strict selection only. I believe your estimator might be a bit too rough for long regexes. - Joris ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org/