On Tue, 2006-01-10 at 17:21 -0500, Tom Lane wrote:
> Simon Riggs <simon@xxxxxxxxxxxxxxx> writes:
> > I think its OK to use the MCV, but I have a problem with the current
> > heuristics: they only work for randomly generated strings, since the
> > selectivity goes down geometrically with length.
>
> We could certainly use a less aggressive curve for that. You got a
> specific proposal?

Non-left-anchored LIKE is most likely to be used with unstructured,
variable-length data - otherwise we might use SUBSTRING instead. My
proposal would be to assume that LIKE is acting on human-language text
data.

I considered this a while back, but wrote it off in favour of dynamic
sampling; still, it's worth discussing to see whether we can improve on
things without that.

Here's one of the links I reviewed previously:
http://www.ling.lu.se/persons/Joost/Texts/studling.pdf
Sigurd et al [2004]

This shows that the word frequency distribution peaks at 3-letter/
2-phoneme words, then tails off exponentially after that. Clearly, when
the search string is longer than 3 characters the selectivity must tail
off exponentially as well, since we cannot match words shorter than the
search string itself. The search string might be a phrase, but it seems
reasonable to assume that phrase frequency also drops off with length.

It is difficult to decide what to do at len=2 or len=3, and I would be
open to various thoughts, but would default to keeping like_selectivity
as it is now.

Sigurd et al show that word frequency tails off as 0.7^Len beyond
Len=3, so FIXED_CHAR_SEL should not be more than 0.7, but I see no
evidence for it being as low as 0.2 (from the published results). For
simplicity, where Len > 3, I would make the tail-off occur with factor
0.5 rather than 0.2.

We could derive a few more changes from those results, but curbing the
aggressive tail-off would be a simple and easy first step.

Best Regards, Simon Riggs
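
P.S. To make the curve concrete, here is a minimal standalone C sketch
of what I have in mind. The constant names and the helper function are
illustrative only, not the actual selfuncs.c code: keep the current
per-character factor for the first 3 fixed characters, then apply the
gentler 0.5 factor for each character beyond that.

    #include <math.h>
    #include <stdio.h>

    /*
     * Illustrative constants - the real values live in
     * src/backend/utils/adt/selfuncs.c; these names are for
     * sketching the proposal only.
     */
    #define FIXED_CHAR_SEL  0.20    /* current per-character factor */
    #define LONG_CHAR_SEL   0.50    /* proposed factor beyond 3 chars */

    /*
     * Proposed curve: unchanged behaviour for the first 3 fixed
     * characters, then tail off by 0.5 (not 0.2) per additional
     * character, on the assumption that the pattern is matching
     * human-language words/phrases rather than random strings.
     */
    static double
    like_fixed_prefix_sel(int nfixed)
    {
        if (nfixed <= 3)
            return pow(FIXED_CHAR_SEL, nfixed);

        return pow(FIXED_CHAR_SEL, 3) *
               pow(LONG_CHAR_SEL, nfixed - 3);
    }

    int
    main(void)
    {
        /* Compare current vs proposed estimate, 6 fixed chars */
        printf("current : %.6f\n", pow(FIXED_CHAR_SEL, 6));
        printf("proposed: %.6f\n", like_fixed_prefix_sel(6));
        return 0;
    }

For a 6-character string this gives 0.2^3 * 0.5^3 = 0.001, versus
0.2^6 = 0.000064 under the current heuristic - roughly 15 times less
aggressive, while leaving estimates for short strings untouched.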