On Thu, Feb 03, 2011 at 09:05:52PM -0500, Robert Haas wrote:
> On Thu, Feb 3, 2011 at 8:37 PM,  <david@xxxxxxx> wrote:
> > On Thu, 3 Feb 2011, Robert Haas wrote:
> >
> >> On Thu, Feb 3, 2011 at 7:39 PM,  <david@xxxxxxx> wrote:
> >>>>
> >>>> Yeah, but you'll be passing the entire table through this separate
> >>>> process that may only need to see 1% of it or less on a large table.
> >>>> If you want to write the code and prove it's better than what we have
> >>>> now, or some other approach that someone else may implement in the
> >>>> meantime, hey, this is an open source project, and I like improvements
> >>>> as much as the next guy.  But my prediction for what it's worth is
> >>>> that the results will suck.  :-)
> >>>
> >>> I will point out that 1% of a very large table can still be a lot of
> >>> disk I/O that is avoided (especially if it's random I/O that's avoided)
> >>
> >> Sure, but I think that trying to avoid it will be costly in other ways
> >> - you'll be streaming a huge volume of data through some auxiliary
> >> process, which will have to apply some algorithm that's very different
> >> from the one we use today.  The reality is that I think there's little
> >> evidence that the way we do ANALYZE now is too expensive.  It's
> >> typically very cheap and works very well.  It's a bit annoying when it
> >> fires off in the middle of a giant data load, so we might need to
> >> change the time of it a little, but if there's a problem with the
> >> operation itself being too costly, this is the first I'm hearing of
> >> it.  We've actually worked *really* hard to make it cheap.
> >
> > I could be misunderstanding things here, but my understanding is that
> > it's 'cheap' in that it has little impact on the database while it is
> > running.
>
> I mean that it's cheap in that it usually takes very little time to
> complete.
>
> > the issue here is that the workflow is
> >
> > load data
> > analyze
> > start work
> >
> > so the cost of analyze in this workflow is not "1% impact on query
> > speed for the next X time", it's "the database can't be used for the
> > next X time while we wait for analyze to finish running"
>
> OK.
>
> > I don't understand why the algorithm would have to be so different than
> > what's done today, surely the analyze thread could easily be tweaked to
> > ignore the rest of the data (assuming we don't have the thread sending
> > the data to analyze do the filtering)
>
> If you want to randomly pick 10,000 rows out of all the rows that are
> going to be inserted in the table without knowing in advance how many
> there will be, how do you do that?  Maybe there's an algorithm, but
> it's not obvious to me.  But mostly, I question how expensive it is to
> have a second process looking at the entire table contents vs. going
> back and rereading a sample of rows at the end.  I can't remember
> anyone ever complaining "ANALYZE took too long to run".  I only
> remember complaints of the form "I had to remember to manually run it
> and I wish it had just happened by itself".
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Probably doomed to be shot down, but since you are effectively inline, you
could sample assuming a range of table row counts: start at the size of
table where random (index) lookups are faster than a sequential scan, and
then at appropriate multiples, 100x, 100*100x, ...  Then you should be
able to generate appropriate statistics.
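Very roughly, I am picturing something like the sketch below (Python,
purely to show the shape of the idea; the sample size, the assumed table
sizes, and the names are all made up for illustration, not anything the
backend actually does).  The load path keeps one candidate sample per
assumed final table size, samples each incoming row at the rate each
assumption would call for, and at the end keeps the sample whose
assumption covers the row count actually seen:

    import random

    TARGET_SAMPLE = 30000                  # made-up target size for the sample
    ASSUMED_SIZES = [10**5, 10**7, 10**9]  # seqscan/index crossover, then ~100x steps

    def sample_inline(rows):
        # Sample a row stream during the load, without knowing the final
        # row count, by sampling at several assumed final table sizes.
        samples = {n: [] for n in ASSUMED_SIZES}
        seen = 0
        for row in rows:
            seen += 1
            for n in ASSUMED_SIZES:
                # Keep the row at the rate an n-row table would need.
                if random.random() < TARGET_SAMPLE / n:
                    samples[n].append(row)
        # Use the sample built for the smallest assumption that covers
        # the number of rows actually seen.
        for n in ASSUMED_SIZES:
            if seen <= n:
                return samples[n]
        return samples[ASSUMED_SIZES[-1]]

Only the sampled rows ever need to be kept or post-processed, not the
whole stream.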
I have not actually looked at how that would happen, but it would
certainly allow you to process far, far fewer rows than the entire table.

Regards,
Ken
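P.S. On the question quoted above about picking 10,000 random rows from a
stream without knowing in advance how many rows there will be: there is a
standard technique for exactly that, usually called reservoir sampling.
A minimal sketch, again purely for illustration:

    import random

    def reservoir_sample(rows, k=10000):
        # Uniform random sample of k rows from a stream of unknown length.
        sample = []
        for i, row in enumerate(rows):
            if i < k:
                sample.append(row)        # fill the reservoir first
            else:
                j = random.randint(0, i)  # inclusive on both ends
                if j < k:
                    sample[j] = row       # replace an entry with probability k/(i+1)
        return sample

Whether doing that inline actually beats rereading a sample of rows at the
end is, of course, exactly the cost question raised above.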