On Sat, Oct 18, 2014 at 6:01 PM, Tomas Vondra <tv@xxxxxxxx> wrote:
> Hmmm. I have 0 experience with handling patents and related issues. Any
> idea how to address that?

Well, there's no real way to address it. But to summarize:

1) We should not go searching for patents; knowing that something is
patented increases your liability if you end up infringing on it.

2) You should consult your lawyer for advice, which he can give under
attorney-client privilege without exposing you to such liability.

3) The whole patent system is fundamentally broken and serves only to
protect incumbents from innovative competitors :(

Realistically, software patents are so vague, and cover so many basic
algorithms that are obvious to anyone in the field because they're part of
the basic knowledge taught in every algorithms course, that Postgres is
probably infringing on hundreds of patents, all of which would easily be
declared invalid if the owners ever sought to use them against widely used
free software.

That doesn't mean we should go seeking out algorithms known to have been
published in papers from major proprietary database vendors, though. That
just seems like asking for trouble. We should be looking for solutions
that seem obvious to us, were published 17+ years ago, or were published
by people who specifically state they're not patented.

> I think this will be very tricky, and in fact it may make the estimates
> much worse easily, because all the algorithms assume random sampling.

This is where Josh and I agree but come to different conclusions. He wants
to use more of each block so he can take larger samples, in the hope that
will produce better statistics. I want to use more of each block so we can
keep taking rigorously justified sample sizes, but without having to read
such a large portion of the table.

Either way, our current sampling method isn't meeting anyone's needs. It
requires copious amounts of I/O, which makes running ANALYZE after an
upgrade or data load a major contributor to your outage window.

> > For example the ndistinct estimator

Yes, well, the ndistinct estimator just sucks. It's always going to suck.
Unless it has a chance to see every row of the table there's absolutely no
way it's ever going to not suck. Any attempt to estimate ndistinct from a
sample of any size is going to suck. That's just the way it is.

Our sample sizes are based on the size needed to build the histogram. That
requires only a fairly small and slow-growing sample, though our
block-based sampling inflates the amount of I/O that leads to. ndistinct
and MCV are a different kind of problem: they would require a proportional
sample, one that grows linearly as the table grows. A good ndistinct
estimate requires a large enough proportion that you're effectively
reading the whole table. Decent MCV stats require a smaller proportion,
but it's still a proportion that grows linearly, which is going to be
impractical for large tables.

There are two strategies for dealing with ndistinct that I can see here.

Either a) we decide that to estimate ndistinct we need to scan the entire
table, and make that a separate type of ANALYZE that you run less
frequently. Basically this is what we did with VACUUM, and indeed we could
even invest in infrastructure like the FSM that lets you process only the
changed pages, so it can be done incrementally.
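To make (a) a little more concrete, here's roughly the shape I have in
mind, as a pure sketch in Python: keep a small mergeable summary per
block, and rebuild only the summaries for blocks that changed since the
last heavyweight ANALYZE. Every name below is made up, and the particular
summary (a linear-counting bitmap, published back around 1990, so
comfortably past the 17-year mark mentioned above) is just one option;
nothing here is how ANALYZE actually works.

import hashlib
import math

SKETCH_BITS = 1 << 16          # bitmap size; would need tuning per column

def block_sketch(values):
    """Build a small bitmap sketch of the values found in one block."""
    bits = bytearray(SKETCH_BITS // 8)
    for v in values:
        h = int.from_bytes(
            hashlib.blake2b(repr(v).encode(), digest_size=8).digest(), "big")
        i = h % SKETCH_BITS
        bits[i // 8] |= 1 << (i % 8)
    return bits

def merge_sketches(sketches):
    """OR per-block sketches together; only changed blocks need rebuilding."""
    merged = bytearray(SKETCH_BITS // 8)
    for s in sketches:
        for i, b in enumerate(s):
            merged[i] |= b
    return merged

def estimate_ndistinct(sketch):
    """Linear-counting estimate: -m * ln(fraction of unset bits)."""
    zero_bits = sum(8 - bin(b).count("1") for b in sketch)
    if zero_bits == 0:
        return float(SKETCH_BITS)  # saturated; a real version would resize
    return -SKETCH_BITS * math.log(zero_bits / SKETCH_BITS)

The only detail that matters is that per-block summaries can be merged, so
a pass over just the changed pages still yields a whole-table estimate.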
Or b) we decide gathering periodic statistics isn't the way to handle
ndistinct at all, and instead watch the incoming inserts and outgoing
deletes and keep the statistic up to date incrementally. That can be done,
though obviously it's a bit tricky. But there's tons of published research
on updating database statistics incrementally -- which brings us back to
patents...

> I think the 'minimal' stats (when we have just '=' for the type) behaves
> like this, but fixing it by switching to a two-pass approach should not
> be that difficult (but would cost a few more CPU cycles).
>
> Or do you suggest that even the scalar MCV algorithm has this bias
> issue? I doubt that, because the MCV works with an array sorted by
> number of occurrences, so the position within the table is irrelevant.

Hum. I'll have to reconstruct the tests I was doing back then and see what
was going on; I never really understood what was causing it. What I
observed was that as I increased the number of rows read per sampled
block, the MCV list was more and more dominated by rows found early in the
table. This was in a column where the values were actually uniformly
distributed, so the only values in the MCV list should have been ones that
happened to be sampled substantially more often than the mean frequency.
They were integers, though, so they should have been covered by the
sorting approach.

> I don't see why the counting bloom filter would be necessary, in a two
> pass approach?

I guess I was imagining it was not keeping all the values in memory at the
same time. Isn't that the whole point of the lossy '=' algorithm? But now
that I think of it, I don't understand why that algorithm is needed at
all.

-- 
greg
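P.S. By "the lossy '=' algorithm" I mean the one-pass, bounded-memory
frequent-items family (Misra-Gries and friends). Here's a textbook sketch
in Python, just to show the trade-off against an exact two-pass count; it
is not a claim about what the 'minimal' stats code actually does.

from collections import Counter

def lossy_top_values(values, k):
    """One pass, at most k-1 counters in memory (Misra-Gries)."""
    counters = {}
    for v in values:
        if v in counters:
            counters[v] += 1
        elif len(counters) < k - 1:
            counters[v] = 1
        else:
            # No room: decrement everything, drop counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Any value occurring more than n/k times is guaranteed to survive,
    # but its count is an underestimate; a second pass would fix that up.
    return counters

def exact_top_values(values, num_mcv):
    """The obvious two-pass alternative: count everything, keep the top few."""
    return Counter(values).most_common(num_mcv)

The lossy version only buys you bounded memory over an arbitrarily large
input; for a sample that already fits in memory the exact count works
fine, which I suppose is exactly why I don't see why it's needed.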