On Mon, Sep 12, 2016 at 7:03 AM, Vinicius Segalin <vinisegalin@xxxxxxxxx> wrote:
Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely precise). I've been reading some papers about it, and people are using machine learning to do so. For the feature vector, they use what the DBMS's query planner provides, such as operators and their cost. The thing is that I haven't found any work using PostgreSQL, so I'm struggling to adapt it.

My question is if anyone is aware of a work that uses machine learning and PostgreSQL to predict query runtime, or maybe some other method to perform this.
I don't know about machine learning, but if there were some way to get the planner to report predicted cost as a breakdown of how many multiples of each *_cost factor it used (rather than only the grand total it reports now), then it would be fairly easy to combine that with wall times from log_duration and do a simple linear regression.
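To make that concrete, here is a sketch of what the regression would look like if such a breakdown existed. All of the numbers below are invented for illustration; the planner does not actually expose per-factor multiples today, so the design matrix is hypothetical.

```python
import numpy as np

# Hypothetical data: for each logged query, an (imagined) planner
# breakdown of how many multiples of each *_cost factor it charged,
# plus the wall time from log_duration.  All numbers are made up.
# Columns: seq_page_cost, random_page_cost, cpu_tuple_cost,
#          cpu_index_tuple_cost, cpu_operator_cost
multiples = np.array([
    [1200.0,    10.0, 50000.0,   900.0, 150000.0],
    [ 300.0,  4000.0, 12000.0,  8000.0,  30000.0],
    [   8.0,   250.0,   900.0,  1100.0,   2500.0],
    [5000.0,     0.0, 80000.0,     0.0, 200000.0],
    [  40.0,  1800.0,  3000.0,  5000.0,   9000.0],
    [ 700.0,    90.0, 20000.0,   400.0,  60000.0],
])
wall_times_ms = np.array([950.0, 820.0, 45.0, 2100.0, 310.0, 480.0])

# Ordinary least squares: find per-factor costs (in ms per unit) that
# best explain observed wall time as a weighted sum of the multiples.
coefs, _, _, _ = np.linalg.lstsq(multiples, wall_times_ms, rcond=None)

names = ["seq_page", "random_page", "cpu_tuple",
         "cpu_index_tuple", "cpu_operator"]
for name, c in zip(names, coefs):
    print(f"{name}_cost ~ {c:.6f} ms/unit")

# Predicted wall time for a new query's cost breakdown:
new_query = np.array([100.0, 500.0, 7000.0, 600.0, 15000.0])
print(f"predicted wall time: {new_query @ coefs:.1f} ms")
```

The fitted coefficients would then play the role of empirically calibrated *_cost settings, in real time units.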
I suspect the result would be that seq_page_cost and random_page_cost would have huge uncertainties on them. And since pretty much every query has a non-zero predicted value for at least one of those, the huge uncertainties would then pollute all the rest of the fitted values as well. Perhaps that is where the machine learning would come in?
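One simple "machine learning" answer to that pollution is regularization, e.g. ridge regression, which shrinks unstable coefficients instead of letting collinear page-count columns blow up the whole fit. The sketch below uses synthetic data (nothing here comes from a real planner) with deliberately correlated sequential and random page counts:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Invented per-query multiples.  seq and random page counts are made
# strongly correlated, so plain least squares struggles to tell their
# per-unit costs apart.
seq_pages = rng.uniform(0, 5000, n)
random_pages = 0.8 * seq_pages + rng.uniform(0, 200, n)
cpu_ops = rng.uniform(0, 1e5, n)
X = np.column_stack([seq_pages, random_pages, cpu_ops])

true_costs = np.array([0.01, 0.04, 0.0001])   # ms per unit (invented)
y = X @ true_costs + rng.normal(0, 5.0, n)    # noisy wall times

# Ridge: solve (X'X + lam*I) b = X'y.  The penalty lam trades a little
# bias for much lower variance on the correlated columns.
lam = 10.0
I = np.eye(X.shape[1])
ridge_coefs = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

ols_coefs = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS:  ", ols_coefs)
print("ridge:", ridge_coefs)
```

Cross-validating lam against held-out queries would be the natural next step, and fancier models (the papers mostly use tree ensembles or SVMs) are just further down the same road.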
Another issue is that the predicted costs are only meant to choose between different plans, not to predict overall wall time. Some parts of the planner only have one way to do something, and so it doesn't bother to compute a cost for that, as there is no choice to be made. This would leave glaring holes in the estimates (particularly for updates).
But to get that data out would require quite a bit of tedious altering of the planner code, and then you would have to find people willing to run that altered code on real world databases with a high level of logging to gather the data. (I suspect that gathering data from only toy databases would not be very useful).
Cheers,
Jeff