Re: Planner cost adjustments

Daniel Begin <jfd553@xxxxxxxxxxx> · Fri, 19 Jun 2015 12:05:29 -0400

Here is a last follow-up on adjusting the planner costs calculation 

1 - Francisco, you provide me with a clue I was looking for at this point!

I learned enough PLpgsql over the last week to reprogram all my processes (not always elegant but it works). By processing individual records using loop, I will be able to move forward with my work and save individual results as they are processed. 

Furthermore, it is now easier to predict jobs completion times, since the DB uses similar plans for each record. I did a test on a stratified sampling of records and I expect the jobs will run for about two weeks (contrarily to months in previous estimates!-)

2 - PLpgsql helped me to bypass the problem. As PT suggested, I will try to setup a test case that demonstrates the problem and post it to the developers' list so they might figure out what to do in such situation.

Thanks to all,
Daniel

-----Original Message-----
From: pgsql-general-owner@xxxxxxxxxxxxxx [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Francisco Olarte
Sent: June-11-15 10:05
To: Daniel Begin
Cc: Bill Moran; pgsql-general@xxxxxxxxxxxxxx; Tomas Vondra; Melvin Davidson
Subject: Re:  Planner cost adjustments

Hi Daniel:

On Thu, Jun 11, 2015 at 2:38 PM, Daniel Begin <jfd553@xxxxxxxxxxx> wrote:
.....
> The remaining problem seems related to the statistics of some large tables.
> On one hand, I might increase the statistic target for these tables to 
> 500, or even to 1000 and look at the results (but I have doubts it 
> will help). On the other hand, I could simply force enable_seqscan to 
> OFF for queries dealing with them.
>
> Still not sure about the best solution but the problem is definitely 
> narrower :-)

One somehow unrelated point. IIRC your problems where related to queries doing joins with selected sets of indexed IDs on a smaller table which then have to be looked up on some very large tables. ( I'm not able to follow too well which is which, users on changesets, I'm a bit lost ). Given your runtimes are always high ( in the seconds range, so it seems wire speed / latencies are not too much of an issue
) and that selectivity estimates on huge tables are always problematic and may be thwarting your plans you may be able to get faster results splitting your query.

 If I read your plans correctly, that would be selecting your 600 users in one query and then preparing the changeset query for a single user_id, which should be indexed, and looping it for every user. Given current machines can easily send-receive 600 queries in a second it may lead to a simpler solution. This mean you're using the DB as a somehow inteligent plain old indexed file, but sometimes this is the simpler approach ( heck, some of my code uses algorithms from the tape era as they were the faster way I could do it ).

I needed to do this in one of my programs, the optimizer kept selecting bad plans so I did the first query, held the results in memory, and then prepared and executed in a loop from the app, my query was selecting a couple thousand values from submillion table, and joining with a multimillion one, getting a couple hundreds matches per original value. Splitting it made the queries on the big table always run indexed and fast ( and as a side bonus avoided duplicating the parts of the first record in the wire a hundred times, which was nice since the short table was wide and I only needed 3 short fields from the second one, and that made the first query run at wire speed and the second at disk speed ).

Francisco Olarte.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general