On Mon, 6 Jan 2020 at 17:38, Israel Brewster <ijbrewster@xxxxxxxxxx> wrote:
Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn’t make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming CPU bound updating. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.
It sounds like you're in a decent place on this, and that you have done a pretty apropos amount of exploration of the matter.
I was pleased to hear that you have the idempotency of the updates well in hand, and that the application can cope with the degree of out-of-sync that things will temporarily be.
The estimate of 10h to update the data doesn't surprise me; that's long enough that it sure seems tempting to do the work in pieces so that you don't have your whole set of application data locked for 10h.
I'd be inclined to call this "enough attention" for a one-off event.
I'll poke at the trigger aspect a wee bit; if the trigger function does a one-tuple-at-a-time handling of things, so that it fires 50M times, you might get a substantial speedup by replacing that with an equivalent set operation that processes a few thousand tuples at a time. That said, if you're happy with the process running 10h, it's not worth unpeeling the extra testing needed to ensure identical end states.
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
question, "How would the Lone Ranger handle this?"