Re: UPDATE many records

Justin <zzzzz.graf@xxxxxxxxx> · Mon, 6 Jan 2020 21:28:08 -0500

system monitor i like is glances,  have to have python installed to run,  it can run in command console or present a web interface.  its a very nice to  get a quick detail few what is going on
https://nicolargo.github.io/glances/

just monitoring the system CPU utilization does not give enough information to state if the system is CPU bound or IO bound or network IO bound.

on simple selects assuming the data is in shared buffers its trivial event,  if the select has to go to disk to get the data then its not trivial especially if its a big table.  One thing that can happen is the update and the selects in the trigger are pushing data in and out of shared memory causing the disk to thrash which explain does not report.

 Explain does not peer into function, triggers, etc   select myfunction(),  column list, from mytable.  or update mytable set column = myfunction() ,  it will not report what happened inside just the total time it took to run.
looking at the explain the Update only took 0.6 second while the trigger took 6.5

rotate_tilt_data: time=6560.401 calls=9299

I'm a little confused is this trigger function going to be around after this update?  If yes it should be worth looking into

If its a one time run or every 12 months who cares,  Start the update on friday night,  go in on Saturday to check it

On Mon, Jan 6, 2020 at 5:38 PM Israel Brewster <ijbrewster@xxxxxxxxxx> wrote:

On Jan 6, 2020, at 12:49 PM, Justin <zzzzz.graf@xxxxxxxxx> wrote:

What was the HD wait time ?  What tool is being use to monitor the server resources??

No idea on the HD wait time - how would I best monitor that? That said, this machine does have NVMe drives, so the speed should be fairly high/wait time fairly low. It’s also running as a VM, which could affect things, but is a bit of a moot point as far as this update goes. As far as monitoring server resources, I was just using top.

It appears based on this information there is allot more going on than a simple Update command

Depending on your definition of “simple update” of course, very true. As I stated in the original message, the actual update value is the result of a function. The psql command is a simple update, but the function does a bit of stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about .7 ms of stuff per record, which of course is most of the runtime. It is entirely possible that the function could be optimized to run more quickly.

Moving code out of the trigger  probably not  going to improve performance, unless there is allot of code  that does not need to be processed for this update or code touching other tables

One SELECT query on another table to get some values I need to use for the calculation. No code that is not needed for the update. Given the nature of this bulk update, I *could* make a separate function that simply takes those values as parameters, since the same value will be applied to a lot of records. I’d just have to be careful about how I applied the update, so rows get processed with the correct values. I’m not convinced it would be worth it though - might shave a few hours off the total execution time (assuming that SELECT is expensive - EXPLAIN ANLYZE shows an index scan, on a table with only 12,761 rows, which seems to be about as simple as it gets), but I doubt it would be enough for me to feel comfortable simply running the update as one monolithic unit.

Study the trigger identify what has to run, pull that code out, then disable the trigger.    Move the necessary code to a new function for Updating.. 

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn’t make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming CPU bound updating. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster@xxxxxxxxxx> wrote:

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@xxxxxxxxxxx> wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how your identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.
One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145