Re: Disk wait problem...

pf@xxxxxxxxxxx · Thu, 26 Oct 2023 15:50:16 -0400

Hi Peter,

All of the following is based on using SQL_workbench/J (WB)  (Mageia Linux
has not provided a viable pgAdmin4 package); WB is setup to autoload the
table row count and 10 rows. I'm sticking to one set of files where they
are all roughly 33.6M rows.

I've been doing a lot of digging and found more interesting info...

At first, I thought the disk was in disk wait the entire time of the
delay; this _appeared_ to be the case since the system journal logged
everything about each query, with every query looking the same; no errors.

I was going to provide some logs; but suffice it to say that the logs,
combined with running iostat every second has been enlightening...

Every time I switch from one table to another, sometimes the results
display in:
* less than a half-second
* one or more seconds
* up to 5+ minutes.
So far, none has failed to eventually return; just l..o..n..g delays...

This morning, I've been using iostat:
$ while true; do echo -n "$(date +%H:%M:%S)  "; iostat /dev/nvme1n1 |
grep nvme1n1; sleep 1; done

When the response is fast; I see iostat report ~225000 bytes read. On the
other hand when the query takes a long time to respond, I see the read
bytes spread over time...  a little every second...  So while htop
appears to report a steady D (disk wait) status, I now suspect the short
transfers are fast and most of the time, htop is going to catch the D
state.

It definitely appears to be a problem with the 4TB NVMe SSD drive which
utilities say it still has 95% life left[1].  The drive is reporting:
Error Information Log Entries:      1,177
which is growing, so I'll be digging into that.  It also reports:

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       1177     0  0xc002  0x4004      -            0     1     -  Invalid Field in Command
  1       1176     0  0xb001  0x4004      -            0     1     -  Invalid Field in Command
  2       1175     0  0x801a  0x4004      -            0     1     -  Invalid Field in Command
...
This list is not growing, so I'll look into it later than sooner.

A few moments ago, I decided to switch to the backup which resides on a
17TB mechanical platter drive which is running great.

[1] SSD drives wear out with usage; think of it as a vehicle tire which
slowly wears out.

Also found that there is a firmware update available from:
https://www.seagate.com/support/downloads/

No idea if this will solve the problem; but...

Thanks for your suggestions.

Thought I saw there may be instrumentation within PG which I can access;
suggestions welcome if so...

Cheers,
Pierre

On Thu, 26 Oct 2023 11:50:00 +0200 Peter J. Holzer wrote:

>On 2023-10-23 16:31:30 -0700, Adrian Klaver wrote:
>> On 10/23/23 14:55, pf@xxxxxxxxxxx wrote:  
>> > Next, I made a query list for 30 tables, turned on timing and pasted the
>> > queries.  Pasting results inline:
>> > ncsbe=# select count(*) from ncvhis_2012_12_06;
>> > select count(*) from ncvhis_2016_12_03;
>> > select count(*) from ncvhis_2020_08_01;
>> > select count(*) from ncvhis_2020_10_31;  
>[...]
>> > Time: 72404.786 ms (01:12.405)
>> > Time: 301.246 ms
>> > Time: 409.974 ms
>> > Time: 363.836 ms
>> > Time: 351.167 ms
>> > Time: 348.378 ms
>> > Time: 348.712 ms
>> > Time: 351.901 ms
>> > Time: 487.837 ms
>> > Time: 40987.826 ms (00:40.988)
>> > Time: 76964.281 ms (01:16.964)
>> > Time: 483.329 ms
>> > Time: 18919.267 ms (00:18.919)
>> > Time: 50896.978 ms (00:50.897)
>> > Time: 25784.000 ms (00:25.784)
>> > Time: 45407.599 ms (00:45.408)
>> > Time: 75171.609 ms (01:15.172)
>> > Time: 88871.004 ms (01:28.871)
>> > Time: 128565.127 ms (02:08.565)
>> > Time: 210248.222 ms (03:30.248)
>> > Time: 246609.561 ms (04:06.610)   4 minutes!  ;p
>> > Time: 166406.283 ms (02:46.406)
>> > Time: 291483.538 ms (04:51.484)  nearly 5 minutes ;p ;p
>> > Time: 2269.961 ms (00:02.270)
>> > Time: 413.219 ms
>> > Time: 433.817 ms
>> > Time: 395.844 ms
>> > Time: 7251.637 ms (00:07.252)
>> > Time: 384.328 ms
>> > Time: 384.887 ms
>> > 
>> > I don't see a pattern in the above; matches the randomness I saw using
>> > WB...  
>
>It would be interesting to see whether there's a difference in plan
>between the slow and the fast queries.
>
>If your experiment above is repeatable, you can just prefix every
>query with 
>    explain (analyze, buffers)
>
>
>explain (analyze, buffers) select count(*) from ncvhis_2016_12_03;
>explain (analyze, buffers) select count(*) from ncvhis_2020_08_01;
>explain (analyze, buffers) select count(*) from ncvhis_2020_10_31;
>...
>
>If it is hard to reproduce (happens only sometimes), the best way to
>catch is probably to enable auto_explain
><https://www.postgresql.org/docs/current/auto-explain.html>, possibly
>with the auto_explain.log_analyze option and check the logs after it
>happened. (auto_explain is generally a good idea, IMHO).
>
>
>> > Before the above finished, I issued this command on another konsole...
>> > 
>> > $ while true; do ls -l > /tmp/ll; date; done  
>
>This is unlikely to generate noticeable disk waits. The current
>directory will be in the cache after the first ls and the writes happen
>asynchroneously.
>
>        hp
>