Re: Disk wait problem...

On 10/23/23 14:55, pf@xxxxxxxxxxx wrote:
Please reply to the list also. CCing the list for this post.

On Mon, 23 Oct 2023 12:44:56 -0700 Adrian Klaver wrote:

On 10/23/23 11:54, pf@xxxxxxxxxxx wrote:
Hi,

I have a 1.6TB database with over 330 tables on a 4TB NVMe SSD. All
tables are static (no updates); most have either ~8M or ~33M rows. Queries
had been great until recently.

Also attached are the relevant system journal entries for one query that
took 2 seconds and two more that took 62 and 52 seconds... Unlike the
above, these had 3 processes in disk-wait -- see the attached screenshot...

What happens if you select the tables in the command-line client psql?

Hi Adrian,

Sorry about everything inline; I think it saves time vs dealing with
attachments...

Summary:
* single large partition dedicated to PG
* only PG seems affected
* buffered reads are 6x slower when PG queries are in progress
* psql queries of 30 of the ~33M-row tables show random delays of up to 4m51s
* unaffected psql queries finish in under 500ms
* an ls -l loop on another directory had no delays.

HTH...
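
For the next time it happens, here's roughly what I plan to run to catch
the stuck processes in the act (just a sketch; the psql part assumes a PG
version with the wait_event columns in pg_stat_activity, i.e. 9.6+):

$ # any processes in uninterruptible disk sleep (state D)?
$ ps -eo pid,state,wchan:30,cmd | awk '$2 == "D"'

$ # what are the PG backends waiting on?
$ psql -d ncsbe -c "SELECT pid, state, wait_event_type, wait_event, \
    left(query, 60) FROM pg_stat_activity WHERE state <> 'idle';"

A backend stuck on reads should show wait_event_type = 'IO' there.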

The WB query was in disk-wait, so psql gave:
ncsbe=# select count(*) from ncvhis_2022_10_29;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

The above query finished a second or two before the WB results
appeared; then this next query succeeded right after:
ncsbe=# select count(*) from ncvhis_2022_10_29;
   count
----------
  33861176
(1 row)

Another table went into disk-wait long enough that I was able to query
the tmp directory on the same disk just fine.
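
Next time one sticks, I'll also grab the stuck backend's kernel-side I/O
counters (<PID> here is a placeholder for whatever pid top or
pg_stat_activity reports):

$ cat /proc/<PID>/io         # if read_bytes stops climbing, it's truly stalled on the device
$ sudo cat /proc/<PID>/stack # where in the kernel it's blocked (needs root)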

Querying the same table again; it hung with no error, so I canceled it:
ncsbe=# select count(*) from ncvhis_2022_07_23;
^CCancel request sent
ERROR:  canceling statement due to user request

While this table was still in disk-wait, I tried another table, which
returned quickly:
ncsbe=# select count(*) from ncvhis_2023_10_21;
   count
----------
  32983343
(1 row)

Eventually, the WB query did complete...

Next, I made a query list for 30 tables, turned on timing, and pasted the
queries. Results inline:
ncsbe=# select count(*) from ncvhis_2012_12_06;
select count(*) from ncvhis_2016_12_03;
select count(*) from ncvhis_2020_08_01;
select count(*) from ncvhis_2020_10_31;
select count(*) from ncvhis_2020_12_05;
select count(*) from ncvhis_2020_12_26;
select count(*) from ncvhis_2021_03_06;
select count(*) from ncvhis_2021_06_12;
select count(*) from ncvhis_2022_07_23;
select count(*) from ncvhis_2022_10_15;
select count(*) from ncvhis_2022_10_22;
select count(*) from ncvhis_2022_10_29;
select count(*) from ncvhis_2022_11_05;
select count(*) from ncvhis_2022_11_12;
select count(*) from ncvhis_2022_11_19;
select count(*) from ncvhis_2022_11_26;
select count(*) from ncvhis_2022_12_03;
select count(*) from ncvhis_2022_12_10;
select count(*) from ncvhis_2022_12_17;
select count(*) from ncvhis_2022_12_24;
select count(*) from ncvhis_2022_12_31;
select count(*) from ncvhis_2023_01_08;
select count(*) from ncvhis_2023_01_14;
select count(*) from ncvhis_2023_01_21;
select count(*) from ncvhis_2023_01_28;
select count(*) from ncvhis_2023_02_04;
select count(*) from ncvhis_2023_02_11;
select count(*) from ncvhis_2023_02_18;
select count(*) from ncvhis_2023_02_27;
select count(*) from ncvhis_2023_03_04;
   count
----------
  31923950
(1 row)

Time: 72404.786 ms (01:12.405)
   count
----------
  29701168
(1 row)

Time: 301.246 ms
   count
----------
  32172845
(1 row)

Time: 409.974 ms
   count
----------
  32162748
(1 row)

Time: 363.836 ms
   count
----------
  34679651
(1 row)

Time: 351.167 ms
   count
----------
  35006711
(1 row)

Time: 348.378 ms
   count
----------
  35003995
(1 row)

Time: 348.712 ms
   count
----------
  34994502
(1 row)

Time: 351.901 ms
   count
----------
  33686292
(1 row)

Time: 487.837 ms
   count
----------
  33861658
(1 row)

Time: 40987.826 ms (00:40.988)
   count
----------
  33861381
(1 row)

Time: 76964.281 ms (01:16.964)
   count
----------
  33861176
(1 row)

Time: 483.329 ms
   count
----------
  33861071
(1 row)

Time: 18919.267 ms (00:18.919)
   count
----------
  29344354
(1 row)

Time: 50896.978 ms (00:50.897)
   count
----------
  30709227
(1 row)

Time: 25784.000 ms (00:25.784)
   count
----------
  32368001
(1 row)

Time: 45407.599 ms (00:45.408)
   count
----------
  33126826
(1 row)

Time: 75171.609 ms (01:15.172)
  count
----------
  33126808
(1 row)

Time: 88871.004 ms (01:28.871)
   count
----------
  33127317
(1 row)

Time: 128565.127 ms (02:08.565)
  count
----------
  33127192
(1 row)

Time: 210248.222 ms (03:30.248)
   count
----------
  33127117
(1 row)

Time: 246609.561 ms (04:06.610)   4 minutes!  ;p
   count
----------
  33124457
(1 row)

Time: 166406.283 ms (02:46.406)
   count
----------
  33123714
(1 row)

Time: 291483.538 ms (04:51.484)  nearly 5 minutes ;p ;p
   count
----------
  33123313
(1 row)

Time: 2269.961 ms (00:02.270)
   count
----------
  33123197
(1 row)

Time: 413.219 ms
   count
----------
  33123091
(1 row)

Time: 433.817 ms
   count
----------
  33122953
(1 row)

Time: 395.844 ms
   count
----------
  33122752
(1 row)

Time: 7251.637 ms (00:07.252)
   count
----------
  33122665
(1 row)

Time: 384.328 ms
   count
----------
  33122573
(1 row)

Time: 384.887 ms
ncsbe=#

I don't see a pattern in the above; it matches the randomness I saw using
WB...
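
If it would help, I can rerun one of the slow counts with buffer stats, to
show whether the time really goes to disk reads; something like:

ncsbe=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ncvhis_2022_10_29;

A "shared read=..." figure in that output would mean the blocks came from
disk rather than from shared_buffers.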

Before the above finished, I ran this command in another konsole...

$ while true; do ls -l > /tmp/ll; date; done
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
Mon Oct 23 05:12:54 PM EDT 2023
...
Lots of high-speed output with no pauses, at about 64 loops per second.

So it looks like only PG is hitting disk waits.
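
To be fair, that ls loop is almost certainly served from the page cache, so
as a sanity check I could also read the raw device with the cache bypassed
(direct I/O):

$ # read 1 GiB straight off the device, bypassing the page cache (as root)
$ dd if=/dev/nvme1n1 of=/dev/null bs=1M count=1024 iflag=direct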

This disk is dedicated to PG and a rarely used semi-permanent tmp
directory on a single partition.

This is from fdisk -l:
Disk /dev/nvme1n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: Seagate FireCuda 530 ZP4000GM30023
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 9B3C5D73-6FFD-475D-9E1E-26F43B41C690

Device         Start        End    Sectors  Size Type
/dev/nvme1n1p1  2048 7814035455 7814033408  3.6T Linux filesystem

$ hdparm -Tt /dev/nvme1n1
/dev/nvme1n1:
  Timing cached reads:   36820 MB in  2.00 seconds = 18434.72 MB/sec
  Timing buffered disk reads: 774 MB in  3.00 seconds = 257.74 MB/sec

The 30 queries still hadn't finished at this point...

Interesting: the hdparm buffered read is very slow; once the queries
finished, disk performance was back:
$ hdparm -Tt /dev/nvme1n1

/dev/nvme1n1:
  Timing cached reads:   37382 MB in  2.00 seconds = 18715.85 MB/sec
  Timing buffered disk reads: 5002 MB in  3.17 seconds = 1575.89 MB/sec

Over 6x faster...
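
Next time, I'll also leave per-device stats running while the slow queries
execute (assuming the sysstat tools are installed):

$ # r_await = per-read latency in ms; %util near 100 with low MB/s points at the device
$ iostat -x 1 /dev/nvme1n1
$ # per-process disk reads, to confirm it's the postgres backends
$ pidstat -d 1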

So far, the only disk waits I see are PG-related. Do you have any opinion
on Ken's comment about parallel queries?
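
Re Ken's suggestion, I suppose I can test it directly by turning
parallelism off for a session and rerunning one of the slow counts,
something like:

ncsbe=# SET max_parallel_workers_per_gather = 0;
ncsbe=# SELECT count(*) FROM ncvhis_2022_10_29;
ncsbe=# RESET max_parallel_workers_per_gather;

If the stalls disappear with that setting, parallel workers are implicated.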

Thanks,
Pierre



I just stopped PG, refreshed the full backup that was done overnight (so
the update took about a second), restarted PG, and the problem persists...
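
One more thing I plan to rule out is the drive itself -- health and thermal
throttling -- assuming smartmontools and nvme-cli are installed:

$ smartctl -a /dev/nvme1n1      # overall health, media errors
$ nvme smart-log /dev/nvme1n1   # temperature and throttling counters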

Any suggestions where to look next?

Thanks,
Pierre


--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx





