Are there any devs, experts in this area, that can chime in? I can provide more information (see the start of array details). Was the array assembled badly? This is the only change from when it worked well. FYI: About a week ago, when the array was degraded, the machine crashed. The array was revived (--assemble --force) and looked good (clean fsck). Later the missing disk was added. No errors. On 10/11/2023 00.16, Roger Heflin wrote:
The issue he is asking about is why does the md127 device have a w_await time of 1473.96ms when the highest response time on the underlying device is 15.94ms. %util of all devices is low. Something has to be going on inside the kernel to cause there to be that sort of a difference. He is also doing about 3 io/second which is significantly under the capacity of a spinning disk. If a disk was having issues with bad sectors/failed sectors/silent retries then the %util/awaits should show that. So between when the write gets to the md device and when the io gets to the physical device there are 1000+ ms (on average). And this is a typical sample that is being seen when the array is not functioning fast, and this issue clears (all by itself) some minutes or hours later and continues as if nothing happened.
I think that this is the crux of the problem. I keep a log of iostat and meminfo and I see this happening all the time. I understand that writing a tree full of many, very small files is an issue for ext4 but at the level of the cache we are past this, we simply have a list of blocks that need to be written out. The cache is supposed to offer an opportunity to coalesce small writes, and since the whole tree is small I expect it to end up with a reasonably small number of blocks. Q1 >>>>> Do I have a wrong setting for some queuing? Maybe some rate limiting? One caveat: the whole fs was copied from a 20TB array to a new 60TB array 5 years ago, and now has 10TB free, so there was activity all the time (mostly large mythtv recordings) and free space is now probably somewhat fragmented. Below is one example run. Q2 >>>>> Does it look like a problem with the: cache? md? disks? Q3 >>>>> Look at the meminfo log. The last 12MB (very end of this posting) were written out in 1 second. So the system CAN do this when it wants to. WHAT GIVES?? The source ========== $ sudo find /data2/no-backup/old-backups/tapes/12/file.4.data |wc -l 19533 $ sudo du -sm /data2/no-backup/old-backups/tapes/12/file.4.data 321 /data2/no-backup/old-backups/tapes/12/file.4.data The target ========== $ sudo find /data1/no-backup/old-backups/tapes/12/file.4.data |wc -l 19533 $ sudo du -sm /data1/no-backup/old-backups/tapes/12/file.4.data 245 /data1/no-backup/old-backups/tapes/12/file.4.data The copy ======== $ sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/12/file.4.data /data1/no-backup/old-backups/tapes-again/12/ ## I expect some options are probably not required Number of files: 23,208 (reg: 21,586, dir: 1,617, link: 5) Number of created files: 23,208 (reg: 21,586, dir: 1,617, link: 5) Number of deleted files: 0 Number of regular files transferred: 21,453 Total file size: 284,010,162 bytes Total transferred file size: 283,755,872 bytes Literal data: 283,755,872 bytes Matched data: 0 bytes File list size: 131,067 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 284,848,197 Total bytes received: 430,683 sent 284,848,197 bytes received 430,683 bytes 38,037,184.00 bytes/sec total size is 284,010,162 speedup is 1.00 ## The copy took only a few of seconds. Draining the dirty blocks. It took just under 3 hours! Zooming into the data: 2023-11-13 11:16:11 Dirty: 12012 kB Buffers: 178292 kB MemFree: 756872 kB 2023-11-13 11:16:12 Dirty: 12008 kB Buffers: 178292 kB MemFree: 760332 kB 2023-11-13 11:16:13 Dirty: 12 kB Buffers: 178300 kB MemFree: 756152 kB meminfo ======= 2023-11-13 08:27:15 Dirty: 32 kB Buffers: 1029000 kB MemFree: 870192 kB 2023-11-13 08:27:25 Dirty: 140448 kB Buffers: 1039564 kB MemFree: 674356 kB 2023-11-13 08:27:35 Dirty: 327736 kB Buffers: 1049404 kB MemFree: 613680 kB 2023-11-13 08:27:45 Dirty: 325024 kB Buffers: 1049436 kB MemFree: 643332 kB 2023-11-13 08:27:55 Dirty: 324972 kB Buffers: 1049460 kB MemFree: 655180 kB ... 2023-11-13 11:15:49 Dirty: 12916 kB Buffers: 178196 kB MemFree: 725476 kB 2023-11-13 11:15:59 Dirty: 12824 kB Buffers: 178240 kB MemFree: 725776 kB 2023-11-13 11:16:09 Dirty: 12016 kB Buffers: 178292 kB MemFree: 758076 kB 2023-11-13 11:16:19 Dirty: 116 kB Buffers: 178316 kB MemFree: 752152 kB Since the start of the copy, a kworker flush thread is running: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2351470 root 20 0 0 0 0 R 100.0 0.0 73:28.06 kworker/u16:1+flush-9:127 disks activity ============== Activity on the array, low wkB/s, high w_await On a clear day the array tops 600MB/s Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util 08:32:49 md127 0.10 0.40 0.00 0.00 18.00 4.00 1.70 9.60 0.00 0.00 1316.47 5.65 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.24 6.62 08:32:59 md127 0.00 0.00 0.00 0.00 0.00 0.00 3.00 63.60 0.00 0.00 1280.20 21.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.84 5.32 08:33:09 md127 0.00 0.00 0.00 0.00 0.00 0.00 0.60 2.40 0.00 0.00 17.33 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 11.60 08:33:19 md127 0.00 0.00 0.00 0.00 0.00 0.00 3.40 35.20 0.00 0.00 5774.44 10.35 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 19.63 10.00 08:33:29 md127 0.20 0.80 0.00 0.00 15.00 4.00 25.30 320.40 0.00 0.00 3751.42 12.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 94.91 6.70 08:33:39 md127 0.00 0.00 0.00 0.00 0.00 0.00 2.70 10.80 0.00 0.00 803.52 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.17 4.49 08:33:49 md127 0.00 0.00 0.00 0.00 0.00 0.00 2.70 10.80 0.00 0.00 1415.26 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.82 10.47 08:33:59 md127 0.40 1.60 0.00 0.00 9.75 4.00 3.30 70.80 0.00 0.00 1156.67 21.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.82 9.92 08:34:09 md127 0.00 0.00 0.00 0.00 0.00 0.00 3.80 97.20 0.00 0.00 1175.97 25.58 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.47 8.48 Activity on a member (all 7 are similar), low wkB/s, low w_await On a clear day each member tops 200MB/s Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util 08:32:49 sdb 0.00 0.00 0.00 0.00 0.00 0.00 1.00 2.20 0.00 0.00 2.20 2.20 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.90 0.00 0.32 08:32:59 sdb 0.40 42.40 10.20 96.23 4.00 106.00 1.90 46.20 10.20 84.30 4.47 24.32 0.00 0.00 0.00 0.00 0.00 0.00 1.40 4.29 0.02 0.75 08:33:09 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.20 0.40 0.00 0.00 0.50 2.00 0.00 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.00 0.03 08:33:19 sdb 0.10 30.80 7.60 98.70 19.00 308.00 0.80 32.10 7.60 90.48 6.62 40.12 0.00 0.00 0.00 0.00 0.00 0.00 0.70 7.29 0.01 0.60 08:33:29 sdb 0.10 3.20 0.70 87.50 0.00 32.00 2.20 85.00 19.60 89.91 3.05 38.64 0.00 0.00 0.00 0.00 0.00 0.00 1.40 2.86 0.01 0.69 08:33:39 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.70 1.70 0.00 0.00 1.00 2.43 0.00 0.00 0.00 0.00 0.00 0.00 0.70 0.86 0.00 0.12 08:33:49 sdb 0.00 0.00 0.00 0.00 0.00 0.00 1.40 3.40 0.00 0.00 2.79 2.43 0.00 0.00 0.00 0.00 0.00 0.00 1.30 2.54 0.01 0.47 08:33:59 sdb 0.30 58.80 14.40 97.96 11.67 196.00 1.30 61.30 14.40 91.72 3.15 47.15 0.00 0.00 0.00 0.00 0.00 0.00 1.00 2.10 0.01 0.51 08:34:09 sdb 0.10 7.20 1.70 94.44 4.00 72.00 1.50 4.20 0.10 6.25 5.13 2.80 0.00 0.00 0.00 0.00 0.00 0.00 1.40 5.07 0.02 0.89 Here is the iostat -x covering the last second. A good burp! Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util 11:16:20 md127 0.00 0.00 0.00 0.00 0.00 0.00 300.90 1258.00 0.00 0.00 515.35 4.18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 155.07 11.13 11:16:20 sdb 8.40 298.40 66.20 88.74 131.21 35.52 15.80 435.35 93.50 85.54 32.25 27.55 0.00 0.00 0.00 0.00 0.00 0.00 1.30 7.38 1.62 2.92 11:16:20 sdc 9.80 338.80 74.90 88.43 126.00 34.57 19.70 598.55 130.40 86.88 50.57 30.38 0.00 0.00 0.00 0.00 0.00 0.00 1.30 15.92 2.25 4.05 11:16:20 sdd 12.60 425.60 93.80 88.16 249.92 33.78 19.90 640.95 140.80 87.62 56.86 32.21 0.00 0.00 0.00 0.00 0.00 0.00 1.30 24.15 4.31 5.44 11:16:20 sde 12.60 368.80 79.60 86.33 211.07 29.27 17.80 475.35 101.50 85.08 99.42 26.71 0.00 0.00 0.00 0.00 0.00 0.00 1.30 24.38 4.46 6.38 11:16:20 sdf 8.30 303.60 67.60 89.06 129.25 36.58 10.90 329.35 71.90 86.84 36.05 30.22 0.00 0.00 0.00 0.00 0.00 0.00 1.30 18.38 1.49 6.32 11:16:20 sdg 9.50 256.80 54.70 85.20 108.25 27.03 13.60 300.95 62.10 82.03 13.79 22.13 0.00 0.00 0.00 0.00 0.00 0.00 1.30 8.00 1.23 4.73 11:16:20 sdh 9.20 232.40 48.90 84.17 241.24 25.26 16.80 247.35 45.50 73.03 65.06 14.72 0.00 0.00 0.00 0.00 0.00 0.00 1.30 18.38 3.34 5.86 TIA -- Eyal at Home (eyal@xxxxxxxxxxxxxx)