Dear Andre, dear Dave,
Thank you for your replies.
On 11.07.24 at 13:23, Andre Noll wrote:
> On Thu, Jul 11, 09:12, Dave Chinner wrote:
> > > Of course it’s not reproducible, but any insight how to debug this next time
> > > is much welcomed.
> > Probably not a lot you can do short of reconfiguring your RAID6
> > storage devices to handle small IOs better. However, in general,
> > RAID6 /always sucks/ for small IOs, and the only way to fix this
> > problem is to use high performance SSDs to give you a massive excess
> > of write bandwidth to burn on write amplification....
> FWIW, our approach to mitigate the write amplification suckage of large
> HDD-backed raid6 arrays for small I/Os is to set up a bcache device
> by combining such arrays with two small SSDs (configured as raid1).
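
For anyone wanting to try the same, a minimal sketch of such a setup could
look like the following; the device names are placeholders, not our actual
configuration:

# /dev/md0 = HDD-backed RAID6 array (backing device),
# /dev/md1 = RAID1 of two small SSDs (cache device).
make-bcache -B /dev/md0 -C /dev/md1   # formatting both at once also attaches them
# udev normally registers the devices; otherwise register them manually:
echo /dev/md0 > /sys/fs/bcache/register
echo /dev/md1 > /sys/fs/bcache/register
mkfs.xfs /dev/bcache0                 # the combined device shows up as /dev/bcache0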
Now that file servers with software RAID are proliferating in our institute,
because old systems with battery-backed hardware RAID controllers are being
taken offline, we have noticed performance problems. (We have not found a
silver bullet yet.) My colleague Donald tested bcache in March, but because of
the slightly more complex setup, another colleague is currently experimenting
with a write journal for the software RAID.
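
A sketch of such a write journal setup, with placeholder device names and not
our exact commands, would be:

# Create a RAID6 array whose stripe writes first go to a journal on fast
# flash storage. This closes the RAID write hole; switching
# /sys/block/md0/md/journal_mode from write-through to write-back additionally
# lets the journal device absorb small writes.
mdadm --create /dev/md0 --level=6 --raid-devices=12 /dev/sd[a-l]1 \
      --write-journal /dev/nvme0n1p1
mkfs.xfs /dev/md0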
Kind regards,
Paul
PS: *bcache* performance test:
time bash -c '(cd /jbod/MG002/scratch/x && for i in $(seq -w 1000);
do echo a > data.$i; done)'
| setting                                              | time/s (run 1) | time/s (run 2) | time/s (run 3) |
|-------------------------------------------------------|----------------|----------------|----------------|
| xfs/raid6                                             | 40.826         | 41.638         | 44.685         |
| bcache/xfs/raid6 mode none                            | 32.642         | 29.274         | 27.491         |
| bcache/xfs/raid6 mode writethrough                    | 27.028         | 31.754         | 28.884         |
| bcache/xfs/raid6 mode writearound                     | 24.526         | 30.808         | 28.940         |
| bcache/xfs/raid6 mode writeback                       |  5.795         |  6.456         |  7.230         |
| bcachefs 10+2                                         | 10.321         | 11.832         | 12.671         |
| bcachefs 10+2+nvme (writeback)                        |  9.026         |  8.676         |  8.619         |
| xfs/raid6 (12*100GB)                                  | 32.446         | 25.583         | 24.007         |
| xfs/raid5 (12*100GB)                                  | 27.934         | 23.705         | 22.558         |
| xfs/bcache(10*raid6,2*raid1 cache) writethrough       | 56.240         | 47.997         | 45.321         |
| xfs/bcache(10*raid6,2*raid1 cache) writeback          | 82.230         | 85.779         | 85.814         |
| xfs/bcache(10*raid6,2*raid1 cache(ssd)) writethrough  | 26.459         | 23.631         | 23.586         |
| xfs/bcache(10*raid6,2*raid1 cache(ssd)) writeback     |  7.729         |  7.073         |  6.958         |
| as above with sequential_cutoff=0                     |  6.397         |  6.826         |  6.759         |
`sequential_cutoff=0` significantly speeds up extracting
`node-v20.11.0.tar.gz` with `tar xf`, from 13m45.108s to 5m31.379s! Maybe the
sequential cutoff heuristic does not work well over NFS.
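
In case it helps others: both knobs are ordinary bcache sysfs attributes and,
assuming the device shows up as bcache0, can be switched at runtime like this:

# Select the cache mode (none, writethrough, writearound or writeback).
echo writeback > /sys/block/bcache0/bcache/cache_mode
# Disable the sequential-bypass heuristic (the default cutoff is 4 MiB) so
# that all writes go through the SSD cache.
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff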
1. Kernel build over NFS with the usual setup: 27m38s
2. Kernel build over NFS with xfs+bcache using two (RAID1) SSDs: 10m27s