On Wed, Jun 5, 2024, at 7:05 PM, Dave Chinner wrote:
> On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
>> I am experimenting with the use of dm-integrity underneath dm-raid,
>> to get around the problem where, if a RAID 1 or RAID 5 array is
>> inconsistent, you may not know which copy is the good one. I have
>> found a reproducible hard lockup involving XFS, RAID 5 and
>> dm-integrity.
>
> I don't think there's any lockup or kernel bug here - this just looks
> to be a case of having a really, really slow storage setup and
> everything waiting for a huge amount of IO to complete to make
> forwards progress.
...
> Userspace stalls on writes because there are too many dirty pages
> in RAM. It throttles all incoming writes, waiting for background
> writeback to clean dirty pages. Data writeback requires block
> allocation, which requires metadata modification. Metadata
> modification requires journal space reservations, which block waiting
> for metadata writeback IO to complete. There are hours of metadata
> writeback needed to free journal space, so everything pauses waiting
> for metadata IO completion.

This makes a lot of sense.

> RAID 5 writes are slow with spinning disks. dm-integrity makes writes
> even slower. If your storage array can sustain more than 50 random
> 4kB writes a second, I'd be very surprised. It's going to be -very
> slow-.

I wiped the contents of the filesystem and ran bonnie++ on it in direct
I/O mode with 4k block writes, skipping the one-character write and
small-file creation tests. This is what I got:

Version 2.00        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
64G:4k::65536                 15.8m  19 60.5m  26           218m   31 279.1  13
Latency                         659ms     517ms             61146us    3052ms

I think this is doing seek-and-read, not seek-and-write, but 300 random
reads per second is still really damn slow compared to the sequential
performance. And it didn't lock up (with the hung task timeout left at
its default of two minutes), so that also tends to confirm your
hypothesis -- direct I/O means no write backlog.

(Do you know of a good way to benchmark seek-and-write performance,
ideally directly on a block device instead of having a filesystem
present?)
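The closest thing I have come up with myself is the crude Python sketch
below: a timed loop of O_DIRECT 4kB writes at random offsets on the
bare block device. The device path and run length are placeholders,
and of course it scribbles over whatever is on the device, so it is
only for a scratch array:

#!/usr/bin/env python3
# Crude random-write microbenchmark: O_DIRECT 4kB writes at random,
# block-aligned offsets on a raw block device.
# /dev/md_scratch is a placeholder -- point it at a device whose
# contents you do not mind destroying.
import mmap
import os
import random
import time

DEV = "/dev/md_scratch"   # placeholder; ALL DATA ON IT WILL BE OVERWRITTEN
BLOCK = 4096              # write size; keep it a multiple of the sector size
SECONDS = 60              # how long to hammer the device

fd = os.open(DEV, os.O_WRONLY | os.O_DIRECT)
nblocks = os.lseek(fd, 0, os.SEEK_END) // BLOCK

# O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned,
# which is more than enough.
buf = mmap.mmap(-1, BLOCK)
buf.write(os.urandom(BLOCK))

writes = 0
deadline = time.monotonic() + SECONDS
while time.monotonic() < deadline:
    os.lseek(fd, random.randrange(nblocks) * BLOCK, os.SEEK_SET)
    os.write(fd, buf)
    writes += 1

os.close(fd)
print(f"{writes / SECONDS:.1f} random {BLOCK}-byte writes/sec")

It only keeps one write in flight at a time, so it probably understates
what the array could do with a deeper queue, but it should at least
tell me whether your 50-writes-per-second estimate is in the right
ballpark. If there is a standard tool for this I would rather use it.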
I don't actually care how slow it is to write things to this array,
because (if I can ever get it working) it's meant to be archival
storage, written to only rarely. But I do need to get this tarball
unpacked, I'd prefer the runtime of 'tar' to correspond closely to the
actual time required to get the data all the way to stable storage,
and disabling the hung task timeout seems like a kludge.

...
> So a 1.6GB journal can buffer hundreds of thousands of dirty 4kB
> metadata blocks with writeback pending. Once the journal is full,
> however, the filesystem has to start writing them back to make space
> in the journal for new incoming changes. At this point, the
> filesystem will throttle incoming metadata modifications to the rate
> at which it can remove dirty metadata from the journal. i.e. it will
> throttle incoming modifications to the sustained random 4kB write
> rate of your storage hardware.
>
> With at least a quarter of a million random 4kB writes pending in the
> journal when it starts throttling, I'd suggest that you're looking at
> several hours of waiting just to flush the journal, let alone
> complete the untar process which will be generating new metadata all
> the time....

This reminds me of the 'bufferbloat' phenomenon over in networking
land. Would it help to reduce the size of the journal to something
like 6MB, which (assuming 50 random writes per second) would take only
about 30 seconds to flush? Is a journal that small, for a filesystem
this large, likely to cause other problems?

Are there any other tuning knobs you can think of that might restrict
the rate of incoming metadata modifications from 'tar' to a
sustainable level from the get-go, instead of letting it barge ahead
and then hit a wall? I'm inclined to doubt that VM-level writeback
controls (as suggested elsethread) will help much, since they would
not change how much data can pile up in the filesystem's journal, but
I could be wrong.

Thanks for your help so far.
zw