On Thu, Jun 06, 2024 at 11:48:57AM -0400, Zack Weinberg wrote:
> On Wed, Jun 5, 2024, at 7:05 PM, Dave Chinner wrote:
> > On Wed, Jun 05, 2024 at 02:40:45PM -0400, Zack Weinberg wrote:
> >> I am experimenting with the use of dm-integrity underneath dm-raid,
> >> to get around the problem where, if a RAID 1 or RAID 5 array is
> >> inconsistent, you may not know which copy is the good one. I have
> >> found a reproducible hard lockup involving XFS, RAID 5 and
> >> dm-integrity.
> >
> > I don't think there's any lockup or kernel bug here - this just looks
> > to be a case of having a really, really slow storage setup and
> > everything waiting for a huge amount of IO to complete to make
> > forwards progress.
> ...
> > Userspace stalls on writes because there are too many dirty pages
> > in RAM. It throttles all incoming writes, waiting for background
> > writeback to clean dirty pages. Data writeback requires block
> > allocation which requires metadata modification. Metadata modification
> > requires journal space reservations which block waiting for metadata
> > writeback IO to complete. There are hours of metadata writeback needed
> > to free journal space, so everything pauses waiting for metadata IO
> > completion.
>
> This makes a lot of sense.
>
> > RAID 5 writes are slow with spinning disks. dm-integrity makes writes
> > even slower. If your storage array can sustain more than 50 random 4kB
> > writes a second, I'd be very surprised. It's going to be -very slow-.
>
> I wiped the contents of the filesystem and ran bonnie++ on it in direct

Wow, that is sooooo 2000's :)

> I/O mode with 4k block writes, skipping the one-character write and
> small file creation tests. This is what I got:
>
> Version 2.00       ------Sequential Output------ --Sequential Input- --Random-
>                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Name:Size etc       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> 64G:4k::65536                 15.8m  19 60.5m  26           218m  31 279.1  13
> Latency                          659ms     517ms            61146us    3052ms
>
> I think this is doing seek-and-read, not seek-and-write, but 300 random
> reads per second is still really damn slow compared to the sequential
> performance. And it didn't lock up (with unchanged hung task timeout of
> two minutes) so that also tends to confirm your hypothesis -- direct I/O
> means no write backlog.
>
> (Do you know of a good way to benchmark seek-and-write
> performance, ideally directly on a block device instead of having
> a filesystem present?)

fio. Use it with direct=1, bs=4k, rw=randwrite and you can point it at
either a file or a block device. There's an example invocation further
down.

> I don't actually care how slow it is to write things to this array,
> because (if I can ever get it working) it's meant to be archival
> storage, written to only rarely. But I do need to get this tarball
> unpacked, and I'd prefer it if the runtime of 'tar' would correspond
> closely to the actual time required to get the data all the way to
> stable storage, and disabling the hung task timeout seems like a kludge.

The hung task timeout is intended to capture deadlocks that are forever,
not something that is blocking because it has to wait for a hundred
thousand IOs to complete at 50 IOPS. When you have storage this slow and
data sets this big, you have to tune these detectors so they don't
report false positives. What you are doing is so far out of the "normal
operation" window that it's no surprise that you're getting false
positive hang detections like this.
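If you want to keep the detector armed rather than turning it off
entirely, the knob is the hung_task_timeout_secs sysctl. Something like
this should work (the 600s value is only an illustration - pick whatever
margin suits your array):

    # raise the hung task warning threshold from the default 120s
    sysctl -w kernel.hung_task_timeout_secs=600

And for the seek-and-write question above, a minimal fio invocation
along these lines should give you a sustained random 4kB write number
(the device path is a placeholder, and note it will overwrite whatever
is on that device):

    fio --name=randwrite --filename=/dev/your-md-device --direct=1 \
        --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
        --runtime=120 --time_based --group_reporting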
> ...
> > So a 1.6GB journal can buffer hundreds of thousands of dirty 4kB
> > metadata blocks with writeback pending. Once the journal is full,
> > however, the filesystem has to start writing them back to make space
> > in the journal for new incoming changes. At this point, the filesystem
> > will throttle incoming metadata modifications to the rate at which it
> > can remove dirty metadata from the journal. i.e. it will throttle
> > incoming modifications to the sustained random 4kB write rate of your
> > storage hardware.
> >
> > With at least a quarter of a million random 4kB writes pending in the
> > journal when it starts throttling, I'd suggest that you're looking at
> > several hours of waiting just to flush the journal, let alone complete
> > the untar process which will be generating new metadata all the
> > time....
>
> This reminds me of the 'bufferbloat' phenomenon over in networking land.

Yes, exactly. Storage has a bandwidth delay product, just like networks,
and when you put huge buffers in front of something with low bandwidth
and long round trip latencies to try to maintain high throughput, it
generally goes really bad the moment interactivity is required.

> Would it help to reduce the size of the journal to something like 6MB,
> which (assuming 50 random writes per second) would take only 30s to
> flush?

That's taking journal sizes to the other extreme.

> Is a journal that small, for a filesystem this large, likely to
> cause other problems?

Definitely. e.g. not having enough journal space to allow aggregation of
changes to the same structures in memory before they are written to
disk. This alone will increase the required journal bandwidth for any
given workload by 1-2 orders of magnitude. It will also increase the
amount of metadata writeback by similar amounts because the window for
relogging already dirty objects is now tiny compared to the dataset you
are creating.

IOWs, when you have low bandwidth, seek limited storage, making the
journal too small can be much worse for performance than having a really
large journal and hitting the problems you are seeing.

A current mkfs.xfs defaults to a minimum log size of 64MB - that's
probably a good choice for this particular storage setup as it's large
enough to soak up short bursts of metadata activity, but not so large
that it pins GBs of dirty metadata in RAM that stuff will stall on.

> Are there any other tuning knobs you can think of
> that might restrict the rate of incoming metadata modifications from
> 'tar' to a sustainable level from the get-go, instead of barging ahead
> and then hitting a wall?
> I'm inclined to doubt that VM-level writeback
> controls (as suggested elsethread) will help much, since they would not
> change how much data can pile up in the filesystem's journal, but I
> could be wrong.

No matter what you do you are going to have the workload throttled to
disk speed -somewhere-. Reducing the dirty limits on the page cache will
help in the same way that reducing the journal size will, and that
should help improve interactivity a bit.

But, fundamentally, the data set is much larger than RAM and so it will
get written at disk speed and that means worst case latencies for
anything the kernel does can be very high.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx