Re: Issues with large chunk size (16Mb)

On Sun, Nov 18, 2018 at 8:01 PM Ed Spiridonov <edo.rus@xxxxxxxxx> wrote:
>
> I've set up a server with a large amount of disk space available (10x10TB HDDs).
> This server should deliver files (over HTTP) to many clients; the
> typical file size is a few MB.
> I use RAID 6 and XFS.
> I decided to make the chunk size as large as possible.
>
> My reasoning is:
> HDD performance is mostly limited by seeks.
> With the default chunk size (512KB), reading a 4MB file touches 8 HDDs (8 seeks).
> With a large chunk size, only one HDD is touched (1 seek).
>
> So I created the array with the maximum possible chunk size (16MB).
> And I have issues with this array:
> https://bugzilla.kernel.org/show_bug.cgi?id=201331
>
> I have another server with a similar setup and did some tests on it.
> As expected, a large chunk size provides significantly better
> multithreaded large-block read performance.
> But write performance drops with a chunk size over 4MB.
> So I set up the second server with a 4MB chunk size, and I have no
> such deadlocks on that server.
>
> Now I tried to change the chunk size on the first server, but with no success:
> # mdadm --grow /dev/md3 --chunk=4096  --backup-file=/home/md3-backup
> chunk size for /dev/md3 set to 16777216
>
> (and no changes in /proc/mdstat)
>
> I have some questions:
> 1. Is a deadlock under load expected behavior with a 16MB chunk size,
> or is it a bug that should be fixed?
> 2. Is it possible to reshape an existing RAID to a smaller chunk size
> (without data loss)?
> 3. Why does a chunk size over 4MB cause bad write performance?

I'm going to guess that you're getting some small file writes, even if
they're just file system metadata updates scattered about the address
space. And small writes to many scattered areas mean a 16MB * 8 = 128MB
full stripe is involved every time a stripe gets read, modified, and
written. For raid5 there's an optimization at what I think is sector
granularity, so for just 512b of file system metadata changing, you'd
get a 512b sector write for the data (the fs metadata) and a 512b
sector write for parity. But I don't think there's any such
optimization for raid6, so maybe someone else can answer that part.
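
(For anyone following along, the small-write shortcut I mean is just
the generic raid5 read-modify-write math, from memory rather than from
the md code itself:

    new parity = old parity XOR old data XOR new data

so a 512b metadata change costs two sector reads and two sector writes
instead of rewriting data across the whole stripe.)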

If the use case were something more like WORM, where there's
insignificant modification of the file system, then maybe 16MB strips
would be OK. But the vast majority of use cases where data is churning
are better off with a 64KB strip size. You want to maximize full stripe
writes and avoid as much RMW as possible. And you're pretty much
guaranteeing a ton of RMW when any small change requires writing out a
128MB stripe.
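
If you do rebuild with a smaller chunk, it's also worth making sure XFS
is aligned to the new geometry. A rough sketch of what I'd run on
scratch disks (device names are placeholders, and --create destroys
existing data; mdadm's --chunk is in KiB):

# mdadm --create /dev/mdX --level=6 --raid-devices=10 --chunk=64 /dev/sd[b-k]
# mkfs.xfs -d su=64k,sw=8 /dev/mdX

Here su is the stripe unit (one chunk) and sw is the number of data
disks (10 minus 2 for parity). mkfs.xfs usually detects this from md on
its own, but it's easy to confirm with xfs_info afterwards.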


-- 
Chris Murphy


