Re: Issues with large chunk size (16Mb)


 



On Sat, Nov 24 2018, Chris Murphy wrote:

> On Sun, Nov 18, 2018 at 8:01 PM Ed Spiridonov <edo.rus@xxxxxxxxx> wrote:
>>
>> I've set up a server with a large amount of disk space (10x10TB HDDs).
>> This server should deliver files over HTTP to many clients; a typical
>> file is several MB.
>> I use RAID 6 and XFS.
>> I decided to make the chunk size as large as possible.
>>
>> My reasoning is:
>> HDD performance is mostly limited by seeks.
>> With the default chunk size (512KB), reading a 4MB file touches 8 HDDs (8 seeks).
>> With a large chunk size, only one HDD is touched (1 seek).

Reads prefer large chunks, writes prefer small chunks.

>>
>> So I created the array with the maximum possible chunk size (16MB).
>> And I have issues with this array.
>> https://bugzilla.kernel.org/show_bug.cgi?id=201331
>>
>> I have another server with a similar setup. I did some tests on it.
>> As expected, the large chunk size provides significantly better
>> multithreaded large-block read performance.
>> But write performance drops with chunk sizes over 4MB.
>> So I set up the second server with a 4MB chunk size, and I have no such
>> deadlocks on that server.
>>
>> Now I tried to change the chunk size on the first server, but with no success:
>> # mdadm --grow /dev/md3 --chunk=4096  --backup-file=/home/md3-backup
>> chunk size for /dev/md3 set to 16777216

Hmmm - that's a bug: the message reports the old chunk size rather than
the one just requested.  In Grow.c (in mdadm)

				printf("chunk size for %s set to %d\n",
				       devname, array.chunk_size);

should be

				printf("chunk size for %s set to %d\n",
				       devname, info->new_chunk);

You shouldn't need --backup-file if kernel and mdadm are reasonably
recent.

What kernel and what mdadm are you using?  What does "mdadm --examine"
of the devices show?


>>
>> (and no changes in /proc/mdstat)
>>
>> I have some questions:
>> 1. Is a deadlock under load expected behavior with a 16MB chunk size?
>> Or is it a bug that should be fixed?

It's a bug.  Maybe it can be fixed by calling
  md_wakeup_thread(bitmap->mddev->thread);
in md_bitmap_startwrite() just before the call to schedule().


>> 2. Is it possible to reshape existing RAID with smaller chunk size?
>> (without data loss)

Yes.

>> 3. Why do chunk sizes over 4MB cause bad write performance?

The larger the chunk size, the larger a full stripe, so the more writes
fall back to read-modify-write cycles (read old data and parity,
recompute, write back).  With smaller chunk sizes, a write can cover
whole stripes and doesn't need to read anything first.

NeilBrown
(I didn't get the original email due to email problems, so I'm replying
to a reply).


