Re: Issues with large chunk size (16Mb)

On Wed, Nov 28 2018, Ed Spiridonov wrote:

> On Tue, Nov 27, 2018 at 3:51 AM NeilBrown <neilb@xxxxxxxx> wrote:
>
>> >> Now I tried to change the chunk size on the first server, but with
>> >> no success:
>> >> # mdadm --grow /dev/md3 --chunk=4096  --backup-file=/home/md3-backup
>> >> chunk size for /dev/md3 set to 16777216
>>
>> Hmmm - that's a bug.  In Grow.c (in mdadm)
>>
>>                                 printf("chunk size for %s set to %d\n",
>>                                        devname, array.chunk_size);
>>
>> should be
>>
>>                                 printf("chunk size for %s set to %d\n",
>>                                        devname, info->new_chunk);
>
> I see. But it shouldn't prevent the reshape from running (nothing
> happens except this message).
>
>> You shouldn't need --backup-file if kernel and mdadm are reasonably
>> recent.
>>
>> What kernel and what mdadm are you using?  What does "mdadm --examine"
>> of the devices show?
>
> I started with 4.18.12 from Debian backports; later I switched to a
> vanilla kernel. 4.20-rc4 is used now.

Any of these kernels should be able to reshape a raid6 without a backup
file.

>
> mdadm debian package version 3.4-4+b1
>
> # mdadm --version
> mdadm - v3.4 - 28th January 2016

I think the reshape-without-a-backup landed in 3.3.  It is certainly in 3.4.

>
> I uploaded mdadm --examine output to
> https://bugzilla.kernel.org/show_bug.cgi?id=201331

This only shows
     Raid Level : raid10
and
     Chunk Size : 512K

I thought the problem was with RAID6 and a chunk size of 16M ??

In any case, an important detail from this is:

   Unused Space : before=98216 sectors, after=32768 sectors

To perform a reshape-without-a-backup there needs to be at least one
chunk of unused space (at the larger of the two chunk sizes) either
before or after the data.
Here there is plenty of room, so a
  mdadm --grow --chunk=4M /dev/md2
should work.


>
> Also I uploaded dmesg output with CONFIG_LOCKDEP=y
>

All the dmesg logs (except the raid10 one) show the same basic problem,
which I suspect the md_wakeup_thread() change (sketched below) will fix.
If the above --grow command doesn't work, the output of dmesg and mdadm
--examine immediately after the failed attempt might be useful.
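
To be concrete, the change I'm suggesting looks something like this (the
surrounding context is quoted from memory of md_bitmap_startwrite() in
drivers/md/md-bitmap.c, so it may not match your tree exactly):

		if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
			DEFINE_WAIT(__wait);
			prepare_to_wait(&bitmap->overflow_wait, &__wait,
					TASK_UNINTERRUPTIBLE);
			spin_unlock_irq(&bitmap->counts.lock);
			/* proposed fix: kick the md thread before sleeping so
			 * pending writes get processed and the counter can
			 * drop, instead of blocking here indefinitely.
			 */
			md_wakeup_thread(bitmap->mddev->thread);
			schedule();
			finish_wait(&bitmap->overflow_wait, &__wait);
			continue;
		}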

>
>> >> I have some questions:
>> >> 1. Is a deadlock under load expected behavior with a 16MB chunk size?
>> >> Or is it a bug that should be fixed?
>>
>> It's a bug.  Maybe it can be fixed by calling
>>   md_wakeup_thread(bitmap->mddev->thread);
>> in md_bitmap_startwrite() just before the call to schedule().
>
> Ok, I'll make a try.
>
>> >> 2. Is it possible to reshape an existing RAID to a smaller chunk size
>> >> (without data loss)?
>>
>> Yes.
>
> I have not managed to do so yet.

Please try without --backup-file.

>
>> >> 3. Why does a chunk size over 4MB cause bad write performance?
>>
>> The larger the chunk size, the more read-modify-write cycles are
>> needed.  With smaller chunk sizes, a write can cover whole stripes, and
>> doesn't need to read anything.
>
> I found a threshold value of 4MiB. With a chunk size above that, a
> random write test produces lots of reads, even if stripe_cache_size is
> set to its maximum (32768).

4MiB chunks mean that 1024 entries in the stripe cache (4K pages) are
needed for one full stripe.  32768 entries should hold 32 full stripes.

But random-write will always generate lots of reads, unless the size of
each write is a full stripe - properly aligned.

With 10 drives and 4MiB chunks, a full stripe holds 32MiB of data.  Any
random write smaller than that *must* read some data from the array to
be able to update the parity blocks.
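
Spelling that arithmetic out as a tiny stand-alone program (the 4K page
size, the 10 drives and the 2 parity devices are just the numbers from
this thread, not anything read from your array):

	#include <stdio.h>

	int main(void)
	{
		unsigned long chunk  = 4UL << 20;  /* 4 MiB chunk size */
		unsigned long page   = 4UL << 10;  /* stripe cache works in 4K pages */
		unsigned long cache  = 32768;      /* the stripe_cache_size maximum above */
		unsigned long drives = 10, parity = 2;

		printf("stripe cache entries per full stripe: %lu\n",
		       chunk / page);                      /* 1024 */
		printf("full stripes held by a full cache:  %lu\n",
		       cache / (chunk / page));            /* 32 */
		printf("full-stripe (data) write size:      %lu MiB\n",
		       ((drives - parity) * chunk) >> 20); /* 32 */
		return 0;
	}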

> I do not understand why. IMHO if the write block is smaller than the
> chunk, it shouldn't matter how large the chunk size is.

True.  If the write is smaller than the chunk size then a single write
will be implemented as:
 - read the old data, and the P and Q blocks
 - calculate the new P and Q (P' = P - D + D', and similarly for Q)
 - write new data, new P and new Q

so one write becomes 3 reads and 3 writes.
If the write is bigger than the chunk size, the overhead of updating P
and Q reduces.  When the write is N-2 times the chunk size (and properly
aligned), the overhead disappears entirely.
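
As a toy illustration of the read-modify-write step for the P block (Q is
updated the same way, but with Galois-field multiplies that are left out
here; the function name is made up for the example):

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * P' = P xor D_old xor D_new: given the old data, the new data and
	 * the old parity, the new parity can be computed without reading
	 * any of the other data chunks in the stripe; hence the 3 reads
	 * (D_old, P, Q) and 3 writes (D_new, P', Q') above.
	 */
	void rmw_update_p(uint8_t *p, const uint8_t *d_old,
			  const uint8_t *d_new, size_t len)
	{
		for (size_t i = 0; i < len; i++)
			p[i] ^= d_old[i] ^ d_new[i];
	}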

>
> BTW, why is stripe_cache_size limited to 32768? It seems that this
> limit could be safely increased for machines with a lot of RAM.

Historical reasons.  The stripe_cache_size is just a minimum.  When
there is demand and available memory, more are allocated automatically.

NeilBrown

