Re: Issues with large chunk size (16Mb)

On Tue, Nov 27, 2018 at 3:51 AM NeilBrown <neilb@xxxxxxxx> wrote:

> >> Now I tried to change the chunk size on the first server, but no success:
> >> # mdadm --grow /dev/md3 --chunk=4096  --backup-file=/home/md3-backup
> >> chunk size for /dev/md3 set to 16777216
>
> Hmmm - that's a bug.  In Grow.c (in mdadm)
>
>                                 printf("chunk size for %s set to %d\n",
>                                        devname, array.chunk_size);
>
> should be
>
>                                 printf("chunk size for %s set to %d\n",
>                                        devname, info->new_chunk);

I see. But that by itself shouldn't prevent the array from reshaping
(nothing happens apart from this message).

> You shouldn't need --backup-file if kernel and mdadm are reasonably
> recent.
>
> What kernel and what mdadm are you using?  What does "mdadm --examine"
> of the devices show?

I started with 4.18.12 from Debian backports and later switched to a
vanilla kernel; 4.20-rc4 is in use now.

mdadm Debian package version: 3.4-4+b1

# mdadm --version
mdadm - v3.4 - 28th January 2016

I uploaded the mdadm --examine output to
https://bugzilla.kernel.org/show_bug.cgi?id=201331

I also uploaded the dmesg output from a kernel built with CONFIG_LOCKDEP=y.


> >> I have some questions:
> >> 1. Is deadlock under load an expected behavior with 16Mb chunk size?
> >> Or it is a bug and should be fixed?
>
> It's a bug.  Maybe it can be fixed by calling
>   md_wakeup_thread(bitmap->mddev->thread);
> in md_bitmap_startwrite() just before the call to schedule().

OK, I'll give it a try.
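
For reference, here is a rough sketch of how I read that suggestion,
against the overflow wait in md_bitmap_startwrite() in
drivers/md/md-bitmap.c (the surrounding code may differ slightly
between kernel versions):

    if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
            DEFINE_WAIT(__wait);
            prepare_to_wait(&bitmap->overflow_wait, &__wait,
                            TASK_UNINTERRUPTIBLE);
            spin_unlock_irq(&bitmap->counts.lock);
            /* proposed addition: kick the md thread so it can flush
             * the bitmap and drop the counters we are waiting on */
            md_wakeup_thread(bitmap->mddev->thread);
            schedule();
            finish_wait(&bitmap->overflow_wait, &__wait);
            continue;
    }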

> >> 2. Is it possible to reshape existing RAID with smaller chunk size?
> >> (without data loss)
>
> Yes.

I have not managed to do it yet.
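
For completeness, the command I will retry, this time without
--backup-file as suggested above (the target chunk size is just my
current choice):

    # mdadm --grow /dev/md3 --chunk=4096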

> >> 3. Why chunk size over 4Mb causes bad write performance?
>
> The larger the chunk size, the more read-modify-write cycles are
> needed.  With smaller chunk sizes, a write can cover whole stripes, and
> doesn't need to read anything.

I found a threshold value of 4 MiB: with a chunk size above it, the
random write test produces lots of reads even when stripe_cache_size is
set to its maximum (32768). I do not understand why; IMHO, if the write
block is smaller than the chunk, it shouldn't matter how large the
chunk size is.
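
A back-of-the-envelope calculation, assuming (as I understand the raid5
code) that each stripe_cache_size entry covers one 4 KiB page per
member device:

    coverage per device        = 32768 * 4 KiB  = 128 MiB
    full 16 MiB chunks covered = 128 MiB / 16 MiB =  8
    full  4 MiB chunks covered = 128 MiB /  4 MiB = 32

So with 16 MiB chunks the cache can only hold a handful of chunk-sized
columns at once, and random writes would rarely accumulate a full
stripe before it is evicted, forcing read-modify-write. That is only my
guess, of course.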

BTW, why is stripe_cache_size limited to 32768? It seems that this
limit could be safely increased for machines with a lot of RAM.
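
If I read the code correctly, each stripe cache entry holds one page
per member device, so the memory cost is roughly
stripe_cache_size * nr_disks * PAGE_SIZE, e.g. (with a hypothetical
10-device array):

    32768 * 10 * 4 KiB ≈ 1.25 GiB

which looks affordable on machines with plenty of RAM.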


