On Wed, Nov 28 2018, Ed Spiridonov wrote:

> On Tue, Nov 27, 2018 at 3:51 AM NeilBrown <neilb@xxxxxxxx> wrote:
>
>> >> Now I tried to change the chunk size on the first server, but with
>> >> no success:
>> >> # mdadm --grow /dev/md3 --chunk=4096 --backup-file=/home/md3-backup
>> >> chunk size for /dev/md3 set to 16777216
>>
>> Hmmm - that's a bug.  In Grow.c (in mdadm)
>>
>>     printf("chunk size for %s set to %d\n",
>>            devname, array.chunk_size);
>>
>> should be
>>
>>     printf("chunk size for %s set to %d\n",
>>            devname, info->new_chunk);
>
> I see. But it shouldn't prevent the array from reshaping (nothing
> happens except this message).
>
>> You shouldn't need --backup-file if kernel and mdadm are reasonably
>> recent.
>>
>> What kernel and what mdadm are you using?  What does "mdadm --examine"
>> of the devices show?
>
> I started with 4.18.12 from debian backports, later I switched to a
> vanilla kernel.
> 4.20-rc4 is used now.

Any of these kernels should be able to reshape a raid6 without a
backup file.

> mdadm debian package version 3.4-4+b1
>
> # mdadm --version
> mdadm - v3.4 - 28th January 2016

I think reshape-without-a-backup landed in 3.3.  It is certainly in
3.4.

> I uploaded the mdadm --examine output to
> https://bugzilla.kernel.org/show_bug.cgi?id=201331

This only shows
    Raid Level : raid10
and
    Chunk Size : 512K

I thought the problem was with RAID6 and a chunk size of 16M??

In any case, an important detail from this is:
    Unused Space : before=98216 sectors, after=32768 sectors

To perform a reshape-without-a-backup there needs to be at least one
chunk (of the larger chunk size) of free space either before or after
the data.  Here there is plenty of room, so

    mdadm --grow --chunk=4M /dev/md2

should work.

> Also I uploaded dmesg output with CONFIG_LOCKDEP=y

All the dmesg logs (except the raid10 one) show the same basic problem,
which I suspect the md_wakeup_thread() change sketched below will fix.

If the above --grow command doesn't work, the output of dmesg and
mdadm --examine immediately after the failed attempt might be useful.

>> >> I have some questions:
>> >> 1. Is a deadlock under load the expected behavior with a 16MiB
>> >> chunk size, or is it a bug that should be fixed?
>>
>> It's a bug.  Maybe it can be fixed by calling
>>     md_wakeup_thread(bitmap->mddev->thread);
>> in md_bitmap_startwrite() just before the call to schedule().
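
To be concrete, I mean something like the following.  This is sketched
from memory against the current md-bitmap.c, so treat the context lines
as approximate and the change itself (the '+' lines) as untested:

    if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
            DEFINE_WAIT(__wait);
            prepare_to_wait(&bitmap->overflow_wait, &__wait,
                            TASK_UNINTERRUPTIBLE);
            spin_unlock_irq(&bitmap->counts.lock);
    +       /* Wake the md thread so it can retire in-flight bitmap
    +        * updates and decrement the counter while we sleep;
    +        * otherwise nothing may ever wake overflow_wait.
    +        */
    +       md_wakeup_thread(bitmap->mddev->thread);
            schedule();
            finish_wait(&bitmap->overflow_wait, &__wait);
            continue;
    }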

> Ok, I'll give it a try.
>
>> >> 2. Is it possible to reshape an existing RAID with a smaller chunk
>> >> size (without data loss)?
>>
>> Yes.
>
> I have not managed to yet.

Please try without --backup-file.

>> >> 3. Why does a chunk size over 4MiB cause bad write performance?
>>
>> The larger the chunk size, the more read-modify-write cycles are
>> needed.  With smaller chunk sizes, a write can cover whole stripes, and
>> doesn't need to read anything.
>
> I found the threshold to be 4MiB. With a chunk size above that, a
> random write test produces lots of reads, even if stripe_cache_size is
> set to its maximum (32768).

4MiB chunks mean 1024 entries in the stripe cache (4K pages) are needed
for one stripe.  32768 entries should hold 32 full stripes.

But random writes will always generate lots of reads, unless the size of
each write is a full stripe - properly aligned.
With 10 drives and 4MiB chunks, a stripe is 32MiB.  Any random write
smaller than that *must* read some data from the array to be able to
update the parity blocks.

> I do not understand why. IMHO, if the write block is smaller than the
> chunk, it shouldn't matter how large the chunk size is.

True.
If the write is smaller than the chunk size, then a single write will be
implemented as:
 - read the old data, and the P and Q blocks
 - calculate new P and Q  (P' = P - D + D', and similarly for Q)
 - write the new data, new P and new Q

so one write becomes 3 reads and 3 writes.

If the write is bigger than the chunk size, the overhead of updating P
and Q reduces.  When the write is N-2 times the chunk size (and
properly aligned), the overhead disappears.
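
To put that arithmetic in code form: in GF(2), addition and subtraction
are both XOR, so P' = P - D + D' is just three XORs per byte.  The
sketch below is illustrative only - rmw_update_p is a made-up name, and
the kernel actually works on 4K stripe_head pages with optimised xor
and Galois-field routines (Q needs an extra per-device multiply):

    #include <stddef.h>

    /* Update the P (xor parity) block in place for a read-modify-write
     * of one data block: P' = P ^ D_old ^ D_new.
     */
    static void rmw_update_p(unsigned char *p,           /* P block */
                             const unsigned char *d_old, /* data read back */
                             const unsigned char *d_new, /* data to write */
                             size_t len)
    {
            size_t i;

            for (i = 0; i < len; i++)
                    p[i] ^= d_old[i] ^ d_new[i];
    }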

> BTW, why is stripe_cache_size limited to 32768? It seems that this
> limit could be safely increased for machines with a lot of RAM.

Historical reasons.
The stripe_cache_size is just a minimum.  When there is demand and
available memory, more are allocated automatically.

NeilBrown