On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@xxxxxx wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch. Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> >
> > This still doesn't make sense to me, because the pmem and brd code
> > have no alignment limitations in their make_request code - they can
> > handle byte addressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> >
> > Digging a little further, I note that both brd and pmem use
> > identical mechanisms to marshall data in and out of bios, so they
> > are likely to have the same issue.
> >
> > So, brd_make_request() does:
> >
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> >
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> >
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec(). An
> > unaligned 4kB io traces out as:
> >
> > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> >
> > i.e.  page              offset  len   sector
> >       00000000a77f0146  768     3328  0x7d0048
> >       000000006ceca91e  0       768   0x7d004e
> >
> > You should be able to guess what the problems are from this.
> >
> > Both pmem and brd are _sector_ based. We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> >
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
>
> Yes. I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages. Ccing Ming as he wrote much of this code. I'll
> also dig into fixing it, but I just arrived in Japan and might be a
> little jetlagged.
>
> > There's nothing quite like being repeatedly bitten by the same
> > misalignment bug because there's no validation in the infrastructure
> > that could catch it immediately and throw a useful warning/error
> > message.
>
> The xen block driver doesn't use bio_for_each_segment, so it isn't
> exactly the same but a very related issue.

Both stem from the fact that nothing in the block layer validates
memory alignment constraints. Xenblk, pmem and brd all return 511
from queue_dma_alignment(), and all break when passed memory that
isn't aligned to 512 bytes.

There aren't even debug options we can turn on that would tell us
this is happening. Instead, we start with data corruption and have
to work backwards to find the needle in the haystack from there.
EIO and a WARN_ONCE would be a massive improvement here - a rough
sketch of the sort of check I mean is further down this mail....

> Maybe until we sort all this mess out we just need to depend on
> !SLUB_DEBUG for XFS?

SLUB_DEBUG=y by itself doesn't cause problems - I run that all the
time because otherwise there's no /proc/slabinfo with slub. I used
KASAN to get the above alignment behaviour - it's SLUB_DEBUG_ON=y
that perturbs the alignment, and I think the same thing can happen
with SLAB_DEBUG=y, so there are several dependencies we'd have to
add here.
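To make that concrete, this is roughly the sort of check I'm asking
for - a completely untested sketch, where the function name and the
idea of calling it once per bio in the submission path are made up
for illustration, not anything that exists today:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Untested sketch: walk the bio and reject anything that doesn't
 * satisfy the queue's dma alignment mask, instead of letting the
 * driver silently do partial sector copies.
 */
static int blk_check_bio_alignment(struct request_queue *q, struct bio *bio)
{
        unsigned int align = queue_dma_alignment(q);    /* e.g. 511 */
        struct bvec_iter iter;
        struct bio_vec bvec;

        bio_for_each_segment(bvec, bio, iter) {
                if ((bvec.bv_offset | bvec.bv_len) & align) {
                        WARN_ONCE(1,
                                  "unaligned bvec: offset %u len %u mask 0x%x\n",
                                  bvec.bv_offset, bvec.bv_len, align);
                        return -EIO;
                }
        }
        return 0;
}

With something like that wired into bio submission, brd and pmem
would fail fast with an obvious message instead of quietly copying
partial sectors twice.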
Is there any way we can pass kmalloc a "use aligned heap" GFP flag
so that it allocates from the -rcl slabs to guarantee alignment,
rather than from the standard slabs that change alignment with
config options? (A rough, untested sketch of a possible interim
workaround - a private slab cache with explicit alignment - follows
below my sig.)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
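P.S. Until something like that kmalloc flag exists, the only
workaround I can see from the filesystem side is a private slab
cache with an explicit alignment - again a completely untested
sketch, with the cache name, object size and init function made up
for illustration:

#include <linux/slab.h>

static struct kmem_cache *xfs_bio_buf_cache;

/*
 * Untested sketch: a dedicated cache whose objects are always 512
 * byte aligned, independent of what the slab debug options do to
 * the generic kmalloc caches.
 */
static int __init xfs_bio_buf_init(void)
{
        xfs_bio_buf_cache = kmem_cache_create("xfs_bio_buf",
                                              4096,     /* object size */
                                              512,      /* sector alignment */
                                              0,        /* flags */
                                              NULL);    /* no constructor */
        return xfs_bio_buf_cache ? 0 : -ENOMEM;
}

/*
 * Callers would then do:
 *      buf = kmem_cache_alloc(xfs_bio_buf_cache, GFP_NOFS);
 */

AFAICT an explicit cache alignment is honoured even with the slub
debug options enabled, but that needs verifying, and it only covers
a single object size per cache - which is why a kmalloc-level answer
would be much nicer.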