On Fri, Jul 01, 2022 at 08:43:32PM +0100, Al Viro wrote: > On Fri, Jul 01, 2022 at 01:28:13PM -0600, Keith Busch wrote: > > On Fri, Jul 01, 2022 at 08:08:37PM +0100, Al Viro wrote: > > > On Fri, Jul 01, 2022 at 12:38:36PM -0600, Keith Busch wrote: > > > > > size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset); > > > > > - if (size > 0) > > > > > - size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev)); > > > > > if (unlikely(size <= 0)) > > > > > return size ? size : -EFAULT; > > > > > + nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE); > > > > > > > > > > + size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev)); > > > > > > > > We still need to return EFAULT if size becomes 0 because that's the only way > > > > bio_iov_iter_get_pages()'s loop will break out in this condition. > > > > > > I really don't like these calling conventions ;-/ > > > > No argument here; I'm just working in the space as I found it. :) > > > > > What do you want to happen if we have that align-down to reduce size? > > > IOW, what should be the state after that loop in the caller? > > > > We fill up the bio to bi_max_vecs. If there are more pages than vectors, then > > the bio is submitted as-is with the partially drained iov_iter. The remainder > > of the iov is left for a subsequent bio to deal with. > > > > The ALIGN_DOWN is required because I've replaced the artificial kernel aligment > > with the underlying hardware's alignment. The hardware's alignment is usually > > smaller than a block size. If the last bvec has a non-block aligned offset, > > then it has to be truncated down, which could result in a 0 size when bi_vcnt > > is already non-zero. If that happens, we just submit the bio as-is, and > > allocate a new one to deal with the rest of the iov. > > Wait a sec. Looks like you are dealing with the round-down in the wrong place - > it applies to the *total* you've packed into the bio, no matter how it is > distributed over the segments you've gathered it from. Looks like it would > be more natural to handle it after the loop in the caller, would it not? > > I.e. > while bio is not full > grab pages > if got nothing > break > pack pages into bio > if can't add a page (bio_add_hw_page() failure) > drop the ones not shoved there > break > see how much had we packed into the sucker > if not a multiple of logical block size > trim the bio, dropping what needs to be dropped > iov_iter_revert() > if nothing's packed > return -EINVAL if it was a failed bio_add_hw_page() > return -EFAULT otherwise > return 0 Validating user requests gets really messy if we allow arbitrary segment lengths. This particular patch just enables arbitrary address alignment, but segment size is still required to be a block size. You found the commit that enforces that earlier, "iov: introduce iov_iter_aligned", two commits prior. The rest of the logic simplifies when we are guaranteed segment size is a block size mulitple: truncating a segment at a block boundary ensures both sides are block size aligned, and we don't even need to consult lower level queue limits, like segment count or segment length. The bio is valid after this step, and can be split into valid bios later if needed.