On Thu, Aug 27, 2020 at 09:44:31AM +0100, Christoph Hellwig wrote: > On Mon, Aug 24, 2020 at 04:16:53PM +0100, Matthew Wilcox (Oracle) wrote: > > Iterate once for each THP instead of once for each base page. > > FYI, I've always been wondering if bio_for_each_segment_all is the > right interface for the I/O completions, because we generally don't > need the fake bvecs for each page. Only the first page can have an > offset, and only the last page can be end earlier than the end of > the page size. > > It would seem way more efficient to just have a helper that extracts > the offset and end, and just use that in a loop that does the way > cheaper iteration over the physical addresses only. This might (or > might) not be a good time to switch to that model for iomap. Something like this? static void iomap_read_end_io(struct bio *bio) { int i, error = blk_status_to_errno(bio->bi_status); for (i = 0; i < bio->bi_vcnt; i++) { struct bio_vec *bvec = &bio->bi_io_vec[i]; size_t offset = bvec->bv_offset; size_t length = bvec->bv_len; struct page *page = bvec->bv_page; while (length > 0) { size_t count = thp_size(page) - offset; if (count > length) count = length; iomap_read_page_end_io(page, offset, count, error); page += (offset + count) / PAGE_SIZE; length -= count; offset = 0; } } bio_put(bio); } Maybe I'm missing something important here, but it's significantly simpler code -- iomap_read_end_io() goes down from 816 bytes to 560 bytes (256 bytes less!) iomap_read_page_end_io is inlined into it both before and after. There is some weirdness going on with regards to bv_offset that I don't quite understand. In the original bvec_advance: bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT); bv->bv_offset = bvec->bv_offset & ~PAGE_MASK; which I cargo-culted into bvec_thp_advance as: bv->bv_page = thp_head(bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT)); page_size = thp_size(bv->bv_page); bv->bv_offset = bvec->bv_offset - (bv->bv_page - bvec->bv_page) * PAGE_SIZE; Is it possible to have a bvec with an offset that is larger than the size of bv_page? That doesn't seem like a useful thing to do, but if that needs to be supported, then the code up top doesn't do that. We maybe gain a little bit by counting length down to 0 instead of counting it up to bv_len. I dunno; reading the code over now, it doesn't seem like that much of a difference. Maybe you meant something different?