On Tue, Dec 20, 2022 at 03:59:39PM -0800, Ira Weiny wrote:
> On Tue, Dec 20, 2022 at 06:34:57PM +0000, Matthew Wilcox wrote:
> > On Tue, Dec 20, 2022 at 08:58:52AM -0800, Ira Weiny wrote:
> > > On Tue, Dec 20, 2022 at 12:18:01PM +0100, Jan Kara wrote:
> > > > On Tue 20-12-22 09:35:43, Matthew Wilcox wrote:
> > > > > But that doesn't solve the "What about fs block size > PAGE_SIZE"
> > > > > problem that we also want to solve. Here's a concrete example:
> > > > >
> > > > >  static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
> > > > >  {
> > > > > -	struct page *page = bh->b_page;
> > > > > +	struct folio *folio = bh->b_folio;
> > > > >  	char *addr;
> > > > >  	__u32 checksum;
> > > > >
> > > > > -	addr = kmap_atomic(page);
> > > > > -	checksum = crc32_be(crc32_sum,
> > > > > -		(void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> > > > > -	kunmap_atomic(addr);
> > > > > +	BUG_ON(IS_ENABLED(CONFIG_HIGHMEM) && bh->b_size > PAGE_SIZE);
> > > > > +
> > > > > +	addr = kmap_local_folio(folio, offset_in_folio(folio, bh->b_data));
> > > > > +	checksum = crc32_be(crc32_sum, addr, bh->b_size);
> > > > > +	kunmap_local(addr);
> > > > >
> > > > >  	return checksum;
> > > > >  }
> > > > >
> > > > > I don't want to add a lot of complexity to handle the case of b_size
> > > > > > PAGE_SIZE on a HIGHMEM machine since that's not going to benefit terribly
> > > > > many people. I'd rather have the assertion that we don't support it.
> > > > > But if there's a good higher-level abstraction I'm missing here ...
> > > >
> > > > Just out of curiosity: So far I was thinking folio is physically contiguous
> > > > chunk of memory. And if it is, then it does not seem as a huge overkill if
> > > > kmap_local_folio() just maps the whole folio?
> > >
> > > Willy proposed that previously but we could not come to a consensus on how to
> > > do it.
> > >
> > > https://lore.kernel.org/all/Yv2VouJb2pNbP59m@iweiny-desk3/
> > >
> > > FWIW I still think increasing the entries to cover any foreseeable need would
> > > be sufficient because HIGHMEM does not need to be optimized. Couldn't we hide
> > > the entry count into some config option which is only set if a FS needs a
> > > larger block size on a HIGHMEM system?
> >
> > "any foreseeable need"? I mean ... I'd like to support 2MB folios,
> > even on HIGHMEM machines, and that's 512 entries. If we're doing
> > memcpy_to_folio(), we know that's only one mapping, but still, 512
> > entries is _a lot_ of address space to be reserving on a 32-bit machine.
>
> I'm confused. A memcpy_to_folio() could loop to map the pages as needed
> depending on the amount of data to copy. Or just map/unmap in a loop.
>
> This seems like an argument to have a memcpy_to_folio() to hide such nastiness
> on HIGHMEM from the user.

I see that you are confused. What I'm not quite sure of is how I
confused you, so I'm just going to try again in different words.

Given the desire to support 2MB folios on x86/ARM PAE systems, we can't
have a kmap_local_entire_folio() because that would take up too much
address space. But we can have a kmap_local_buffer() /
kunmap_local_buffer(). We can restrict the maximum fs block size
(== bh->b_size) to a reasonably small multiple of PAGE_SIZE, eg 16.
That will let us kmap the entire buffer, after making some of the
changes described below.

That solves the jbd2_checksum_data() problem above, but isn't
necessarily the best solution for every filesystem "need to copy to a
folio" problem.
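
Here's a rough sketch of what I mean by kmap_local_buffer(), just to
make that concrete. It's completely untested; BH_MAX_PAGES,
__kmap_local_pfns() and kunmap_local_buffer() don't exist yet, and all
the names are up for debate:

#define BH_MAX_PAGES	16	/* provisional cap: b_size <= 16 * PAGE_SIZE */

/* Map an entire buffer, which may span several pages of its folio */
void *kmap_local_buffer(struct buffer_head *bh)
{
	struct folio *folio = bh->b_folio;
	size_t offset = offset_in_folio(folio, bh->b_data);
	unsigned int n = DIV_ROUND_UP(offset_in_page(offset) + bh->b_size,
				      PAGE_SIZE);

	BUG_ON(n > BH_MAX_PAGES);

	if (!folio_test_highmem(folio))
		return folio_address(folio) + offset;

	/* the hypothetical n-page version of __kmap_local_pfn_prot() */
	return __kmap_local_pfns(folio_pfn(folio) + offset / PAGE_SIZE, n,
				 PAGE_KERNEL) + offset_in_page(offset);
}

/* Unmap needs the bh (or a count) to know how many slots to release */
void kunmap_local_buffer(struct buffer_head *bh, void *addr);

and then jbd2_checksum_data() loses the BUG_ON and the PAGE_SIZE limit:

static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
{
	char *addr;
	__u32 checksum;

	addr = kmap_local_buffer(bh);
	checksum = crc32_be(crc32_sum, addr, bh->b_size);
	kunmap_local_buffer(bh, addr);

	return checksum;
}

That keeps all the HIGHMEM ugliness out of jbd2, but it only helps
callers that already have a buffer_head in hand.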

So I think we do want memcpy_to/from_folio(), split out like the
current zero_user_segments(). I also think we want a
copy_folio_from_iter_atomic(). Right now iomap_write_iter() is a bit
of a mess; it retrieves a multi-page folio from the page cache
multiple times instead of copying as much as it can from userspace to
the folio. There are some interesting issues to deal with here, but
putting it in iov_iter.c is better than hiding it in the iomap code.

> > I don't know exactly what the address space layout is on x86-PAE or
> > ARM-PAE these days, but as I recall, the low 3GB is user and the high
> > 1GB is divided between LOWMEM and VMAP space; something like 800MB of
> > LOWMEM and 200MB of vmap/kmap/PCI iomem/...
> >
> > Where I think we can absolutely get away with this reasoning is having
> > a kmap_local_buffer(). It's perfectly reasonable to restrict fs block
> > size to 64kB (after all, we've been limiting it to 4kB on x86 for thirty
> > years), and having a __kmap_local_pfns(pfn, n, prot) doesn't seem like
> > a terribly bad idea to me.
> >
> > So ... is this our path forward:
> >
> > - Introduce a complex memcpy_to/from_folio() in highmem.c that mirrors
> >   zero_user_segments()
> > - Have a simple memcpy_to/from_folio() in highmem.h that mirrors
> >   zero_user_segments()
>
> I'm confused again. What is the difference between the complex/simple other
> than inline vs not?
>
> > - Convert __kmap_local_pfn_prot() to __kmap_local_pfns()
>
> I'm not sure I follow this need but I think you are speaking of having the
> mapping of multiple pages in a tight loop in the preemption disabled region?
>
> Frankly, I think this is an over optimization for HIGHMEM. Just loop calling
> kmap_local_page() (either with or without an unmap depending on the details.)

See the jbd2_checksum_data() example at the top, and design me a better
API that doesn't involve putting complexity into jbd2 ;-)

> > - Add kmap_local_buffer() that can handle buffer_heads up to, say, 16x
> >   PAGE_SIZE
>
> I really just don't know the details of the various file systems.[*] Is this
> something which could be hidden in Kconfig magic and just call this
> kmap_local_folio()?
>
> My gut says that HIGHMEM systems don't need large block size FS's. So could
> large block size FS's be limited to !HIGHMEM configs?

They could, and that's the current approach, but it does seem plausible
that we could support HIGHMEM systems with fs-block-size > PAGE_SIZE
with only a little extra work.

> [*] I only play a file system developer on TV. ;-)

That's OK, I'm only pretending to be an MM developer. Keep quiet, and
I think we can get away with this.
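
PS: for memcpy_to_folio() above, this is roughly the shape I'm
imagining, mirroring the zero_user_segments() split between highmem.h
and highmem.c. None of it exists yet and it's untested, so take the
details with a grain of salt. The simple !HIGHMEM version in highmem.h:

static inline void memcpy_to_folio(struct folio *folio, size_t offset,
		const char *from, size_t len)
{
	VM_BUG_ON(offset + len > folio_size(folio));
	/* The whole folio is in the direct map; one memcpy does it */
	memcpy(folio_address(folio) + offset, from, len);
	flush_dcache_folio(folio);
}

and the complex HIGHMEM one in highmem.c, copying a page at a time
through kmap_local_folio():

void memcpy_to_folio(struct folio *folio, size_t offset,
		const char *from, size_t len)
{
	VM_BUG_ON(offset + len > folio_size(folio));

	while (len) {
		/* Copy up to the end of the page containing 'offset' */
		size_t chunk = min_t(size_t, len,
				     PAGE_SIZE - offset_in_page(offset));
		char *to = kmap_local_folio(folio, offset);

		memcpy(to, from, chunk);
		kunmap_local(to);

		from += chunk;
		offset += chunk;
		len -= chunk;
	}
	flush_dcache_folio(folio);
}

memcpy_from_folio() would be the same loop with the memcpy arguments
swapped. Either way it only ever holds one kmap entry at a time, which
is the point.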