On Wed 06-11-24 19:55:23, Asahi Lina wrote: > On 11/5/24 7:16 AM, Dave Chinner wrote: > > On Tue, Nov 05, 2024 at 12:31:22AM +0900, Asahi Lina wrote: > > Unfortunately, the DAX infrastructure is independent of the page > > cache but is also tightly tied to PAGE_SIZE based inode->i_mapping > > index granularity. In a way, this is even more fundamental than the > > page cache issues we had to solve. That's because we don't have > > folios with their own locks and size tracking. In DAX, we use the > > inode->i_mapping xarray entry for a given file offset to -serialise > > access to the backing pfn- via lock bits held in the xarray entry. > > We also encode the size of the dax entry in bits held in the xarray > > entry. > > > > The filesystem needs to track dirty state with filesystem block > > granularity. Operations on filesystem blocks (e.g. partial writes, > > page faults) need to be co-ordinated across the entire filesystem > > block. This means we have to be able to lock a single filesystem > > block whilst we are doing instantiation, sub-block zeroing, etc. > > Ah, so it's about locking? I had a feeling that might be the case... Yes. About locking and general state tracking. > > Large folio support in the page cache provided this "single tracking > > object for a > PAGE_SIZE range" support needed to allow fsb > > > page_size in filesystems. The large folio spans the entire > > filesystem block, providing a single serialisation and state > > tracking for all the page cache operations needing to be done on > > that filesystem block. > > > > The DAX infrastructure needs the same changes for fsb > page size > > support. We have a limited number bits we can use for DAX entry > > state: > > > > /* > > * DAX pagecache entries use XArray value entries so they can't be mistaken > > * for pages. We use one bit for locking, one bit for the entry size (PMD) > > * and two more to tell us if the entry is a zero page or an empty entry that > > * is just used for locking. In total four special bits. > > * > > * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE > > * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem > > * block allocation. > > */ > > #define DAX_SHIFT (4) > > #define DAX_LOCKED (1UL << 0) > > #define DAX_PMD (1UL << 1) > > #define DAX_ZERO_PAGE (1UL << 2) > > #define DAX_EMPTY (1UL << 3) > > > > I *think* that we have at most PAGE_SHIFT worth of bits we can > > use because we only store the pfn part of the pfn_t in the dax > > entry. There are PAGE_SHIFT high bits in the pfn_t that hold > > pfn state that we mask out. > > > > Hence I think we can easily steal another 3 bits for storing an > > order - orders 0-4 are needed (3 bits) for up to 64kB on 4kB > > PAGE_SIZE - so I think this is a solvable problem. There's a lot > > more to it than "just use several pages to map to a single > > filesystem block", though..... > > Honestly, this is all quite over my head... my use case is virtiofs, > which I think is quite different to running a filesystem on bare-metal > DAX. It's starting to sound like we should perhaps just gate off the > check for virtiofs only? Yes, my point was that the solution should better be virtiofs specific because for anybody else blocksize > PAGE_SIZE is going to fail spectacularly. > >>> If virtiofs can actually map 4k subpages out of 16k page on > >>> host (and generally perform 4k granular tracking etc.), it would seem more > >>> appropriate if virtiofs actually exposed the filesystem 4k block size instead > >>> of 16k blocksize? Or am I missing something? > >> > >> virtiofs itself on the guest does 2MiB mappings into the SHM region, and > >> then the guest is free to map blocks out of those mappings. So as long > >> as the guest page size is less than 2MiB, it doesn't matter, since all > >> files will be aligned in physical memory to that block size. It behaves > >> as if the filesystem block size is 2MiB from the point of view of the > >> guest regardless of the actual block size. For example, if the host page > >> size is 16K, the guest will request a 2MiB mapping of a file, which the > >> VMM will satisfy by mmapping 128 16K pages from its page cache (at > >> arbitrary physical memory addresses) into guest "physical" memory as one > >> contiguous block. Then the guest will see the whole 2MiB mapping as > >> contiguous, even though it isn't in physical RAM, and it can use any > >> page granularity it wants (that is supported by the architecture) to map > >> it to a userland process. > > > > Clearly I'm missing something important because, from this > > description, I honestly don't know which mapping is actually using > > DAX. > > > > Can you draw out the virtofs stack from userspace in the guest down > > to storage in the host so dumb people like myself know exactly where > > what is being directly accessed and how? > > I'm not familiar with all of the details, but essentially virtiofs is > FUSE backed by a virtio device instead of userspace, plus the extra DAX > mapping stuff. Since it's not a real filesystem backed by a block > device, it has no significant concept of block size itself. i_blkbits > comes from the st_blksize of the inode stat, which in our case is passed > through from the underlying filesystem backing the virtiofs in the host > (but it could be anything, nothing says virtiofs has to be backed by a > real kernel FS in the host). > > So as a baseline, virtiofs is just FUSE and block size doesn't matter > since all the non-mmap filesystem APIs shouldn't care about block size > (other than for optimization reasons and issues with torn racy writes). > The guest should be able to pretend the block size is 4K for FS/VM > purposes even if it's 16K in the host, and track everything in the page > cache and DAX infrastructure in terms of 4K blocks. As far as I know > there is no operation in plain FUSE that actually cares about the block > size itself. > > So then there's DAX/mmap. When DAX is enabled, FUSE can issue > FUSE_SETUPMAPPING and FUSE_REMOVEMAPPING opcodes. These request that a > range of a file be mapped into the device memory region used by > virtiofs. When the VMM receives those, it will use mmap to map the > backing file into the guest's virtio device memory window, and then the > guest can use DAX to directly access those pages and allow userspace > processes to mmap them directly. This means that mmaps are coherent > between processes on the guest and the host (or in another guest), which > is the main reason we're doing this. > > If you look at fs/fuse/dax.c, you'll see that FUSE_DAX_SHIFT is 21. This > means that the FUSE code only ever issues > FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING opcodes with offsets/lengths at > 2MiB granularity within files. So, regardless of the underlying > filesystem block size in the host (if there is one at all), the guest > will always see aligned 2MiB blocks of files available in its virtio > device region, similar to the hypothetical case of an actual > block-backed DAX filesystem with a 2MiB allocation block size. OK, I've read fs/fuse/dax.c and I agree that virtiofs works as if its block size would be 2MB because it not only maps space with this granularity but also allocates from underlying storage, tracks extent state etc. with this granularity. > We could cap st_blksize in the VMM to 4K, I guess? I don't know if that > would make more sense than removing the kernel check. On one hand, that > might result in less optimized I/O if userspace then does 4K writes. On > the other hand, if we report st_blksize as 16K to userspace then I guess > it could assume concurrent 16K writes cannot be torn, which is not the > case if the guest is using 4K pages and page cache blocks (at least not > until all the folio stuff is worked out for blocks > page size in both > the page cache and DAX layers). Yes, I think capping sb->s_blocksize at PAGE_SIZE for virtiofs will be the cleanest solution for now. > This WARN still feels like the wrong thing, though. Right now it is the > only thing in DAX code complaining on a page size/block size mismatch > (at least for virtiofs). If this is so important, I feel like there > should be a higher level check elsewhere, like something happening at > mount time or on file open. It should actually cause the operations to > fail cleanly. That's a fair point. Currently filesystems supporting DAX check for this in their mount code because there isn't really a DAX code that would get called during mount and would have enough information to perform the check. I'm not sure adding a new call just for this check makes a lot of sense. But if you have some good place in mind, please tell me. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR