On Sat, Feb 23, 2019 at 10:30:38AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> > > <darrick.wong@xxxxxxxxxx> wrote:
> > > >
> > > > Hi all!
> > > >
> > > > Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> > > > on pmem, and they've observed that one has to do a fair amount of
> > > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > > > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > > > so the PMD mappings are much more efficient.
>
> Are you really saying that "mkfs.xfs -d su=2MB,sw=1 <dev>" is
> considered "too much legwork" to set up the filesystem for DAX and
> PMD alignment?

Yes.  I mean... userspace /can/ figure out the page sizes on arm64 &
ppc64le (or extract them from sysfs), but why not just advertise it as
an io hint on the pmem "block" device?

Hmm, now that I've watched various xfstests blow up because they don't
expect blocks to be larger than 64k, maybe I'll rethink this as a
default behavior. :)

> > > > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > > > set up all the parameters automatically.  Below is my ham-handed attempt
> > > > to teach the kernel to do this.
>
> Still need extent size hints so that writes that are smaller than
> the PMD size are allocated correctly aligned and sized to map to
> PMDs...

I think we're generally planning to use the RT device, where we can
make 2M alignment mandatory, so for the data device the effectiveness
of the extent size hint doesn't really matter.

> > > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > > >
> > > > --D
> > > >
> > > > ---
> > > > Configure pmem devices to advertise the default page alignment when said
> > > > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > > > hints to try to create aligned file extents, which makes it much easier
> > > > for mmaps to take advantage of huge page table entries.
> > > >
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > ---
> > > >  drivers/nvdimm/pmem.c |    5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > index bc2f700feef8..3eeb9dd117d5 100644
> > > > --- a/drivers/nvdimm/pmem.c
> > > > +++ b/drivers/nvdimm/pmem.c
> > > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > > >  	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > > >  	blk_queue_max_hw_sectors(q, UINT_MAX);
> > > >  	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > > -	if (pmem->pfn_flags & PFN_MAP)
> > > > +	if (pmem->pfn_flags & PFN_MAP) {
> > > >  		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > > +		blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > > +		blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > >
> > > The device alignment might sometimes be bigger than this default.
> > > Would there be any detrimental effects for filesystems if io_min and
> > > io_opt were set to 1GB?
> >
> > Hmmm, that's going to be a struggle on ext4 and the xfs data device
> > because we'd be preferentially skipping the 1023.8MB immediately after
> > each allocation group's metadata.  It already does this now with a 2MB
> > io hint, but losing 1.8MB here and there isn't so bad.
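(To spell out where those two numbers come from: the skipped space is
roughly the gap between the end of each AG's metadata and the next
aligned boundary.  Assuming ~0.2MB of AG headers at the front of each
AG -- an illustrative figure, the real amount depends on the fs
geometry -- that works out to:

    2MB io hint:     2MB - 0.2MB =    1.8MB skipped per AG
    1GB io hint:  1024MB - 0.2MB = 1023.8MB skipped per AG

so on a filesystem with, say, 4 AGs that's ~7MB of preferentially
skipped space with a 2MB hint vs. ~4GB with a 1GB hint.)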
> >
> > We'd have to study it further, though; filesystems historically have
> > interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
> > think very many people set up 1GB raid stripe units.
>
> Setting sunit=1GB is really going to cause havoc with things like
> inode chunk allocation alignment, and the first write() will either
> have to be >=1GB or use 1GB extent size hints to trigger alignment.
> And, AFAICT, it will prevent us from doing 2MB alignment on other
> files, even with 2MB extent size hints set.
>
> IOWs, I don't think 1GB alignment is a good idea as a default.

<nods>

> > (I doubt very many people have done 2M raid stripes either, but it seems
> > to work easily where we've tried it...)
>
> That's been pretty common with stacked hardware raid for as long as
> I've worked on XFS.  e.g. a software RAID0 stripe of hardware RAID5/6
> luns was pretty common with large storage arrays in HPC environments
> (i.e. huge streaming read/write bandwidth).  In these cases, XFS was
> set up with the RAID5/6 lun width as the stripe unit (commonly 2MB
> with 8+1 and 256k raid chunk size), and the RAID0 width as the stripe
> width (commonly 8-16 wide, spread across 8-16 FC ports w/ multipath),
> and it wasn't uncommon to see widths in the 16-32MB range.
>
> This aligned the filesystem to the underlying RAID5/6 luns, and
> allows stripe width IO to be aligned and hit every RAID5/6 lun
> evenly.  Ensuring applications could do this easily with large direct
> IO reads and writes is where the swalloc and largeio mount
> options come into their own....

<nod>

> > > I'm thinking an xfs-realtime configuration might be able to support
> > > 1GB mappings in the future.
> >
> > The xfs realtime device ought to be able to support 1g alignment pretty
> > easily though. :)
>
> Yup, but I think that's the maximum "block" size it can support and

It is; our users with 16G page size are out of luck.

> DAX will have some serious long tail latency and CPU usage issues at
> allocation time because each new 1GB "block" that is dynamically
> allocated will have to be completely zeroed during the allocation
> inside the page fault handler.....

Agreed, 1G pages are most probably too unwieldy to be worth advertising.

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
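P.S. For anyone trying to reproduce the setup under discussion, this is
roughly the manual legwork being complained about; the device, mount
point, and file names below are made up for illustration:

    # today: set up 2MB alignment by hand at mkfs time
    mkfs.xfs -d su=2MB,sw=1 /dev/pmem0
    mount -o dax /dev/pmem0 /mnt

    # per Dave's point, add a 2MB extent size hint so that writes
    # smaller than a PMD still get PMD-aligned, PMD-sized extents
    xfs_io -f -c "extsize 2m" /mnt/testfile

    # with the patch above, the io hints would show up here, and
    # mkfs.xfs would derive sunit/swidth from them automatically
    cat /sys/block/pmem0/queue/minimum_io_size
    cat /sys/block/pmem0/queue/optimal_io_size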