On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
>
> There have been efforts over the last 16 years to enable Large Block
> Sizes (LBS), that is, block sizes in filesystems where bs > page size
> [1] [2]. Through these efforts we have learned that one of the main
> blockers to supporting bs > ps in filesystems has been a way to
> allocate pages that are at least the filesystem block size in the page
> cache where bs > ps [3]. Another blocker was the changes needed in
> filesystems due to buffer-heads. Thanks to these previous efforts, the
> surgery by Matthew Wilcox in the page cache to adopt xarray's
> multi-index support, and iomap support, supporting bs > ps in XFS is
> now possible with only a few lines of change to XFS. Most of the
> changes are to the page cache to support a minimum folio order for the
> target block size on the filesystem.
>
> A new motivation for LBS today is to support high-capacity (many
> terabytes) QLC SSDs where the internal Indirection Unit (IU) is
> typically greater than 4k [4], to help reduce DRAM and in turn cost
> and space. In practice this allows different architectures to use a
> base page size of 4k while still supporting block sizes aligned to the
> larger IUs by relying on high-order folios in the page cache when
> needed. It also makes it possible to take advantage of these same
> drives' support for atomics larger than 4k with buffered IO support in
> Linux. As described this year at LSFMM, supporting atomics greater
> than 4k enables databases to remove the need to rely on their own
> journaling, so they can disable double buffered writes [5], a feature
> different cloud providers are already innovating on and enabling for
> customers through custom storage solutions.
>
> This series still needs some polishing and fixing of some crashes, but
> it is mainly targeted at getting initial feedback from the community
> and enabling initial experimentation, hence the RFC. It is being
> posted now because the results from our testing are proving much
> better than expected, and we hope to polish this up together with the
> community. After all, this has been a 16-year effort and none of this
> would have been possible without it.
>
> Implementation:
>
> This series only adds the notion of a minimum order of a folio in the
> page cache, which was initially proposed by Willy. The minimum folio
> order requirement is set during inode creation. The minimum order will
> typically correspond to the filesystem block size. The page cache will
> in turn respect the minimum folio order requirement while allocating a
> folio. This series mainly changes the page cache's filemap, readahead,
> and truncation code to allocate and align the folios to the minimum
> order set for the filesystem inode's address space mapping.
>
> Only XFS was enabled and tested as part of this series, as it has
> supported block sizes up to 64k and sector sizes up to 32k for years.
> The only thing missing was the page cache magic to enable bs > ps.
> However, any filesystem that doesn't depend on buffer-heads and
> already supports larger block sizes should be able to leverage this
> effort to also support LBS, i.e. bs > ps.
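As a rough illustration of the mechanism described above (not taken
from the series itself), the per-inode hookup could look something
like the sketch below. mapping_set_folio_min_order() and the call site
near xfs_setup_inode() are assumptions on my part; the RFC's actual
helper name and placement may differ.

	/*
	 * Sketch only, not the series' actual diff: derive the minimum
	 * folio order from the filesystem block size at inode setup time
	 * and hand it to the page cache.  The helper name follows the
	 * mapping_set_folio_min_order() interface; the RFC may spell it
	 * differently.
	 */
	static void example_xfs_set_min_folio_order(struct xfs_inode *ip)
	{
		struct xfs_mount	*mp = ip->i_mount;
		struct address_space	*mapping = VFS_I(ip)->i_mapping;
		unsigned int		order = 0;

		/* bs > ps: every folio in this mapping must span a whole block */
		if (mp->m_sb.sb_blocklog > PAGE_SHIFT)
			order = mp->m_sb.sb_blocklog - PAGE_SHIFT;

		mapping_set_folio_min_order(mapping, order);
	}

With something like this in place, filemap, readahead and truncation
would then only ever see folios of at least 1 << order pages for that
mapping.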
> This also paves the way for supporting block devices whose logical
> block size > page size in the future, by leveraging the iomap address
> space operations added to the block device cache by Christoph Hellwig
> [6]. We have work in progress to enable support for this, enabling
> LBAs > 4k on NVMe, while at the same time allowing coexistence with
> buffer-heads on the same block device, so that a drive can switch
> between filesystems which may depend on buffer-heads or need the iomap
> address space operations for the block device cache. Patches for this
> will be posted shortly after this patch series.

Do you have a git tree branch that I can pull this from somewhere?

As it is, I'd really prefer stuff that adds significant XFS
functionality that we need to test to be based on a current Linus TOT
kernel so that we can test it without being impacted by all the random
unrelated breakages that regularly happen in linux-next kernels....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx