Re: [LSF/MM/BPF TOPIC] extsize and forcealign design in filesystems for atomic writes

On Thu, Jan 30, 2025 at 02:08:30PM +0000, John Garry wrote:
> On 29/01/2025 16:06, Ojaswin Mujoo wrote:
> > On Wed, Jan 29, 2025 at 08:59:15AM +0000, John Garry wrote:
> > > On 29/01/2025 07:06, Ojaswin Mujoo wrote:
> > > 
> > > Hi Ojaswin,
> > > 
> > > > 
> > > > I would like to submit a proposal to discuss the design of extsize and
> > > > forcealign and various open questions around it.
> > > > 
> > > >    ** Background **
> > > > 
> > > > Modern NVMe/SCSI disks with atomic write capabilities can allow writes to a
> > > > multi-KB range on disk to go atomically. This feature has a wide variety of use
> > > > cases especially for databases like mysql and postgres that can leverage atomic
> > > > writes to gain significant performance. However, in order to enable atomic
> > > > writes on Linux, the underlying disk may have some size and alignment
> > > > constraints that the upper layers like filesystems should follow. extsize with
> > > > forcealign is one of the ways filesystems can make sure the IO submitted to the
> > > > disk adheres to the atomic writes constraints.
> > > > 
> > > > extsize is a hint to the FS to allocate extents at a certain logical alignment
> > > > and size. forcealign builds on this by forcing the allocator to enforce the
> > > > alignment guarantees for physical blocks as well, which is essential for atomic
> > > > writes.
> > > > 
> > > >    ** Points of discussion **
> > > > 
> > > > Extsize hints feature is already supported by XFS [1] with forcealign still
> > > > under development and discussion [2].
> > > 
> > > From
> > > https://lore.kernel.org/linux-xfs/20241212013433.GC6678@frogsfrogsfrogs/ 
> > > thread, the alternate solution to forcealign for XFS is to use a
> > > software-emulated fallback for unaligned atomic writes. I am looking at a
> > > PoC implementation now. Note that this does rely on CoW.
> > > 
> > > There has been push back on forcealign for XFS, so we need to prove/disprove
> > > that this software-emulated fallback can work, see
> > > https://lore.kernel.org/linux-xfs/20240924061719.GA11211@xxxxxx/ 
> > > 
> > 
> > Hey John,
> > 
> > Thanks for taking a look. I did go through the 2 series sometime back.
> > I agree that there are some open challenges in getting the multi block
> > atomic write interface correct especially for mixed mappings and this is
> > one of the main reasons we want to explore the exchange_range fallback
> > in case blocks are not aligned.
> 
> Right, so for XFS I am looking at a CoW-based fallback for unaligned/mixed
> mapping atomic writes. I have no idea on how this could work for ext4.
> 
> > 
> > That being said, I believe forcealign as a feature still holds a lot
> > of relevance as:
> > 
> > 1. Right now, it is the only way to guarantee aligned blocks and hence
> >     guarantee that our atomic writes can always benefit from hardware atomic
> >     write support. IIUC DBs are not very keen on losing out on performance
> >     due to some writes going via the software fallback path.
> 
> Sure, we need performance figures for this first.
> 
> > 
> > 2. Not all FSes support COW (major example being ext4) and hence it will
> >     be very difficult to have a software fallback in case the blocks are
> >     not aligned.
> 
> Understood
> 
> > 
> > 3. As pointed out in [1], even with exchange_range there is still value
> >     in having forcealign to find the new blocks to be exchanged.
> 
> Yeah, again, we need performance figures.
> 
> For my test case, I am trying 16K atomic writes with 4K FS block size, so I
> expect the software fallback to not kick in often after running the system
> for a while (as eventually we will get aligned allocations). I am
> concerned about the prospect of heavily fragmented files, though.

Yes, that's true. If the FS is up long enough, there is bound to be
fragmentation eventually, which might make it harder for extsize to
get aligned blocks.
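
To make the alignment condition concrete, here is a minimal Python sketch of
when a write could go straight to the hardware and when it would need a
fallback. This is only my own simplified model and not kernel code: the
Extent tuple, the can_use_hw_atomic() name and the exact rules (power-of-two
length, naturally aligned offset, one extent covering the whole range,
physical start aligned to the write size) are assumptions for illustration.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    """One file mapping in FS blocks (hypothetical model, not a kernel struct)."""
    logical: int   # first logical block in the file
    physical: int  # first physical block on disk
    length: int    # number of blocks in the extent


def can_use_hw_atomic(extents: List[Extent], offset: int, length: int,
                      block_size: int = 4096) -> bool:
    """Return True if bytes [offset, offset + length) could be issued as a
    single hardware atomic write under the assumed rules in the lead-in."""
    # Assumed rule: the write length is a power of two.
    if length <= 0 or length & (length - 1):
        return False
    # Assumed rule: whole blocks only, and the offset is naturally aligned.
    if length % block_size or offset % length:
        return False

    first = offset // block_size
    nblocks = length // block_size
    for ext in extents:
        # The whole range must sit inside one extent.
        if ext.logical <= first and first + nblocks <= ext.logical + ext.length:
            phys = ext.physical + (first - ext.logical)
            # The physical start must be aligned to the write size as well.
            return phys % nblocks == 0
    # The range spans several extents or a hole, so it is a mixed mapping
    # and would need some software fallback.
    return False
```

With 4K blocks and 16K writes, a file made of one extent starting at an
aligned physical block passes this check for every aligned 16K range, while
a file fragmented into smaller or misaligned extents fails it. That is the
case where forcealign or a fallback would matter.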

With software fallback, there's again the point that many FSes will need
some sort of COW/exchange_range support before they can support anything
like that. 

I've not looked at what it will take to add that to ext4, but I'm
assuming it will not be trivial at all.

> 
> > 
> > I agree that forcealign is not the only way we can have atomic writes
> > work but I do feel there is value in having forcealign for FSes and
> > hence we should have a discussion around it so we can get the interface
> > right.
> > 
> 
> I thought that the interface for forcealign according to the candidate xfs
> implementation was quite straightforward. no?

As mentioned in the original proposal, there are still a few open problems
around extsize and forcealign.

- The allocation and deallocation semantics are not completely clear to
	me. For example, we allow operations like unaligned punch_hole but not
	unaligned insert and collapse range, and I couldn't see that
	documented anywhere.

- There are challenges in extsize with delayed allocation as well as how
	the tooling should handle forcealigned inodes. 

- How are FSes supposed to behave when forcealign/extsize is used with
	other FS features that change the allocation granularity like bigalloc
	or rtvol.
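
On the first point, one plausible way to look at the punch_hole vs
collapse range difference is sketched below. This is just a toy model of my
own and not what any FS implements: extents are (logical, physical, length)
tuples in blocks, and keeps_alignment() assumes the invariant we care about
is that logical and physical block numbers stay congruent modulo the
alignment. Real implementations may treat the unaligned edges of a hole
differently.

```python
from typing import List, Tuple

# (logical_block, physical_block, length_in_blocks); a hypothetical model.
Extent = Tuple[int, int, int]


def keeps_alignment(extents: List[Extent], align: int) -> bool:
    """Assumed invariant: logical and physical block numbers are congruent
    modulo align, so aligned logical ranges map to aligned physical ones."""
    return all((logical - physical) % align == 0
               for logical, physical, _ in extents)


def punch_hole(extents: List[Extent], start: int, count: int) -> List[Extent]:
    """Drop blocks [start, start + count). Surviving blocks keep both their
    logical and their physical block numbers."""
    end = start + count
    out = []
    for logical, physical, length in extents:
        if logical < start:
            # Part of the extent that lies before the hole.
            out.append((logical, physical, min(length, start - logical)))
        if logical + length > end:
            # Part of the extent that lies after the hole.
            skip = max(0, end - logical)
            out.append((logical + skip, physical + skip, length - skip))
    return out


def collapse_range(extents: List[Extent], start: int, count: int) -> List[Extent]:
    """Remove blocks [start, start + count) and shift every later block down
    by count logical blocks. Physical block numbers do not change."""
    end = start + count
    return [(logical - count, physical, length) if logical >= end
            else (logical, physical, length)
            for logical, physical, length in punch_hole(extents, start, count)]
```

In this model an unaligned punch_hole never moves the blocks that remain, so
the invariant holds afterwards. A collapse range whose length is not a
multiple of the alignment shifts the logical offsets of everything behind it
and breaks the invariant, while a collapse by a whole multiple keeps it.
Insert range would behave like collapse range in the opposite direction.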

I agree that XFS's implementation is a good reference, but I'm
sure that as I continue working on the same from the ext4 perspective we
will have more points of discussion. So I definitely feel that it's worth
discussing this at LSFMM.

> 
> What was not clear was the age-old issue of how to issue an atomic write of
> mixed extents, which is really an atomic write issue.

Right, btw are you planning any talk for atomic writes at LSFMM?

Regards,
ojaswin

> 
> > Just to be clear, the intention of this proposal is to mainly discuss
> > forcealign as a feature. I am hoping there would be another different
> > proposal to discuss atomic writes and the plethora of other open
> > challenges there ;)
> 
> Thanks,
> John




