Dave Chinner <david@xxxxxxxxxxxxx> writes: > On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote: >> Dave Chinner <david@xxxxxxxxxxxxx> writes: >> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote: >> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation >> >> (by using extsize hint) and de-allocation to happen _only_ in >> >> extsize chunks. >> >> >> >> i.e. forcealign mandates - >> >> - the logical and physical start offset should be aligned as >> >> per args->alignment >> >> - extent length be aligned as per args->prod/mod. >> >> If above two cannot be satisfied then return -ENOSPC. >> > >> > Yes. >> > >> >> >> >> - Does the unmapping of extents also only happens in extsize >> >> chunks (with forcealign)? >> > >> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code >> > aligning the fsbno ranges to be unmapped. >> > >> > Remember, force align requires both logical file offset and >> > physical block number to be correctly aligned, >> >> This is where I would like to double confirm it again. Even the >> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned >> physical start and logical start file offset and length right? > > No. > >> (Or does extsize hint only restricts alignment to logical start file >> offset + length and not the physical start?) > > Neither. > Yes, thanks for the correction. Indeed extsize hint does not take care of the physical start alignment at all. > extsize hint by itself (i.e. existing behaviour) has no alignment > effect at all. All it affects is -size- of the extent. i.e. once > the extent start is chosen, extent size hints will trim the length > of the extent to a multiple of the extent size hint. Alignment is > not considered at all. > Please correct me I wrong here... but XFS considers aligning the logical start and the length of the allocated extent (for extsize) as per below code right? i.e. 1) xfs_direct_write_iomap_begin() { <...> if (offset + length > XFS_ISIZE(ip)) end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb); => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align); return aligned_end_fsb } 2) xfs_bmap_compute_alignments() { <...> else if (ap->datatype & XFS_ALLOC_USERDATA) align = xfs_get_extsz_hint(ap->ip); if (align) { if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0, ap->eof, 0, ap->conv, &ap->offset, &ap->length)) ASSERT(0); ASSERT(ap->length); args->prod = align; div_u64_rem(ap->offset, args->prod, &args->mod); if (args->mod) args->mod = args->prod - args->mod; } <...> } So args->prod and args->mod... aren't they use to align the logical start and the length of the extent? However, I do notice that when the file is closed XFS trims the length allocated beyond EOF boundary (for extsize but not for forcealign from the new forcealign series) i.e. xfs_file_release() -> xfs_release() -> xfs_free_eofblocks() I guess that is because xfs_can_free_eofblocks() does not consider alignment for extsize in this function xfs_can_free_eofblocks() { <...> end_fsb = xfs_inode_roundup_alloc_unit(ip, XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip))); <...> } >> Also it looks like there is no difference with ATOMIC_WRITE AND >> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is >> adding additional natural alignment restrictions on pos and len). > > Atomic write requires additional hardware support, and it restricts > the valid sizes of extent size hints that can be set. Only atomic > writes can be done on files marked as configured for atomic writes; > force alignment can be done on any file... > >> So why maintain 2 separate on disk inode flags for FORCEALIGN AND >> ATOMIC_WRITE? > > the atomic write flag indicates that a file has been set up > correctly for atomic writes to be able to issues reliably. force > alignment doesn't guarantee that - it's just a mechanism that tells > the allocator to behave a specific way. > >> - Do you foresee FORCEALIGN to be also used at other places w/o >> ATOMIC_WRITE where feature differentiation between the two on an >> inode is required? > > The already exist. For example, reliably allocating huge page > mappings on DAX filesystems requires 2MB forced alignment. > >> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN >> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too? > > Same as above. > >> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata >> changes within XFS filesystem to support atomic writes, right? > > Because if you downgrade the kernel to something that doesn't > support atomic writes, then non-atomic sized/aligned data can be > written to the file and/or torn writes can occur. > > Worse, extent size hints that don't match the underlying hardware > support could be set up for inodes, and when the kernel is upgraded > again then atomic writes will fail on inodes that have atomic write > flags set on them.... > >> Is it something to just prevent users from destroying their own data >> by not allowing a rw mount from an older kernel where users could do >> unaligned writes to files marked for atomic writes? >> Or is there any other reasoning to prevent XFS filesystem from becoming >> inconsistent if an older kernel does a rw mount here. > > The older kernel does not know what the unknown inode flag means > (i.e. atomic writes) and so, by definition, we cannot allow it to > modify metadata or file data because it may not modify it in the > correct way for that flag being set on the inode. > > Kernels that don't understand feature flags need to treat the > filesystem as read-only, no matter how trivial the feature addition > might seem. > >> > so unmap alignment >> > has to be set up correctly at file offset level before we even know >> > what extents underly the file range we need to unmap.... >> > >> >> If the start or end of the extent which needs unmapping is >> >> unaligned then we convert that extent to unwritten and skip, >> >> is it? (__xfs_bunmapi()) >> > >> > The high level code should be aligning the start and end of the >> > file range to be removed via xfs_inode_alloc_unitsize(). Hence >> > the low level __xfs_bunmapi() code shouldn't ever be encountering >> > unaligned unmaps on force-aligned inodes. >> > >> >> Yes, but isn't this code snippet trying to handle a case when it finds an >> unaligned extent during unmap? > > It does exactly that. > >> And what we are essentially trying to >> do here is leave the unwritten extent as is and if the found extent is >> written then convert to unwritten and skip it (goto nodelete). This >> means with forcealign if we encounter an unaligned extent then the file >> will have this space reserved as is with extent marked unwritten. >> >> Is this understanding correct? > > Yes, except for the fact that this code is not triggered by force > alignment. > > This code is preexisting realtime file functionality. It is used > when the rtextent size is larger than a single filesytem block. > > In these configurations, we do allocation in rtextsize units, but we > track written/unwritten extents in the BMBT on filesystem block > granularity. Hence we can have unaligned written/unwritten extent > boundaries, and if we aren't unmapping a whole rtextent then we > simply mark the unused portion of it as unwritten in the BMBT to > indicate it contains zeroes. > > IOWs, existing RT files have different IO alignment and > written/unwritten extent conversion behaviour to the new forced > alignment feature. The implementation code is shared in many places, > but the higher level forced alignment functionality ensures there > are never unaligned unwritten extents created or unaligned > unmappings asked for. Hence this code does not trigger for the > forced alignment cases. > > i.e. we have multiple seperate allocation alignment behaviours that > share an implementation, but they do not all behave exactly the same > way or provide the same alignment guarantees.... > Thanks for taking time and explaining this. -ritesh