Re: [PATCH v4 00/14] forcealign for xfs

Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> · Tue, 10 Sep 2024 08:21:55 +0530

Dave Chinner <david@xxxxxxxxxxxxx> writes:

> On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
>> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> >> (by using extsize hint) and de-allocation to happen _only_ in
>> >> extsize chunks.
>> >>
>> >>    i.e. forcealign mandates -
>> >>    - the logical and physical start offset should be aligned as
>> >>    per args->alignment
>> >>    - extent length be aligned as per args->prod/mod.
>> >>      If above two cannot be satisfied then return -ENOSPC.
>> >
>> > Yes.
>> >
>> >> 
>> >>    - Does the unmapping of extents also only happens in extsize
>> >>    chunks (with forcealign)?
>> >
>> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
>> > aligning the fsbno ranges to be unmapped.
>> >
>> > Remember, force align requires both logical file offset and
>> > physical block number to be correctly aligned,
>> 
>> This is where I would like to double confirm it again. Even the
>> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
>> physical start and logical start file offset and length right?
>
> No.
>
>> (Or does extsize hint only restricts alignment to logical start file
>> offset + length and not the physical start?)
>
> Neither.
>

Yes, thanks for the correction. Indeed extsize hint does not take care
of the physical start alignment at all.

> extsize hint by itself (i.e. existing behaviour) has no alignment
> effect at all. All it affects is -size- of the extent. i.e. once
> the extent start is chosen, extent size hints will trim the length
> of the extent to a multiple of the extent size hint. Alignment is
> not considered at all.
>

Please correct me I wrong here... but XFS considers aligning the logical
start and the length of the allocated extent (for extsize) as per below
code right? 

i.e.
1) xfs_direct_write_iomap_begin()
{
    <...>
    if (offset + length > XFS_ISIZE(ip))
		end_fsb = xfs_iomap_eof_align_last_fsb(ip, end_fsb);
                  => xfs_fileoff_t aligned_end_fsb = roundup_64(end_fsb, align);
                     return aligned_end_fsb
}

2) xfs_bmap_compute_alignments()
{
    <...>
    	else if (ap->datatype & XFS_ALLOC_USERDATA)
		     align = xfs_get_extsz_hint(ap->ip);

        if (align) {
            if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
                        ap->eof, 0, ap->conv, &ap->offset,
                        &ap->length))
                ASSERT(0);
            ASSERT(ap->length);

            args->prod = align;
            div_u64_rem(ap->offset, args->prod, &args->mod);
            if (args->mod)
                args->mod = args->prod - args->mod;
        }
        <...>
}

So args->prod and args->mod... aren't they use to align the logical
start and the length of the extent?

However, I do notice that when the file is closed XFS trims the length
allocated beyond EOF boundary (for extsize but not for forcealign from
the new forcealign series) i.e.

xfs_file_release() -> xfs_release() -> xfs_free_eofblocks()

I guess that is because xfs_can_free_eofblocks() does not consider
alignment for extsize in this function 

xfs_can_free_eofblocks()
{
<...>
	end_fsb = xfs_inode_roundup_alloc_unit(ip,
			XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip)));
<...>
}

>> Also it looks like there is no difference with ATOMIC_WRITE AND
>> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
>> adding additional natural alignment restrictions on pos and len). 
>
> Atomic write requires additional hardware support, and it restricts
> the valid sizes of extent size hints that can be set. Only atomic
> writes can be done on files marked as configured for atomic writes;
> force alignment can be done on any file...
>
>> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
>> ATOMIC_WRITE?
>
> the atomic write flag indicates that a file has been set up
> correctly for atomic writes to be able to issues reliably. force
> alignment doesn't guarantee that - it's just a mechanism that tells
> the allocator to behave a specific way.
>
>> - Do you foresee FORCEALIGN to be also used at other places w/o
>> ATOMIC_WRITE where feature differentiation between the two on an
>> inode is required?
>
> The already exist. For example, reliably allocating huge page
> mappings on DAX filesystems requires 2MB forced alignment. 
>
>> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
>> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?
>
> Same as above.
>
>> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
>> changes within XFS filesystem to support atomic writes, right? 
>
> Because if you downgrade the kernel to something that doesn't
> support atomic writes, then non-atomic sized/aligned data can be
> written to the file and/or torn writes can occur.
>
> Worse, extent size hints that don't match the underlying hardware
> support could be set up for inodes, and when the kernel is upgraded
> again then atomic writes will fail on inodes that have atomic write
> flags set on them....
>
>> Is it something to just prevent users from destroying their own data
>> by not allowing a rw mount from an older kernel where users could do
>> unaligned writes to files marked for atomic writes?
>> Or is there any other reasoning to prevent XFS filesystem from becoming
>> inconsistent if an older kernel does a rw mount here.
>
> The older kernel does not know what the unknown inode flag means
> (i.e. atomic writes) and so, by definition, we cannot allow it to
> modify metadata or file data because it may not modify it in the
> correct way for that flag being set on the inode.
>
> Kernels that don't understand feature flags need to treat the
> filesystem as read-only, no matter how trivial the feature addition
> might seem.
>
>> > so unmap alignment
>> > has to be set up correctly at file offset level before we even know
>> > what extents underly the file range we need to unmap....
>> >
>> >>      If the start or end of the extent which needs unmapping is
>> >>      unaligned then we convert that extent to unwritten and skip,
>> >>      is it? (__xfs_bunmapi())
>> >
>> > The high level code should be aligning the start and end of the
>> > file range to be removed via xfs_inode_alloc_unitsize(). Hence 
>> > the low level __xfs_bunmapi() code shouldn't ever be encountering
>> > unaligned unmaps on force-aligned inodes.
>> >
>> 
>> Yes, but isn't this code snippet trying to handle a case when it finds an
>> unaligned extent during unmap?
>
> It does exactly that.
>
>> And what we are essentially trying to 
>> do here is leave the unwritten extent as is and if the found extent is
>> written then convert to unwritten and skip it (goto nodelete). This
>> means with forcealign if we encounter an unaligned extent then the file
>> will have this space reserved as is with extent marked unwritten. 
>> 
>> Is this understanding correct?
>
> Yes, except for the fact that this code is not triggered by force
> alignment.
>
> This code is preexisting realtime file functionality. It is used
> when the rtextent size is larger than a single filesytem block.
>
> In these configurations, we do allocation in rtextsize units, but we
> track written/unwritten extents in the BMBT on filesystem block
> granularity. Hence we can have unaligned written/unwritten extent
> boundaries, and if we aren't unmapping a whole rtextent then we
> simply mark the unused portion of it as unwritten in the BMBT to
> indicate it contains zeroes.
>
> IOWs, existing RT files have different IO alignment and
> written/unwritten extent conversion behaviour to the new forced
> alignment feature. The implementation code is shared in many places,
> but the higher level forced alignment functionality ensures there
> are never unaligned unwritten extents created or unaligned
> unmappings asked for. Hence this code does not trigger for the
> forced alignment cases.
>
> i.e. we have multiple seperate allocation alignment behaviours that
> share an implementation, but they do not all behave exactly the same
> way or provide the same alignment guarantees....
>

Thanks for taking time and explaining this. 

-ritesh