Re: [PATCH v4 00/14] forcealign for xfs

Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> · Thu, 05 Sep 2024 09:26:25 +0530

Thanks Dave for quick response. 

Dave Chinner <david@xxxxxxxxxxxxx> writes:

> On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
>> John Garry <john.g.garry@xxxxxxxxxx> writes:
>> 
>> > This series is being spun off the block atomic writes for xfs
>> > series at [0].
>> >
>> > That series got too big.
>> >
>> > The actual forcealign patches are roughly the same in this
>> > series.
>> >
>> > Why forcealign?  In some scenarios to may be required to
>> > guarantee extent alignment and granularity.
>> >
>> > For example, for atomic writes, the maximum atomic write unit
>> > size would be limited at the extent alignment and granularity,
>> > guaranteeing that an atomic write would not span data present in
>> > multiple extents.
>> >
>> > forcealign may be useful as a performance tuning optimization in
>> > other scenarios.
>> >
>> > I decided not to support forcealign for RT devices here.
>> > Initially I thought that it would be quite simple of implement.
>> > However, I discovered through much testing and subsequent debug
>> > that this was not true, so I decided to defer support to
>> > later.
>> >
>> > Early development xfsprogs support is at:
>> > https://github.com/johnpgarry/xfsprogs-dev/commits/atomic-writes/
>> >
>> 
>> Hi John,
>> 
>> Thanks for your continued work on atomic write.  I went over the
>> XFS patch series and this is my understanding + some queries.
>> Could you please help with these.
>
> Hi Ritesh - to make it easier for everyone to read and reply to you
> emails, can you please word wrap the text at 72 columns?
>

argh! Sorry about that. I had formed my queries in a separate notes
application and copy pasted it here. Hopefully this time it will be ok.

>> 1. As I understand XFS untorn atomic write support is built on top
>> of FORCEALIGN feature (which this series is adding) which in turn
>> uses extsize hint feature underneath.
>
> Yes.
>
>>    Now extsize hint mainly controls the alignment of both
>>    "physical start" & "logical start" offset and extent length,
>>    correct?
>
> Yes.
>
>>    This is done using args->alignment for start aand
>>    args->prod/mode variables for extent length. Correct?
>
> Yes.
>
>>    - If say we are not able to allocate an aligned physical start?
>>    Then since extsize is just a hint we go ahead with whatever
>>    best available extent is right?
>
> No. The definition of "forced alignment" is that we guarantee
> aligned allocation to the extent size hint. i.e the extent size hint
> is not a hint anymore - it defines the alignment we are guaranteeing
> allocation will achieve.
>
> hence if we can't align the extent to the alignment provided, we
> fail the alignment.
>
>>    - also extsize looks to be only providing allocation side of hints. (not de-allocation). Correct?
>
> No. See the use of xfs_inode_alloc_unitsize() in all the places
> where we free space. Forced alignment extends this function to
> return the extent size, not the block size.
>

Sorry for not being explicit. For queries in point 1. above, I was
referring to extent size hint feature w/o FORCEALIGN. But I got the
gist from your response. Thanks!

>> 2. If say there is an append write i.e. the allocation is needed
>> to be done at EOF. Then we try for an exact bno (from eof block)
>> and aligned extent length, right?
>
> Yes. This works because the previous extent is exactly aligned,
> hence a contiguous allocation will continue to be correctly aligned
> due to the forced alignment constraints.
>
>>    i.e. xfs_bmap_btalloc_filestreams() ->
>>    xfs_bmap_btalloc_at_eof(ap, args); If it is not available then
>>    we try for nearby bno xfs_alloc_vextent_near_bno(args, target)
>>    and similar...
>
> yes, that's just the normal aligned allocation fallback path when
> exact allocation fails.
>
>> 3. It is the FORCEALIGN feature which _mandates_ both allocation
>> (by using extsize hint) and de-allocation to happen _only_ in
>> extsize chunks.
>>
>>    i.e. forcealign mandates -
>>    - the logical and physical start offset should be aligned as
>>    per args->alignment
>>    - extent length be aligned as per args->prod/mod.
>>      If above two cannot be satisfied then return -ENOSPC.
>
> Yes.
>
>> 
>>    - Does the unmapping of extents also only happens in extsize
>>    chunks (with forcealign)?
>
> Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> aligning the fsbno ranges to be unmapped.
>
> Remember, force align requires both logical file offset and
> physical block number to be correctly aligned,

This is where I would like to double confirm it again. Even the
extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
physical start and logical start file offset and length right?

(Or does extsize hint only restricts alignment to logical start file
offset + length and not the physical start?)

Also it looks like there is no difference with ATOMIC_WRITE AND
FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
adding additional natural alignment restrictions on pos and len). 
So why maintain 2 separate on disk inode flags for FORCEALIGN AND
ATOMIC_WRITE?
- Do you foresee FORCEALIGN to be also used at other places w/o
ATOMIC_WRITE where feature differentiation between the two on an
inode is required?

- Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
& XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?

- But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
changes within XFS filesystem to support atomic writes, right? 
Is it something to just prevent users from destroying their own data
by not allowing a rw mount from an older kernel where users could do
unaligned writes to files marked for atomic writes?
Or is there any other reasoning to prevent XFS filesystem from becoming
inconsistent if an older kernel does a rw mount here.

> so unmap alignment
> has to be set up correctly at file offset level before we even know
> what extents underly the file range we need to unmap....
>
>>      If the start or end of the extent which needs unmapping is
>>      unaligned then we convert that extent to unwritten and skip,
>>      is it? (__xfs_bunmapi())
>
> The high level code should be aligning the start and end of the
> file range to be removed via xfs_inode_alloc_unitsize(). Hence 
> the low level __xfs_bunmapi() code shouldn't ever be encountering
> unaligned unmaps on force-aligned inodes.
>

Yes, but isn't this code snippet trying to handle a case when it finds an
unaligned extent during unmap? And what we are essentially trying to 
do here is leave the unwritten extent as is and if the found extent is
written then convert to unwritten and skip it (goto nodelete). This
means with forcealign if we encounter an unaligned extent then the file
will have this space reserved as is with extent marked unwritten. 

Is this understanding correct?

__xfs_bunmapi(...) 
{
    unsigned int alloc_fsb = xfs_inode_alloc_fsbsize(ip);
    <...>
    while (end != (xfs_fileoff_t)-1 && end >= start &
          (nexts == 0 || extno < nexts)) {
          <...>

          if (alloc_fsb == 1 || (flags & XFS_BMAPI_REMAP))
                goto delete;

         mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
                         del.br_startblock + del.br_blockcount);
         if (mod) {
                 /*
                  * Not aligned to allocation unit on the end.
                  * The extent could have been split into written
                  * and unwritten pieces, or we could just be
                  * unmapping part of it.  But we can't really
                  * get rid of part of an extent.
                  */
                 if (del.br_state == XFS_EXT_UNWRITTEN) {
                         /*
                          * This piece is unwritten, or we're not
                          * using unwritten extents.  Skip over it.
                          */
                         ASSERT((flags & XFS_BMAPI_REMAP) || end >= mod);
                         end -= mod > del.br_blockcount ?
                                 del.br_blockcount : mod;
                         if (end < got.br_startoff &&
                             !xfs_iext_prev_extent(ifp, &icur, &got)) {
                                 done = true;
                                 break;
                         }
                         continue;
                 }
                 /*
                  * It's written, turn it unwritten.
                  * This is better than zeroing it.
                  */
                 ASSERT(del.br_state == XFS_EXT_NORM);
                 ASSERT(tp->t_blk_res > 0);
                 /*
                  * If this spans an extent boundary, chop it back to
                  * the start of the one we end at.
                  */
                 if (del.br_blockcount > mod) {
                         del.br_startoff += del.br_blockcount - mod;
                         del.br_startblock += del.br_blockcount - mod;
                         del.br_blockcount = mod;
                 }
                 del.br_state = XFS_EXT_UNWRITTEN;
                 error = xfs_bmap_add_extent_unwritten_real(tp, ip,
                                 whichfork, &icur, &cur, &del,
                                 &logflags);
                 if (error)
                         goto error0;
                 goto nodelete;
         }

         mod = xfs_bmap_alloc_unit_offset(ip, alloc_fsb,
                                 del.br_startblock);
         if (mod) {
            // handle it for unaligned start block
            <...>
         }
    }
}

> -Dave.

Thanks a lot for answering the queries.

-ritesh