Re: [PATCH 0/6] xfs: bunmapi needs updating for deferred freeing

Chandan Babu R <chandanrlinux@xxxxxxxxx> · Mon, 31 May 2021 15:32:05 +0530

On 27 May 2021 at 10:21, Dave Chinner wrote:
> Hi folks,
>
> I pulled on a loose thread when I started looking into the 64kB
> directory block size assert failure I was seeing while trying to
> test the bulk page allocation changes.
>
> I posted the first patch in the series separately - it fixed the
> immediate assert failure (5.13-rc1 regression) I was seeing, but in
> fixing that it only then dropped back to the previous assert failure
> that g/538 was triggering with 64kb directory block sizes. This can
> only be reproduced on 5.12, because that's when the error injection
> that g/538 uses was added. So I went looking deeper.
>
> It turns out that xfs_bunmapi() has some code in it to avoid locking
> AGFs in the wrong order and this is what was triggering. Many of the
> xfs_bunmapi() callers can not/do not handle partial unmaps that
> return success, and that's what the directory code is tripping over
> trying to free badly fragmented directory blocks.
>
> This AGF locking order constraint was added to xfs_bunmapu in 2017
> to avoid a deadlock in g/299. Sad thing is that shortly after this,
> we converted xfs-bunmapi to use deferred freeing, so it never
> actually locks AGFs anymore. But the deadlock avoiding landmine
> remained. And xfs_bmap_finish() went away, too, and we now only ever
> put one extent in any EFI we log for deferred freeing.

I did come across a scenario (when executing xfs/538 with 1k fs block size and
64k directory block size) where an EFI item contained three extents:

- Two of those extents belonged to the file whose extents were being freed.
- One more extent was added by xfs_bmap_btree_to_extents().
  The corresponding call trace was,
    CPU: 3 PID: 1367 Comm: fsstress Not tainted 5.12.0-rc8-next-20210419-chandan #125
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    Call Trace:
     dump_stack+0x64/0x7c
     xfs_defer_add.cold+0x1d/0x22
     xfs_bmap_btree_to_extents+0x1f6/0x470
     __xfs_bunmapi+0x50a/0xe60
     ? xfs_trans_alloc_inode+0xbb/0x180
     xfs_bunmapi+0x15/0x30
     xfs_free_file_space+0x241/0x2c0
     xfs_file_fallocate+0x1ca/0x430
     ? __cond_resched+0x16/0x40
     ? inode_security+0x22/0x60
     ? selinux_file_permission+0xe2/0x120
     vfs_fallocate+0x146/0x2e0
     ioctl_preallocate+0x8f/0xc0
     __x64_sys_ioctl+0x62/0xb0
     do_syscall_64+0x40/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

>
> That means we now only free one extent per transaction via deferred
> freeing,

With three instances of xfs_extent_free_items associated with one instance of
xfs_defer_pending, xfs_defer_finish_noroll() would,
1. Create an EFI item containing information about the three extents to be
   freed.
   - The extents in xfs_defer_pending->dfp_work list are sorted based on AG
     number.
2. Roll the transaction.
3. The new transaction would,
   - Create an EFD item to hold information about the three extents to be
     freed.
   - Free the three extents in a single transaction.

> and there are no limitations on what order xfs_bunmapi()
> can unmap extents.

I think the sorting of extent items mentioned above is the reason that AG
locks are obtained in increasing AGNO order while freeing extents.

> 64kB directories on a 1kB block size filesystem
> already unmap 64 extents in a single loop, so there's no real
> limitation here.

I think, in the worst case, we can free atmost XFS_EFI_MAX_FAST_EXTENTS
(i.e. 16) extents in a single transaction assuming that they were all added
in a sequence without any non-XFS_DEFER_OPS_TYPE_FREE deferred objects
added in between.

>
> This means that the limitations of how many extents we can unmap per
> loop in xfs_itruncate_extents_flags() goes away for data device
> extents (and will eventually go away for RT devices, too, when
> Darrick's RT EFI stuff gets merged).
>
> This "one data deveice extent free per transaction" change now means
> that all of the transaction reservations that include
> "xfs_bmap_finish" based freeing reservations are wrong. These extent
> frees are now done by deferred freeing, and so they only need a
> single extent free reservation instead of up to 4 (as truncate was
> reserving).
>
> This series fixes the btree fork regression, the bunmapi partial
> unmap regression from 2017, extends xfs_itruncate_extents to unmap
> 64 extents at a time for data device (AG) resident extents, and
> reworks the transaction reservations to use a consistent and correct
> reservation for allocation and freeing extents. The size of some
> transaction reservations drops dramatically as a result.
>
> The first two patches are -rcX candidates, the rest are for the next
> merge cycle....
>

--
chandan