On 27 May 2021 at 10:21, Dave Chinner wrote: > Hi folks, > > I pulled on a loose thread when I started looking into the 64kB > directory block size assert failure I was seeing while trying to > test the bulk page allocation changes. > > I posted the first patch in the series separately - it fixed the > immediate assert failure (5.13-rc1 regression) I was seeing, but in > fixing that it only then dropped back to the previous assert failure > that g/538 was triggering with 64kb directory block sizes. This can > only be reproduced on 5.12, because that's when the error injection > that g/538 uses was added. So I went looking deeper. > > It turns out that xfs_bunmapi() has some code in it to avoid locking > AGFs in the wrong order and this is what was triggering. Many of the > xfs_bunmapi() callers can not/do not handle partial unmaps that > return success, and that's what the directory code is tripping over > trying to free badly fragmented directory blocks. > > This AGF locking order constraint was added to xfs_bunmapu in 2017 > to avoid a deadlock in g/299. Sad thing is that shortly after this, > we converted xfs-bunmapi to use deferred freeing, so it never > actually locks AGFs anymore. But the deadlock avoiding landmine > remained. And xfs_bmap_finish() went away, too, and we now only ever > put one extent in any EFI we log for deferred freeing. I did come across a scenario (when executing xfs/538 with 1k fs block size and 64k directory block size) where an EFI item contained three extents: - Two of those extents belonged to the file whose extents were being freed. - One more extent was added by xfs_bmap_btree_to_extents(). The corresponding call trace was, CPU: 3 PID: 1367 Comm: fsstress Not tainted 5.12.0-rc8-next-20210419-chandan #125 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0x64/0x7c xfs_defer_add.cold+0x1d/0x22 xfs_bmap_btree_to_extents+0x1f6/0x470 __xfs_bunmapi+0x50a/0xe60 ? xfs_trans_alloc_inode+0xbb/0x180 xfs_bunmapi+0x15/0x30 xfs_free_file_space+0x241/0x2c0 xfs_file_fallocate+0x1ca/0x430 ? __cond_resched+0x16/0x40 ? inode_security+0x22/0x60 ? selinux_file_permission+0xe2/0x120 vfs_fallocate+0x146/0x2e0 ioctl_preallocate+0x8f/0xc0 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae > > That means we now only free one extent per transaction via deferred > freeing, With three instances of xfs_extent_free_items associated with one instance of xfs_defer_pending, xfs_defer_finish_noroll() would, 1. Create an EFI item containing information about the three extents to be freed. - The extents in xfs_defer_pending->dfp_work list are sorted based on AG number. 2. Roll the transaction. 3. The new transaction would, - Create an EFD item to hold information about the three extents to be freed. - Free the three extents in a single transaction. > and there are no limitations on what order xfs_bunmapi() > can unmap extents. I think the sorting of extent items mentioned above is the reason that AG locks are obtained in increasing AGNO order while freeing extents. > 64kB directories on a 1kB block size filesystem > already unmap 64 extents in a single loop, so there's no real > limitation here. I think, in the worst case, we can free atmost XFS_EFI_MAX_FAST_EXTENTS (i.e. 16) extents in a single transaction assuming that they were all added in a sequence without any non-XFS_DEFER_OPS_TYPE_FREE deferred objects added in between. > > This means that the limitations of how many extents we can unmap per > loop in xfs_itruncate_extents_flags() goes away for data device > extents (and will eventually go away for RT devices, too, when > Darrick's RT EFI stuff gets merged). > > This "one data deveice extent free per transaction" change now means > that all of the transaction reservations that include > "xfs_bmap_finish" based freeing reservations are wrong. These extent > frees are now done by deferred freeing, and so they only need a > single extent free reservation instead of up to 4 (as truncate was > reserving). > > This series fixes the btree fork regression, the bunmapi partial > unmap regression from 2017, extends xfs_itruncate_extents to unmap > 64 extents at a time for data device (AG) resident extents, and > reworks the transaction reservations to use a consistent and correct > reservation for allocation and freeing extents. The size of some > transaction reservations drops dramatically as a result. > > The first two patches are -rcX candidates, the rest are for the next > merge cycle.... > -- chandan