Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 17 Jun 2023 08:29:36 +1000

On Fri, Jun 16, 2023 at 05:43:42PM +0000, Wengang Wang wrote:
> > On Jun 16, 2023, at 12:36 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, Jun 16, 2023 at 04:27:32AM +0000, Wengang Wang wrote:
> >>> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
> >>>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >>>> Can you please post that debug and analysis, rather than just a
> >>>> stack trace that is completely lacking in context? Nothing can be
> >>>> inferred from a stack trace, and what you are saying is occurring
> >>>> does not match what the code should actually be doing. So I need to
> >>>> actually look at what is happening in detail to work out where this
> >>>> mismatch is coming from....
> >>> 
> >>> The debug patch was based on my previous patch, I will rework the debug patch
> >>> basing on yours. I will share you the debug patch, output and my analysis later. 
> >>> 
> >> 
> >> My analysis:
> >> It’s problem of double free. The first free is from xfs_efi_item_recover(), the
> >> second free is from xfs_extent_free_finish_item().
> >> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
> >> The double free problem begins when the -EAGAIN is returned.
> >> 
> >> 1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
> >> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
> >>   operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
> >>   see line 6.
> >> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
> >>   And the deferred options is moved to capture_list. see line 9. The call path is:
> >>   xfs_efi_item_recover()
> >>     xfs_defer_ops_capture_and_commit(tp, capture_list)
> >>       xfs_defer_ops_capture()
> >>         xfs_defer_create_intents()
> >>           xfs_defer_create_intent()
> >>             ops->create_intent() —>   xfs_extent_free_create_intent()
> >> 4. The new EFI is committed with current transaction. see line 10.
> >> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
> >> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
> >> 7. The mount process (log recover) continue work with other intents. line 21 to line 35.
> >> 8. The mount process (log recover) come to process the new EFI that was added to AIL
> >>   at step 6. I guess the previous log items, which were added to AIL together with the
> >>   EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
> >>   iteration in xlog_recover_process_intents().  see line 36.
> >> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
> >>   That’s the first free. see line 37.
> >> 10. The AIL intents iteration is done. It begins to process the capture_list.
> >> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
> >>    added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
> >>    and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
> >>    So, it’s a double free issue.
> > 
> > Yes, this much I understand.
> > 
> > As I've been saying all along, there's something else wrong here,
> > because we should never encounter that new EFI in the recovery loop.
> > The recovery loop should see some other type of item and abort
> > before it sees that EFI.
> 
> Maybe you are right. But here is my concern, its possible that:
> Even the were other types of items placed in AIL before that new EFI
> (after your change list_add() -> list_add_tail()), suppose this scenario:
> 
> # suppose there were many log intents after the Original EFI (which needs retry)
>    to process. 
> 1.  Non EFI type log items (say Item A to Item C) were added to AIL.
> 2.  That new EFI log item, Item D was added to AIL.
> 3.  Now the recovery process the remaining log intents. it takes long enough
>      to let step 4 happen. 
> 4.  During the time that the mount (log recover) process was processing the
>      remaining intents (suppose that needs long enough time), Item A, Item B
>      and Item C (on AIL) were processed by xfs_aild (iop_push()) and them was
>      removed from AIL later.  Possible?
> 5.  In the recovery intents loop, it comes to the first new log item, that is Item D,
>      the new EFI.  (Item A to Item C didn’t come before Item D because they were
>      removed in step 4).
> 6.  the records in the new EFI was freed by xfs_efi_item_recover()
> 7.  the same records in the same new EFI was freed by xfs_extent_free_finish_item()
>      again.
> 
> If above is possible, the problem still exists even you make sure the new log
> items in original order when they are placed to AIL.

Yes, I *know this can happen*.

What I've been trying to understand is *how the bug occurred in the
first place*. 

Root cause analysis comes first. Second is confirming the analysis
is correct. Only then do we really start thinking about how to fix
it properly.

We are at the second step: I need *empirical confirmation* that this
ordering problem is the root cause of the behaviour that was
observed.

Stop trying to fix the problem before we understand how it happened!

> > Go back to your trace-cmd shell and ctrl-c it. Wait for it to dump
> > all the trace events to the trace.dat file. Then run:
> > 
> > # trace-cmd report > trace.txt
> > 
> > the trace.txt file should have the entire mount trace in it, all the
> > xfs events, the trace-printk output and the console output. now you
> > can filter everything with grep, sed, awk, python, perl, any way you
> > like.
> > 
> 
> Good to know. Well, seems it can’t tell us the -EAGAIN thing? We don’t
> have trace even in every function for every case.

We don't need an explicit trace for that. If we hit that code path
in recovery, we will see a xfs_bmap_free_deferred trace for an
extent from recovery, followed by a xfs_alloc_size_busy/xfs_alloc_near_busy
event followed immediately by a xfs_bmap_free_defer event for the
extent we were trying to free.

That tells us an EAGAIN was received and the extent free was
deferred.

All the tracepoints we need to tell us what is happening in this
path, one just needs to connect the dots correctly.

> > The trace makes it pretty obvious what is happening: there's an
> > ordering problem in bulk AIL insertion.
> > 
> > Can you test the patch below on top of the three in this patchset?
> 
> This maybe work for this specific mounting, but it looks a bit tricky to me.
> Please see my comment above. I meant the below patch maybe can ‘fix’ the issue
> occurred to this specific xfs volume(metadump), but will is it sure to fix all cases?

I don't care about "other cases" right now. All I'm trying to do is
confirm that this is the *root cause* of *this problem*.

I am aware that this is a test patch, and that I wrote it purely to
confirm my hypothesis. I am aware that it doesn't fix everything and
I am aware that I am not *trying to fix everything* with it. I need
confirmation that my root cause analysis is correct before I spend
any time on working out how to fix the problem completely.

So, can you please just test the patch and see if the problem is
fixed?

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx