Re: [PATCH 1/3] xfs: pass alloc flags through to xfs_extent_busy_flush()

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Fri, 16 Jun 2023 22:53:45 +0000

> On Jun 16, 2023, at 3:29 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Fri, Jun 16, 2023 at 05:43:42PM +0000, Wengang Wang wrote:
>>> On Jun 16, 2023, at 12:36 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> On Fri, Jun 16, 2023 at 04:27:32AM +0000, Wengang Wang wrote:
>>>>> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
>>>>>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>> Can you please post that debug and analysis, rather than just a
>>>>>> stack trace that is completely lacking in context? Nothing can be
>>>>>> inferred from a stack trace, and what you are saying is occurring
>>>>>> does not match what the code should actually be doing. So I need to
>>>>>> actually look at what is happening in detail to work out where this
>>>>>> mismatch is coming from....
>>>>> 
>>>>> The debug patch was based on my previous patch, I will rework the debug patch
>>>>> basing on yours. I will share you the debug patch, output and my analysis later. 
>>>>> 
>>>> 
>>>> My analysis:
>>>> It’s problem of double free. The first free is from xfs_efi_item_recover(), the
>>>> second free is from xfs_extent_free_finish_item().
>>>> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN.
>>>> The double free problem begins when the -EAGAIN is returned.
>>>> 
>>>> 1. -EAGAIN returned by xfs_trans_free_extent(),  see line 5 in the debug output.
>>>> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred
>>>>  operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE  to current transaction.
>>>>  see line 6.
>>>> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction.
>>>>  And the deferred options is moved to capture_list. see line 9. The call path is:
>>>>  xfs_efi_item_recover()
>>>>    xfs_defer_ops_capture_and_commit(tp, capture_list)
>>>>      xfs_defer_ops_capture()
>>>>        xfs_defer_create_intents()
>>>>          xfs_defer_create_intent()
>>>>            ops->create_intent() —>   xfs_extent_free_create_intent()
>>>> 4. The new EFI is committed with current transaction. see line 10.
>>>> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12.
>>>> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20.
>>>> 7. The mount process (log recover) continue work with other intents. line 21 to line 35.
>>>> 8. The mount process (log recover) come to process the new EFI that was added to AIL
>>>>  at step 6. I guess the previous log items, which were added to AIL together with the
>>>>  EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents
>>>>  iteration in xlog_recover_process_intents().  see line 36.
>>>> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \
>>>>  That’s the first free. see line 37.
>>>> 10. The AIL intents iteration is done. It begins to process the capture_list.
>>>> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is
>>>>   added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30)
>>>>   and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43.
>>>>   So, it’s a double free issue.
>>> 
>>> Yes, this much I understand.
>>> 
>>> As I've been saying all along, there's something else wrong here,
>>> because we should never encounter that new EFI in the recovery loop.
>>> The recovery loop should see some other type of item and abort
>>> before it sees that EFI.
>> 
>> Maybe you are right. But here is my concern, its possible that:
>> Even the were other types of items placed in AIL before that new EFI
>> (after your change list_add() -> list_add_tail()), suppose this scenario:
>> 
>> # suppose there were many log intents after the Original EFI (which needs retry)
>>   to process. 
>> 1.  Non EFI type log items (say Item A to Item C) were added to AIL.
>> 2.  That new EFI log item, Item D was added to AIL.
>> 3.  Now the recovery process the remaining log intents. it takes long enough
>>     to let step 4 happen. 
>> 4.  During the time that the mount (log recover) process was processing the
>>     remaining intents (suppose that needs long enough time), Item A, Item B
>>     and Item C (on AIL) were processed by xfs_aild (iop_push()) and them was
>>     removed from AIL later.  Possible?
>> 5.  In the recovery intents loop, it comes to the first new log item, that is Item D,
>>     the new EFI.  (Item A to Item C didn’t come before Item D because they were
>>     removed in step 4).
>> 6.  the records in the new EFI was freed by xfs_efi_item_recover()
>> 7.  the same records in the same new EFI was freed by xfs_extent_free_finish_item()
>>     again.
>> 
>> If above is possible, the problem still exists even you make sure the new log
>> items in original order when they are placed to AIL.
> 
> Yes, I *know this can happen*.
> 
> What I've been trying to understand is *how the bug occurred in the
> first place*. 
> 
> Root cause analysis comes first. Second is confirming the analysis
> is correct. Only then do we really start thinking about how to fix
> it properly.
> 
> We are at the second step: I need *empirical confirmation* that this
> ordering problem is the root cause of the behaviour that was
> observed.
> 
> Stop trying to fix the problem before we understand how it happened!
> 
>>> Go back to your trace-cmd shell and ctrl-c it. Wait for it to dump
>>> all the trace events to the trace.dat file. Then run:
>>> 
>>> # trace-cmd report > trace.txt
>>> 
>>> the trace.txt file should have the entire mount trace in it, all the
>>> xfs events, the trace-printk output and the console output. now you
>>> can filter everything with grep, sed, awk, python, perl, any way you
>>> like.
>>> 
>> 
>> Good to know. Well, seems it can’t tell us the -EAGAIN thing? We don’t
>> have trace even in every function for every case.
> 
> We don't need an explicit trace for that. If we hit that code path
> in recovery, we will see a xfs_bmap_free_deferred trace for an
> extent from recovery, followed by a xfs_alloc_size_busy/xfs_alloc_near_busy
> event followed immediately by a xfs_bmap_free_defer event for the
> extent we were trying to free.
> 
> That tells us an EAGAIN was received and the extent free was
> deferred.
> 
> All the tracepoints we need to tell us what is happening in this
> path, one just needs to connect the dots correctly.
> 
>>> The trace makes it pretty obvious what is happening: there's an
>>> ordering problem in bulk AIL insertion.
>>> 
>>> Can you test the patch below on top of the three in this patchset?
>> 
>> This maybe work for this specific mounting, but it looks a bit tricky to me.
>> Please see my comment above. I meant the below patch maybe can ‘fix’ the issue
>> occurred to this specific xfs volume(metadump), but will is it sure to fix all cases?
> 
> I don't care about "other cases" right now. All I'm trying to do is
> confirm that this is the *root cause* of *this problem*.
> 
> I am aware that this is a test patch, and that I wrote it purely to
> confirm my hypothesis. I am aware that it doesn't fix everything and
> I am aware that I am not *trying to fix everything* with it. I need
> confirmation that my root cause analysis is correct before I spend
> any time on working out how to fix the problem completely.
> 
> So, can you please just test the patch and see if the problem is
> fixed?

Then OK, I will test it and report back.

thanks,
wengang