> On Jun 16, 2023, at 3:29 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Fri, Jun 16, 2023 at 05:43:42PM +0000, Wengang Wang wrote: >>> On Jun 16, 2023, at 12:36 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>> On Fri, Jun 16, 2023 at 04:27:32AM +0000, Wengang Wang wrote: >>>>> On Jun 15, 2023, at 5:42 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote: >>>>>> On Jun 15, 2023, at 5:17 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>>> Can you please post that debug and analysis, rather than just a >>>>>> stack trace that is completely lacking in context? Nothing can be >>>>>> inferred from a stack trace, and what you are saying is occurring >>>>>> does not match what the code should actually be doing. So I need to >>>>>> actually look at what is happening in detail to work out where this >>>>>> mismatch is coming from.... >>>>> >>>>> The debug patch was based on my previous patch, I will rework the debug patch >>>>> basing on yours. I will share you the debug patch, output and my analysis later. >>>>> >>>> >>>> My analysis: >>>> It’s problem of double free. The first free is from xfs_efi_item_recover(), the >>>> second free is from xfs_extent_free_finish_item(). >>>> Dave’s patch makes it possible that xfs_trans_free_extent() returns -EAGAIN. >>>> The double free problem begins when the -EAGAIN is returned. >>>> >>>> 1. -EAGAIN returned by xfs_trans_free_extent(), see line 5 in the debug output. >>>> 2. according to -EAGAIN, xfs_free_extent_later() is called to create a deferred >>>> operation of type XFS_DEFER_OPS_TYPE_AGFL_FREE to current transaction. >>>> see line 6. >>>> 3. New EFI (xfs_efi_log_item) is created and attached to current transaction. >>>> And the deferred options is moved to capture_list. see line 9. The call path is: >>>> xfs_efi_item_recover() >>>> xfs_defer_ops_capture_and_commit(tp, capture_list) >>>> xfs_defer_ops_capture() >>>> xfs_defer_create_intents() >>>> xfs_defer_create_intent() >>>> ops->create_intent() —> xfs_extent_free_create_intent() >>>> 4. The new EFI is committed with current transaction. see line 10. >>>> 5. The mount process (log recover) continue to work with other intents. line 11 and line 12. >>>> 6. The committed new EFI was added to AIL after log flush by a parallel thread. line 20. >>>> 7. The mount process (log recover) continue work with other intents. line 21 to line 35. >>>> 8. The mount process (log recover) come to process the new EFI that was added to AIL >>>> at step 6. I guess the previous log items, which were added to AIL together with the >>>> EFI, were removed by xfsaild (I didn’t log that) so they didn’t appear in the AIL intents >>>> iteration in xlog_recover_process_intents(). see line 36. >>>> 9. The new EFI record (0x25441ca, 0x30) is freed by xfs_efi_item_recover(). \ >>>> That’s the first free. see line 37. >>>> 10. The AIL intents iteration is done. It begins to process the capture_list. >>>> 11. it comes to the XFS_DEFER_OPS_TYPE_AGFL_FREE deferred operation which is >>>> added in step 3. xfs_extent_free_finish_item() is called to free (2nd free) (0x25441ca, 0x30) >>>> and it failed because (0x25441ca, 0x30) was already freed at step 9. see line 43. >>>> So, it’s a double free issue. >>> >>> Yes, this much I understand. >>> >>> As I've been saying all along, there's something else wrong here, >>> because we should never encounter that new EFI in the recovery loop. >>> The recovery loop should see some other type of item and abort >>> before it sees that EFI. >> >> Maybe you are right. But here is my concern, its possible that: >> Even the were other types of items placed in AIL before that new EFI >> (after your change list_add() -> list_add_tail()), suppose this scenario: >> >> # suppose there were many log intents after the Original EFI (which needs retry) >> to process. >> 1. Non EFI type log items (say Item A to Item C) were added to AIL. >> 2. That new EFI log item, Item D was added to AIL. >> 3. Now the recovery process the remaining log intents. it takes long enough >> to let step 4 happen. >> 4. During the time that the mount (log recover) process was processing the >> remaining intents (suppose that needs long enough time), Item A, Item B >> and Item C (on AIL) were processed by xfs_aild (iop_push()) and them was >> removed from AIL later. Possible? >> 5. In the recovery intents loop, it comes to the first new log item, that is Item D, >> the new EFI. (Item A to Item C didn’t come before Item D because they were >> removed in step 4). >> 6. the records in the new EFI was freed by xfs_efi_item_recover() >> 7. the same records in the same new EFI was freed by xfs_extent_free_finish_item() >> again. >> >> If above is possible, the problem still exists even you make sure the new log >> items in original order when they are placed to AIL. > > Yes, I *know this can happen*. > > What I've been trying to understand is *how the bug occurred in the > first place*. > > Root cause analysis comes first. Second is confirming the analysis > is correct. Only then do we really start thinking about how to fix > it properly. > > We are at the second step: I need *empirical confirmation* that this > ordering problem is the root cause of the behaviour that was > observed. > > Stop trying to fix the problem before we understand how it happened! > >>> Go back to your trace-cmd shell and ctrl-c it. Wait for it to dump >>> all the trace events to the trace.dat file. Then run: >>> >>> # trace-cmd report > trace.txt >>> >>> the trace.txt file should have the entire mount trace in it, all the >>> xfs events, the trace-printk output and the console output. now you >>> can filter everything with grep, sed, awk, python, perl, any way you >>> like. >>> >> >> Good to know. Well, seems it can’t tell us the -EAGAIN thing? We don’t >> have trace even in every function for every case. > > We don't need an explicit trace for that. If we hit that code path > in recovery, we will see a xfs_bmap_free_deferred trace for an > extent from recovery, followed by a xfs_alloc_size_busy/xfs_alloc_near_busy > event followed immediately by a xfs_bmap_free_defer event for the > extent we were trying to free. > > That tells us an EAGAIN was received and the extent free was > deferred. > > All the tracepoints we need to tell us what is happening in this > path, one just needs to connect the dots correctly. > >>> The trace makes it pretty obvious what is happening: there's an >>> ordering problem in bulk AIL insertion. >>> >>> Can you test the patch below on top of the three in this patchset? >> >> This maybe work for this specific mounting, but it looks a bit tricky to me. >> Please see my comment above. I meant the below patch maybe can ‘fix’ the issue >> occurred to this specific xfs volume(metadump), but will is it sure to fix all cases? > > I don't care about "other cases" right now. All I'm trying to do is > confirm that this is the *root cause* of *this problem*. > > I am aware that this is a test patch, and that I wrote it purely to > confirm my hypothesis. I am aware that it doesn't fix everything and > I am aware that I am not *trying to fix everything* with it. I need > confirmation that my root cause analysis is correct before I spend > any time on working out how to fix the problem completely. > > So, can you please just test the patch and see if the problem is > fixed? Then OK, I will test it and report back. thanks, wengang