Hi Brian, On Thu, Aug 21, 2014 at 10:18 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote: > XFS log recovery builds up an xlog_recover object as it passes through > the log operations on the physical log. These structures are managed via > a hash table and are allocated when a new transaction is encountered and > freed once a commit operation for the transaction is encountered. > > This state machine for active transactions is implemented by a > combination of xlog_do_recovery_pass(), which walks through the log > buffers and xlog_recover_process_data() which processes log operations > within each buffer. The latter function decides whether to allocate a > new xlog_recover, add to it or commit and ultimately free it. If an > error occurs at any point during the lifecycle of a particular > xlog_recover, xlog_recover_process_data() frees the object and returns > an error. > > xlog_recover_commit_trans() handles the final processing of the > transaction. It submits whatever I/O is required for the transaction and > frees xlog_recover object along with the transaction items it tracks. If > an error occurs at the final stages of the commit operation, such as I/O > failure, both xlog_recover_commit_trans() and > xlog_recover_process_data() attempt to free the trans object. > > Modify xlog_recover_commit_trans() to only free the trans object on > successful completion of the trans, including any I/O errors that might > occur when recovering the log. > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > --- > > Hi all, > > I found that the recent buffer I/O rework fixes didn't address the crash > reproduced by the dm-flakey/log recovery test case I posted recently. I > tracked the crash down to this, which allows the test to pass. This > addresses the crash I saw when running the reproducer manually with the > metadump that Alex posted as well. > > FWIW, I also went back and tested the xfs_buf_iowait() experiment in > both scenarios (Alex's metadump and xfstests test) and they all > reproduce the same crash for me. I think that either I'm still not > reproducing the original problem, something else might have contaminated > the original xfs_buf_iowait() test to give a false positive, or > something else entirely is going on. > > Alex, > > If you have a chance, I think it might be interesting to see whether you > reproduce any problems with this patch. It looks like this is a > regression introduced by: > > 2a84108f xfs: free the list of recovery items on error > > ... but I have no idea if that's in whatever kernel you're running. I am running kernel 3.8.13 with some changes (published at https://github.com/zadarastorage/zadara-xfs-pushback), but this problem also happens on pristine 3.8.13 from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git, branch linux-stable-3.8.y. I do not have commit 2a84108f in this kernel. It was introduced in 3.14. I applied your patch to 3.8.13, but it doesn't fix the issue. Same problem happens when testing scenario that I described in http://oss.sgi.com/pipermail/xfs/2014-August/037637.html. Thanks, Alex. > > Brian > > fs/xfs/xfs_log_recover.c | 11 ++++++++--- > 1 file changed, 8 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 176c4b3..daca9a6 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -3528,10 +3528,15 @@ out: > if (!list_empty(&done_list)) > list_splice_init(&done_list, &trans->r_itemq); > > - xlog_recover_free_trans(trans); > - > error2 = xfs_buf_delwri_submit(&buffer_list); > - return error ? error : error2; > + > + if (!error) > + error = error2; > + /* caller frees trans on error */ > + if (!error) > + xlog_recover_free_trans(trans); > + > + return error; > } > > STATIC int > -- > 1.8.3.1 > _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs