XFS log recovery builds up an xlog_recover object as it passes through the log operations on the physical log. These structures are managed via a hash table and are allocated when a new transaction is encountered and freed once a commit operation for the transaction is encountered. This state machine for active transactions is implemented by a combination of xlog_do_recovery_pass(), which walks through the log buffers and xlog_recover_process_data() which processes log operations within each buffer. The latter function decides whether to allocate a new xlog_recover, add to it or commit and ultimately free it. If an error occurs at any point during the lifecycle of a particular xlog_recover, xlog_recover_process_data() frees the object and returns an error. xlog_recover_commit_trans() handles the final processing of the transaction. It submits whatever I/O is required for the transaction and frees xlog_recover object along with the transaction items it tracks. If an error occurs at the final stages of the commit operation, such as I/O failure, both xlog_recover_commit_trans() and xlog_recover_process_data() attempt to free the trans object. Modify xlog_recover_commit_trans() to only free the trans object on successful completion of the trans, including any I/O errors that might occur when recovering the log. Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> --- Hi all, I found that the recent buffer I/O rework fixes didn't address the crash reproduced by the dm-flakey/log recovery test case I posted recently. I tracked the crash down to this, which allows the test to pass. This addresses the crash I saw when running the reproducer manually with the metadump that Alex posted as well. FWIW, I also went back and tested the xfs_buf_iowait() experiment in both scenarios (Alex's metadump and xfstests test) and they all reproduce the same crash for me. I think that either I'm still not reproducing the original problem, something else might have contaminated the original xfs_buf_iowait() test to give a false positive, or something else entirely is going on. Alex, If you have a chance, I think it might be interesting to see whether you reproduce any problems with this patch. It looks like this is a regression introduced by: 2a84108f xfs: free the list of recovery items on error ... but I have no idea if that's in whatever kernel you're running. Brian fs/xfs/xfs_log_recover.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 176c4b3..daca9a6 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -3528,10 +3528,15 @@ out: if (!list_empty(&done_list)) list_splice_init(&done_list, &trans->r_itemq); - xlog_recover_free_trans(trans); - error2 = xfs_buf_delwri_submit(&buffer_list); - return error ? error : error2; + + if (!error) + error = error2; + /* caller frees trans on error */ + if (!error) + xlog_recover_free_trans(trans); + + return error; } STATIC int -- 1.8.3.1 _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs