[ sorry to take so long to get back to this, Brian, I missed your reply and only yesterday when I was sorting out for-next updates that I still had this on my "for-review" patch stack. ] On Mon, Aug 29, 2016 at 02:17:22PM -0400, Brian Foster wrote: > On Mon, Aug 29, 2016 at 11:16:31AM +1000, Dave Chinner wrote: > > On Thu, Aug 11, 2016 at 01:11:03PM -0400, Brian Foster wrote: > > i.e. We are very careful to write commit records in the correct > > order because that is what determines recovery order, but we don't > > care what order we write the actual contents of the checkpoints or > > whether they interleave with other checkpoints. As such, ophdrs > > change transactions and LSNs without having actually completed > > recovery of a checkpoint. > > > > I think writeback should occur when all the transactions with a > > given lsn have been committed. I'm not sure there's a simple way to > > track and detect this, but using the ophdrs to detect a change of > > lsn to trigger buffer writeback does not look correct to me at this > > point in time. > > > > That is precisely the intent of this patch. What I think could be a > problem is something like the following, if possible: > > CA CB CC CD > +---------+--------+--+-------+--+--------+-------+--+--+ > trans A trans B trans C trans C trans D Yes, that's possible. > Assume that trans A and trans B are within the same record and trans C > is in a separate record. In that case, we commit trans A which populates > buffer_list. We lookup trans C, note a new LSN and drain buffer_list. > Then we ultimately commit trans B, which has the same metadata LSN as > trans A and thus is a path to the original problem if trans B happened > to modify any of the same blocks as trans A. Yes, that's right, we still are exposed to the same problem, and there's much more convoluted versions of it possible. > Do note however that this is just an occurrence of the problem with log > recovery as implemented today (once we update metadata LSNs, and is > likely rare as I haven't been able to reproduce corruption in many > tries). Yeah, it's damn hard to intentionally cause interleaving of checkpoint and commit records these days because of the delayed logging does aggregation in memory rather than in the log buffers themselves. > If that analysis is correct, I think a straightforward solution > might be to defer submission to the lookup of a transaction with a new > LSN that _also_ corresponds with processing of a commit record based on > where we are in the on-disk log. E.g.: > > if (log->l_recovery_lsn != trans->r_lsn && > oh_flags & XLOG_COMMIT_TRANS) { > error = xfs_buf_delwri_submit(buffer_list); > ... > } > > So in the above, we'd submit buffers for A and B once we visit the > commit record for trans C. Thoughts? Sounds plausible - let me just check I understood by repeating it back. Given the above case, we start with log->l_recovery_lsn set to the lsn before trans A and an empty buffer list. 1. We now recover trans A and trans B into their respective structures, but we don't don't add their dirty buffers to the delwri list yet - they are kept internal to the trans. 2. We then see commit A, and because the buffer list is empty we simply add them to the buffer list and update log->l_recovery_lsn to point at the transaction LSN. 3. We now see trans C, and start recovering it into an internal buffer list. 4. Then we process commit B, see that there are already queued buffers and so check the transaction LSN against log->l_recovery_lsn. They are the same, so we simply add the transactions dirty buffers to the buffer list. 5. We continue processing transaction C, and start on transaction D. We then see commit C. Buffer list is populated, so we check transaction lsn against log->l_recovery_lsn. They are different. At this point we know we have fully processed all the transactions that are associated with log->l_recovery_lsn, hence we can submit the buffer_list and mark it empty again. 6. At this point we jump back to step 2, this time processing commit C onwards.... 7. At the end of log recovery, we commit the remaining buffer list from the last transaction we recovered from the log. Did I understand it right? If so, I think this will work just fine. Thanks, Brian! -Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs