On Fri, Jan 21, 2011 at 04:22:32AM -0500, Christoph Hellwig wrote: > During an allocation or freeing of an extent, we occasionally have an > interesting situation happen during a btree update. We may merge a block > as part of a record delete, then allocate it again immediately afterwards > when inserting a modified extent. xfs_alloc_fixup_trees() does this sort > of manipulation to the btrees, and so can merge then immediately split > the tree resulting the in the same busy block being used from the > freelist. > > Previously, this would trigger a synchronous log force of the entire log > (as the current transaction had a lsn of 0 until committed), but continue > to allow the extent to be used because if it is not on disk now then it > must have been freed and marked busy in this transaction. > > In the case of allocbt blocks, we log the fact that they get moved to the > freelist and we also log them when the move off the free list back into > the free space trees. When we move the blocks off the freelist back to > the trees, we mark the extent busy at that point so that it doesn't get > reused until the transaction is committed. I'm not sure this is correct - it's when they are put on the freelist that they are marked busy (xfs_allocbt_free_block), and with the previous patch they now don't get marked busy when freed back to the trees. > This means that as long as the block is on the AGFL and is only used > for allocbt blocks, we don't have to force the log to ensure prior > manipulating transactions are on disk as the process of log recovery > will replay the transactions in the correct order. I think that as long as the busy extent is reallocated as metadata of any kind, it does not require us to force the log/mark the transaction synchronous. That is, the metadata modifications will not be written to the extent until the allocation transaction has hit the disk regardless of whether we force the log here or not. Hence log recovery will always do the right thing here. It's the metadata -> data extent re-allocation that we need the protecion for, as we cannot serialise data extent writes against log recovery requirements. That is, we have no idea if the data extent contents have been written or not when we run log recovery, so we cannot allow data to be written until we are certain that that the data extent allocation overrides all the recorded metadata changes in the log. For data -> metadata reallocation, this is not a problem as we know the metadata writes will not be done (and therefore overwrite the data on disk) until the transactions are on disk and the metadata buffers unpinned. The same goes for metadata->metadata reallocation. Hence I don't think we specifically need to track allocbt blocks in the busy list - we track all freed blocks like we currently do, but we only need to search the busy block list and mark transactions synchronous for data extent allocations. This removes the complexity of removing busy allocbt blocks from the busy tree when they get reused - if they remain allocated they will be removed from the tree as transactions/checkpoints are committed to disk just like they currently are. That also avoids the problem of a transaction commit trying to remove a busy extent from the tree that isn't in there anymore or potential races based around transactions adding and removing busy blocks that log IO completion will be trying to remove. Does this make sense? I understand this code a lot better than I did when I originally wrote these patches, and to tell the truth I can't remember exactly why I took this approach. Looking over the patches makes me think that I was trying to do the minimum possible semantic change to solve the specific problem I was seeing, rather than understanding the entire workings of the busy extent list and how to optimise from there... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs