On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote: > On 06/13/2014 10:28 PM, Ivan Shapovalov wrote: > > Here is my "analysis" of what happens in reiser4 during a transaction's > > lifetime wrt. block allocation and deallocation. > > > > > > THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS > > reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP > > Yes. > > > reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP > > reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set > > This is correct for the middle-level allocator (without suffix "bitmap"). > The low-level one frees blocks only in WORKING BITMAP. Yes, this was a wording mistake. I've noticed it shortly after sending... > > reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP > > "Relocated" is bad term here: nodes with new data also get the flag > JNODE_RELOC. So, I would rather say, that it applies freshly allocated > nodes of the atom to COMMIT BITMAP. > > > > deallocates ->delete_set from COMMIT BITMAP > > applies the atom's delete_set to COMMIT BITMAP I've meant exactly this. > > reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER > > (i. e. from WORKING BITMAP) > > applies the atom's delete_set to WORKING BITMAP. ditto > I would also mention a function of the middle-level block allocator > reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP. Yes, sure. > Note that the middle-level block allocator (block_alloc.c) actually > manipulates > with abstract space maps. Currently in reiser4 they are represented only by > bitmaps (plugin/space/bitmap.c). We can also implement another > representation - > extent tree (like in XFS). I don't see any needs for now, though. Extent trees seem to be interesting.. They claim that they are more efficient - is the difference that huge? > > TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT > > - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC, > > so WORKING BITMAP ensures that two nodes cannot get the same block; > > - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER), > > so their deallocation is not immediately reflected in WORKING BITMAP; > > (the relocate set is written here) > > - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set > > to convey effective bitmap changes into COMMIT BITMAP; > > (the journal and overwrite set are written here) > > - reiser4_post_commit_hook() uses ->delete_set to convey deallocations > > from step 2 to WORKING BITMAP. > > (the discard happens here) > > > > > > TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS > > - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they > > are allocated in WORKING BITMAP and do not interfere with any "usual" blocks; > > - after writing wandered blocks, they are deallocated using > > reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP. > > > So, the system of working and commit bitmaps plus the delete set seems > to be redundant? I think this is because of performance reasons: block > allocation is critical thing... Seems like it is -- for performance and simplicity (== robustness). > > CONCLUSION > > At possible transaction replay time, journal blocks are not allocated in any > > of the bitmaps. However, because the journal is read and replayed before a > > transaction has a chance to commit, this fact does not matter. > > What matters is that wandered journal blocks never hit COMMIT BITMAP. > > > > So, if I've got all this correct (which I highly doubt), the disk space leak > > (as you pointed it out) does not exist. > > It seems, you are right.. > > > > > What exists is a rather different problem with my idea of "log every > > deallocated block". Current implementation logs every block regardless of > > BA_DEFER flag presence or absence, so non-wandered blocks are logged twice. > > > > We could just use ->delete_set, but we would lose wandered blocks then. > > Or we could only log !BA_DEFER requests, which would do the right thing > > (wandered blocks + deallocations from reiser4_post_commit_hook()), but > > the reasoning behind this decision would not be obvious for a casual > > code reader. > > I think that a good comment will save the situation.. > > > Or we could log only wandered blocks (in addition to ->delete_set) > > at discard time, but this is messy and requires us to merge the discard log > > with ->delete_set at discard time. > > what is the difference with the previous "we could.."? The previous option is to log all !BA_DEFER requests in addition to ->delete_set, so those two block sets would be partially overlapping. And this option is to log only those !BA_DEFER requests which are not in ->delete_set, so no block will be stored twice. Thus we will have to consider both ->delete_set and this hypothetical ->discard_set at discard time. > > Or we could log wandered blocks straight into ->delete_set and do something in > > reiser4_post_commit_hook() to separate these entries, but this is super messy. > > > > I'm preferring the second way... Edward, please proof-read all this. > > BTW, what about your current implementation? Does it work for you? Yes, it apparently does (no massive corruptions). I get batches of "orphan cluster" errors from time to time (from offset 0 till the supposed end of file), but 1) I do not see which files get corrupted, 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to some driver issues, so I don't know what exactly causes this. Thanks for proof-reading. What do you think about all this? -- Ivan Shapovalov / intelfx / > > Thanks, > Edward.
Attachment:
signature.asc
Description: This is a digitally signed message part.