On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote: > > On 06/15/2014 11:58 PM, Ivan Shapovalov wrote: > > On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote: > >> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote: > >>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote: > >>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote: > >>>>> Here is my "analysis" of what happens in reiser4 during a transaction's > >>>>> lifetime wrt. block allocation and deallocation. > >>>>> > >>>>> > >>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS > >>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP > >>>> Yes. > >>>> > >>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP > >>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set > >>>> This is correct for the middle-level allocator (without suffix "bitmap"). > >>>> The low-level one frees blocks only in WORKING BITMAP. > >>> Yes, this was a wording mistake. I've noticed it shortly after sending... > >>> > >>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP > >>>> "Relocated" is bad term here: nodes with new data also get the flag > >>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated > >>>> nodes of the atom to COMMIT BITMAP. > >>>> > >>>> > >>>>> deallocates ->delete_set from COMMIT BITMAP > >>>> applies the atom's delete_set to COMMIT BITMAP > >>> I've meant exactly this. > >>> > >>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER > >>>>> (i. e. from WORKING BITMAP) > >>>> applies the atom's delete_set to WORKING BITMAP. > >>> ditto > >>> > >>>> I would also mention a function of the middle-level block allocator > >>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP. > >>> Yes, sure. > >>> > >>> > >>>> Note that the middle-level block allocator (block_alloc.c) actually > >>>> manipulates > >>>> with abstract space maps. Currently in reiser4 they are represented only by > >>>> bitmaps (plugin/space/bitmap.c). We can also implement another > >>>> representation - > >>>> extent tree (like in XFS). I don't see any needs for now, though. > >>> Extent trees seem to be interesting.. They claim that they are more efficient - > >>> is the difference that huge? > >>> > >>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT > >>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC, > >>>>> so WORKING BITMAP ensures that two nodes cannot get the same block; > >>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER), > >>>>> so their deallocation is not immediately reflected in WORKING BITMAP; > >>>>> (the relocate set is written here) > >>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set > >>>>> to convey effective bitmap changes into COMMIT BITMAP; > >>>>> (the journal and overwrite set are written here) > >>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations > >>>>> from step 2 to WORKING BITMAP. > >>>>> (the discard happens here) > >>>>> > >>>>> > >>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS > >>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they > >>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks; > >>>>> - after writing wandered blocks, they are deallocated using > >>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP. > >>>> So, the system of working and commit bitmaps plus the delete set seems > >>>> to be redundant? I think this is because of performance reasons: block > >>>> allocation is critical thing... > >>> Seems like it is -- for performance and simplicity (== robustness). > >>> > >>>>> CONCLUSION > >>>>> At possible transaction replay time, journal blocks are not allocated in any > >>>>> of the bitmaps. However, because the journal is read and replayed before a > >>>>> transaction has a chance to commit, this fact does not matter. > >>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP. > >>>>> > >>>>> So, if I've got all this correct (which I highly doubt), the disk space leak > >>>>> (as you pointed it out) does not exist. > >>>> It seems, you are right.. > >>>> > >>>>> What exists is a rather different problem with my idea of "log every > >>>>> deallocated block". Current implementation logs every block regardless of > >>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice. > >>>>> > >>>>> We could just use ->delete_set, but we would lose wandered blocks then. > >>>>> Or we could only log !BA_DEFER requests, which would do the right thing > >>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but > >>>>> the reasoning behind this decision would not be obvious for a casual > >>>>> code reader. > >>>> I think that a good comment will save the situation.. > >>>> > >>>>> Or we could log only wandered blocks (in addition to ->delete_set) > >>>>> at discard time, but this is messy and requires us to merge the discard log > >>>>> with ->delete_set at discard time. > >>>> what is the difference with the previous "we could.."? > >>> The previous option is to log all !BA_DEFER requests in addition to > >>> ->delete_set, so those two block sets would be partially overlapping. > >> > >> Hmm.. Are they really overlapped? > >> > >> reiser4_dealloc_blocks() looks like the following: > >> { > >> if (flags & BA_DEFER) > >> log in the delete_set; > >> else > >> deallocate in working bitmap > >> ... > >> } > >> > >> so (!BA_DEFER) requests and the delete_set look disjoint, > >> or, I missed something? > > You're right, I've misread the code. I was under impression that > > reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER). > > > > So the second option (which I prefer) is to log all sa_dealloc_blocks() calls. > > That is, effectively, ->delete_set plus wandered journal block deallocations. > > > Yes, I think to do something like this: > > reiser4_dealloc_blocks() > { > if (flags & BA_DEFER) { > ... > } else { > ... > if (discard is turned on) > /* we'll want to discard these blocks, which are not to > be represented > in the COMMIT SPACE MAP, so store them in a > separate list */ > do { > blocknr_set_add_extent(atom, > &atom->delete_set_for_wander, &bsep, start, len); > ... > } while (ret == -E_REPEAT); > sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len); > ... > } > > Then join the sets (->delete_set, and ->delete_set_for_wander) at > discard time > (should be pretty cheap operation). insert a check at every atoms > fusion, that > ->delete_set_for_wander of both atoms are empty). Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets, but this means that we'll need to convert two blocknr_sets into an rb-tree at discard time, which pretty defeats their purpose in our context... Or do you say that blocknr_set itself should be made rb-tree? Thanks, -- Ivan Shapovalov / intelfx / > > > > > >>> And this option is to log only those !BA_DEFER requests which are not in > >>> ->delete_set, so no block will be stored twice. Thus we will have to consider > >>> both ->delete_set and this hypothetical ->discard_set at discard time. > >>> > >>>>> Or we could log wandered blocks straight into ->delete_set and do something in > >>>>> reiser4_post_commit_hook() to separate these entries, but this is super messy. > >>>>> > >>>>> I'm preferring the second way... Edward, please proof-read all this. > >>>> BTW, what about your current implementation? Does it work for you? > >>> Yes, it apparently does (no massive corruptions). > >>> > >>> I get batches of "orphan cluster" errors from time to time (from offset 0 till > >>> the supposed end of file), > >> > >> this is fsck complaints? > > Yes. (I do `fsck.reiser4 --build-fs` pretty routinely...) > > > >>> but 1) I do not see which files get corrupted, > >>> 2) this could be due to txmod=wa, 3) my machine is pretty crashy now due to > >>> some driver issues, so I don't know what exactly causes this. > >>> > >>> Thanks for proof-reading. What do you think about all this? > >> > >> I think that not bad for the beginning! I have purchased an SSD card, > >> hope to test the discard stuff in the near future. > > And I hope to finish this feature somewhen... :) > > > > Thanks, >
Attachment:
signature.asc
Description: This is a digitally signed message part.