On Monday 16 June 2014 at 11:24:39, Edward Shishkin wrote: > > On 06/16/2014 07:03 AM, Ivan Shapovalov wrote: > > On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote: > >> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote: > >>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote: > >>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote: > >>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote: > >>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote: > >>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's > >>>>>>> lifetime wrt. block allocation and deallocation. > >>>>>>> > >>>>>>> > >>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS > >>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP > >>>>>> Yes. > >>>>>> > >>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP > >>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set > >>>>>> This is correct for the middle-level allocator (without suffix "bitmap"). > >>>>>> The low-level one frees blocks only in WORKING BITMAP. > >>>>> Yes, this was a wording mistake. I've noticed it shortly after sending... > >>>>> > >>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP > >>>>>> "Relocated" is bad term here: nodes with new data also get the flag > >>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated > >>>>>> nodes of the atom to COMMIT BITMAP. > >>>>>> > >>>>>> > >>>>>>> deallocates ->delete_set from COMMIT BITMAP > >>>>>> applies the atom's delete_set to COMMIT BITMAP > >>>>> I've meant exactly this. > >>>>> > >>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER > >>>>>>> (i. e. from WORKING BITMAP) > >>>>>> applies the atom's delete_set to WORKING BITMAP. > >>>>> ditto > >>>>> > >>>>>> I would also mention a function of the middle-level block allocator > >>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP. > >>>>> Yes, sure. > >>>>> > >>>>> > >>>>>> Note that the middle-level block allocator (block_alloc.c) actually > >>>>>> manipulates > >>>>>> with abstract space maps. Currently in reiser4 they are represented only by > >>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another > >>>>>> representation - > >>>>>> extent tree (like in XFS). I don't see any needs for now, though. > >>>>> Extent trees seem to be interesting.. They claim that they are more efficient - > >>>>> is the difference that huge? > >>>>> > >>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT > >>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC, > >>>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block; > >>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER), > >>>>>>> so their deallocation is not immediately reflected in WORKING BITMAP; > >>>>>>> (the relocate set is written here) > >>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set > >>>>>>> to convey effective bitmap changes into COMMIT BITMAP; > >>>>>>> (the journal and overwrite set are written here) > >>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations > >>>>>>> from step 2 to WORKING BITMAP. > >>>>>>> (the discard happens here) > >>>>>>> > >>>>>>> > >>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS > >>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they > >>>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks; > >>>>>>> - after writing wandered blocks, they are deallocated using > >>>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP. > >>>>>> So, the system of working and commit bitmaps plus the delete set seems > >>>>>> to be redundant? I think this is because of performance reasons: block > >>>>>> allocation is critical thing... > >>>>> Seems like it is -- for performance and simplicity (== robustness). > >>>>> > >>>>>>> CONCLUSION > >>>>>>> At possible transaction replay time, journal blocks are not allocated in any > >>>>>>> of the bitmaps. However, because the journal is read and replayed before a > >>>>>>> transaction has a chance to commit, this fact does not matter. > >>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP. > >>>>>>> > >>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak > >>>>>>> (as you pointed it out) does not exist. > >>>>>> It seems, you are right.. > >>>>>> > >>>>>>> What exists is a rather different problem with my idea of "log every > >>>>>>> deallocated block". Current implementation logs every block regardless of > >>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice. > >>>>>>> > >>>>>>> We could just use ->delete_set, but we would lose wandered blocks then. > >>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing > >>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but > >>>>>>> the reasoning behind this decision would not be obvious for a casual > >>>>>>> code reader. > >>>>>> I think that a good comment will save the situation.. > >>>>>> > >>>>>>> Or we could log only wandered blocks (in addition to ->delete_set) > >>>>>>> at discard time, but this is messy and requires us to merge the discard log > >>>>>>> with ->delete_set at discard time. > >>>>>> what is the difference with the previous "we could.."? > >>>>> The previous option is to log all !BA_DEFER requests in addition to > >>>>> ->delete_set, so those two block sets would be partially overlapping. > >>>> Hmm.. Are they really overlapped? > >>>> > >>>> reiser4_dealloc_blocks() looks like the following: > >>>> { > >>>> if (flags & BA_DEFER) > >>>> log in the delete_set; > >>>> else > >>>> deallocate in working bitmap > >>>> ... > >>>> } > >>>> > >>>> so (!BA_DEFER) requests and the delete_set look disjoint, > >>>> or, I missed something? > >>> You're right, I've misread the code. I was under impression that > >>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER). > >>> > >>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls. > >>> That is, effectively, ->delete_set plus wandered journal block deallocations. > >> > >> Yes, I think to do something like this: > >> > >> reiser4_dealloc_blocks() > >> { > >> if (flags & BA_DEFER) { > >> ... > >> } else { > >> ... > >> if (discard is turned on) > >> /* we'll want to discard these blocks, which are not to > >> be represented > >> in the COMMIT SPACE MAP, so store them in a > >> separate list */ > >> do { > >> blocknr_set_add_extent(atom, > >> &atom->delete_set_for_wander, &bsep, start, len); > >> ... > >> } while (ret == -E_REPEAT); > >> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len); > >> ... > >> } > >> > >> Then join the sets (->delete_set, and ->delete_set_for_wander) at > >> discard time > >> (should be pretty cheap operation). insert a check at every atoms > >> fusion, that > >> ->delete_set_for_wander of both atoms are empty). > > Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets, > > but this means that we'll need to convert two blocknr_sets into an rb-tree > > at discard time, which pretty defeats their purpose in our context... Or do you > > say that blocknr_set itself should be made rb-tree? > > I didn't know about the delete sets. They are not optional and lists are > better for them (as they don't require sorting). So I suggest to confine > with lists for now, and may be to try the fast-merge trees in the future > (if there is a great desire). The discard algorithm still needs its input to be sorted, and blocknr_set is inherently unordered. So there are two options: - log into ->delete_set and ->delete_set_for_wander (as you've proposed), then merge them into a sorted list at discard time; - log into ->discard_set (just like it is now), just fix what to log. What do you think? -- Ivan Shapovalov / intelfx /
Attachment:
signature.asc
Description: This is a digitally signed message part.