On Monday 16 June 2014 at 13:32:02, Edward Shishkin wrote: > On 06/16/2014 01:00 PM, Ivan Shapovalov wrote: > > On Monday 16 June 2014 at 11:24:39, Edward Shishkin wrote: > >> On 06/16/2014 07:03 AM, Ivan Shapovalov wrote: > >>> On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote: > >>>> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote: > >>>>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote: > >>>>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote: > >>>>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote: > >>>>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote: > >>>>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's > >>>>>>>>> lifetime wrt. block allocation and deallocation. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS > >>>>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP > >>>>>>>> Yes. > >>>>>>>> > >>>>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP > >>>>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set > >>>>>>>> This is correct for the middle-level allocator (without suffix "bitmap"). > >>>>>>>> The low-level one frees blocks only in WORKING BITMAP. > >>>>>>> Yes, this was a wording mistake. I've noticed it shortly after sending... > >>>>>>> > >>>>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP > >>>>>>>> "Relocated" is bad term here: nodes with new data also get the flag > >>>>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated > >>>>>>>> nodes of the atom to COMMIT BITMAP. > >>>>>>>> > >>>>>>>> > >>>>>>>>> deallocates ->delete_set from COMMIT BITMAP > >>>>>>>> applies the atom's delete_set to COMMIT BITMAP > >>>>>>> I've meant exactly this. > >>>>>>> > >>>>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER > >>>>>>>>> (i. e. from WORKING BITMAP) > >>>>>>>> applies the atom's delete_set to WORKING BITMAP. > >>>>>>> ditto > >>>>>>> > >>>>>>>> I would also mention a function of the middle-level block allocator > >>>>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING BITMAP. > >>>>>>> Yes, sure. > >>>>>>> > >>>>>>> > >>>>>>>> Note that the middle-level block allocator (block_alloc.c) actually > >>>>>>>> manipulates > >>>>>>>> with abstract space maps. Currently in reiser4 they are represented only by > >>>>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another > >>>>>>>> representation - > >>>>>>>> extent tree (like in XFS). I don't see any needs for now, though. > >>>>>>> Extent trees seem to be interesting.. They claim that they are more efficient - > >>>>>>> is the difference that huge? > >>>>>>> > >>>>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT > >>>>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC, > >>>>>>>>> so WORKING BITMAP ensures that two nodes cannot get the same block; > >>>>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER), > >>>>>>>>> so their deallocation is not immediately reflected in WORKING BITMAP; > >>>>>>>>> (the relocate set is written here) > >>>>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set > >>>>>>>>> to convey effective bitmap changes into COMMIT BITMAP; > >>>>>>>>> (the journal and overwrite set are written here) > >>>>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations > >>>>>>>>> from step 2 to WORKING BITMAP. > >>>>>>>>> (the discard happens here) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS > >>>>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they > >>>>>>>>> are allocated in WORKING BITMAP and do not interfere with any "usual" blocks; > >>>>>>>>> - after writing wandered blocks, they are deallocated using > >>>>>>>>> reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP. > >>>>>>>> So, the system of working and commit bitmaps plus the delete set seems > >>>>>>>> to be redundant? I think this is because of performance reasons: block > >>>>>>>> allocation is critical thing... > >>>>>>> Seems like it is -- for performance and simplicity (== robustness). > >>>>>>> > >>>>>>>>> CONCLUSION > >>>>>>>>> At possible transaction replay time, journal blocks are not allocated in any > >>>>>>>>> of the bitmaps. However, because the journal is read and replayed before a > >>>>>>>>> transaction has a chance to commit, this fact does not matter. > >>>>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP. > >>>>>>>>> > >>>>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak > >>>>>>>>> (as you pointed it out) does not exist. > >>>>>>>> It seems, you are right.. > >>>>>>>> > >>>>>>>>> What exists is a rather different problem with my idea of "log every > >>>>>>>>> deallocated block". Current implementation logs every block regardless of > >>>>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice. > >>>>>>>>> > >>>>>>>>> We could just use ->delete_set, but we would lose wandered blocks then. > >>>>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing > >>>>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but > >>>>>>>>> the reasoning behind this decision would not be obvious for a casual > >>>>>>>>> code reader. > >>>>>>>> I think that a good comment will save the situation.. > >>>>>>>> > >>>>>>>>> Or we could log only wandered blocks (in addition to ->delete_set) > >>>>>>>>> at discard time, but this is messy and requires us to merge the discard log > >>>>>>>>> with ->delete_set at discard time. > >>>>>>>> what is the difference with the previous "we could.."? > >>>>>>> The previous option is to log all !BA_DEFER requests in addition to > >>>>>>> ->delete_set, so those two block sets would be partially overlapping. > >>>>>> Hmm.. Are they really overlapped? > >>>>>> > >>>>>> reiser4_dealloc_blocks() looks like the following: > >>>>>> { > >>>>>> if (flags & BA_DEFER) > >>>>>> log in the delete_set; > >>>>>> else > >>>>>> deallocate in working bitmap > >>>>>> ... > >>>>>> } > >>>>>> > >>>>>> so (!BA_DEFER) requests and the delete_set look disjoint, > >>>>>> or, I missed something? > >>>>> You're right, I've misread the code. I was under impression that > >>>>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER). > >>>>> > >>>>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls. > >>>>> That is, effectively, ->delete_set plus wandered journal block deallocations. > >>>> Yes, I think to do something like this: > >>>> > >>>> reiser4_dealloc_blocks() > >>>> { > >>>> if (flags & BA_DEFER) { > >>>> ... > >>>> } else { > >>>> ... > >>>> if (discard is turned on) > >>>> /* we'll want to discard these blocks, which are not to > >>>> be represented > >>>> in the COMMIT SPACE MAP, so store them in a > >>>> separate list */ > >>>> do { > >>>> blocknr_set_add_extent(atom, > >>>> &atom->delete_set_for_wander, &bsep, start, len); > >>>> ... > >>>> } while (ret == -E_REPEAT); > >>>> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len); > >>>> ... > >>>> } > >>>> > >>>> Then join the sets (->delete_set, and ->delete_set_for_wander) at > >>>> discard time > >>>> (should be pretty cheap operation). insert a check at every atoms > >>>> fusion, that > >>>> ->delete_set_for_wander of both atoms are empty). > >>> Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets, > >>> but this means that we'll need to convert two blocknr_sets into an rb-tree > >>> at discard time, which pretty defeats their purpose in our context... Or do you > >>> say that blocknr_set itself should be made rb-tree? > >> I didn't know about the delete sets. They are not optional and lists are > >> better for them (as they don't require sorting). So I suggest to confine > >> with lists for now, and may be to try the fast-merge trees in the future > >> (if there is a great desire). > > The discard algorithm still needs its input to be sorted, and blocknr_set is > > inherently unordered. > > > > So there are two options: > > - log into ->delete_set and ->delete_set_for_wander (as you've proposed), > > then merge them into a sorted list at discard time; > > - log into ->discard_set (just like it is now), just fix what to log. > > > > What do you think? > > > Are the blocknr sets comfortable? > Say, can we organize the discard process via the blocknr_set_iterator()? > If yes, then let's do everything via the blocknr sets (i.e. let's > implement the > first option). Add needed operations to blocknrset.c, the discard iteration > actor to discard.c, etc. No, they aren't, and this is a problem. As I said, they are not just unsorted, they are unsortable as I understand them. -- Ivan Shapovalov / intelfx /
Attachment:
signature.asc
Description: This is a digitally signed message part.