Re: reiser4: discard implementation, pass 2: allocation issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Monday 16 June 2014 at 11:24:39, Edward Shishkin wrote:	
> 
> On 06/16/2014 07:03 AM, Ivan Shapovalov wrote:
> > On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote:	
> >> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
> >>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:	
> >>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> >>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:	
> >>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> >>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
> >>>>>>> lifetime wrt. block allocation and deallocation.
> >>>>>>>
> >>>>>>>
> >>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> >>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
> >>>>>> Yes.
> >>>>>>
> >>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> >>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
> >>>>>> This is correct for the middle-level allocator (without suffix "bitmap").
> >>>>>> The low-level one frees blocks only in WORKING BITMAP.
> >>>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
> >>>>>
> >>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
> >>>>>> "Relocated" is bad term here: nodes with new data also get the flag
> >>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
> >>>>>> nodes of the atom to COMMIT BITMAP.
> >>>>>>
> >>>>>>
> >>>>>>>                                       deallocates ->delete_set from COMMIT BITMAP
> >>>>>> applies the atom's delete_set to COMMIT BITMAP
> >>>>> I've meant exactly this.
> >>>>>
> >>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> >>>>>>>                                                        (i. e. from WORKING BITMAP)
> >>>>>> applies the atom's delete_set to WORKING BITMAP.
> >>>>> ditto
> >>>>>
> >>>>>> I would also mention a function of the middle-level block allocator
> >>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING  BITMAP.
> >>>>> Yes, sure.
> >>>>>
> >>>>>
> >>>>>> Note that the middle-level block allocator (block_alloc.c) actually
> >>>>>> manipulates
> >>>>>> with abstract space maps. Currently in reiser4 they are represented only by
> >>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
> >>>>>> representation -
> >>>>>> extent tree (like in XFS). I don't see any needs for now, though.
> >>>>> Extent trees seem to be interesting.. They claim that they are more efficient -
> >>>>> is the difference that huge?
> >>>>>
> >>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> >>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> >>>>>>>       so WORKING BITMAP ensures that two nodes cannot get the same block;
> >>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> >>>>>>>       so their deallocation is not immediately reflected in WORKING BITMAP;
> >>>>>>> (the relocate set is written here)
> >>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> >>>>>>>       to convey effective bitmap changes into COMMIT BITMAP;
> >>>>>>> (the journal and overwrite set are written here)
> >>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> >>>>>>>       from step 2 to WORKING BITMAP.
> >>>>>>> (the discard happens here)
> >>>>>>>
> >>>>>>>
> >>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> >>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> >>>>>>>       are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> >>>>>>> - after writing wandered blocks, they are deallocated using
> >>>>>>>       reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
> >>>>>> So, the system of working and commit bitmaps plus the delete set seems
> >>>>>> to be redundant? I think this is because of performance reasons: block
> >>>>>> allocation is critical thing...
> >>>>> Seems like it is -- for performance and simplicity (== robustness).
> >>>>>
> >>>>>>> CONCLUSION
> >>>>>>> At possible transaction replay time, journal blocks are not allocated in any
> >>>>>>> of the bitmaps. However, because the journal is read and replayed before a
> >>>>>>> transaction has a chance to commit, this fact does not matter.
> >>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
> >>>>>>>
> >>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
> >>>>>>> (as you pointed it out) does not exist.
> >>>>>> It seems, you are right..
> >>>>>>
> >>>>>>> What exists is a rather different problem with my idea of "log every
> >>>>>>> deallocated block". Current implementation logs every block regardless of
> >>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
> >>>>>>>
> >>>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
> >>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
> >>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> >>>>>>> the reasoning behind this decision would not be obvious for a casual
> >>>>>>> code reader.
> >>>>>> I think that a good comment will save the situation..
> >>>>>>
> >>>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
> >>>>>>> at discard time, but this is messy and requires us to merge the discard log
> >>>>>>> with ->delete_set at discard time.
> >>>>>> what is the difference with the previous "we could.."?
> >>>>> The previous option is to log all !BA_DEFER requests in addition to
> >>>>> ->delete_set, so those two block sets would be partially overlapping.
> >>>> Hmm.. Are they really overlapped?
> >>>>
> >>>> reiser4_dealloc_blocks() looks like the following:
> >>>> {
> >>>> if (flags & BA_DEFER)
> >>>>            log in the delete_set;
> >>>> else
> >>>>            deallocate in working bitmap
> >>>> ...
> >>>> }
> >>>>
> >>>> so (!BA_DEFER) requests and the delete_set look disjoint,
> >>>> or, I missed something?
> >>> You're right, I've misread the code. I was under impression that
> >>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
> >>>
> >>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
> >>> That is, effectively, ->delete_set plus wandered journal block deallocations.
> >>
> >> Yes, I think to do something like this:
> >>
> >> reiser4_dealloc_blocks()
> >> {
> >> if (flags & BA_DEFER) {
> >>           ...
> >>           } else {
> >>           ...
> >>           if (discard is turned on)
> >>                   /* we'll want to discard these blocks, which are not to
> >> be represented
> >>                       in the COMMIT SPACE MAP, so store them in a
> >> separate list */
> >>                   do {
> >>                           blocknr_set_add_extent(atom,
> >> &atom->delete_set_for_wander, &bsep, start, len);
> >>                           ...
> >>                   } while (ret == -E_REPEAT);
> >> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
> >>           ...
> >> }
> >>
> >> Then join the sets (->delete_set, and ->delete_set_for_wander) at
> >> discard time
> >> (should be pretty cheap operation). insert a check at every atoms
> >> fusion, that
> >> ->delete_set_for_wander of both atoms are empty).
> > Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets,
> > but this means that we'll need to convert two blocknr_sets into an rb-tree
> > at discard time, which pretty defeats their purpose in our context... Or do you
> > say that blocknr_set itself should be made rb-tree?
> 
> I didn't know about the delete sets. They are not optional and lists are
> better for them (as they don't require sorting). So I suggest to confine
> with lists for now, and may be to try the fast-merge trees in the future
> (if there is a great desire).

The discard algorithm still needs its input to be sorted, and blocknr_set is
inherently unordered.

So there are two options:
- log into ->delete_set and ->delete_set_for_wander (as you've proposed),
then merge them into a sorted list at discard time;
- log into ->discard_set (just like it is now), just fix what to log.

What do you think?

-- 
Ivan Shapovalov / intelfx /

Attachment: signature.asc
Description: This is a digitally signed message part.


[Index of Archives]     [Linux File System Development]     [Linux BTRFS]     [Linux NFS]     [Linux Filesystems]     [Ext4 Filesystem]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Resources]

  Powered by Linux