Re: reiser4: discard implementation, pass 2: allocation issues

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Monday 16 June 2014 at 13:32:02, Edward Shishkin wrote:	
> On 06/16/2014 01:00 PM, Ivan Shapovalov wrote:
> > On Monday 16 June 2014 at 11:24:39, Edward Shishkin wrote:	
> >> On 06/16/2014 07:03 AM, Ivan Shapovalov wrote:
> >>> On Monday 16 June 2014 at 02:14:49, Edward Shishkin wrote:	
> >>>> On 06/15/2014 11:58 PM, Ivan Shapovalov wrote:
> >>>>> On Sunday 15 June 2014 at 23:49:59, Edward Shishkin wrote:	
> >>>>>> On 06/15/2014 08:07 PM, Ivan Shapovalov wrote:
> >>>>>>> On Sunday 15 June 2014 at 19:36:05, Edward Shishkin wrote:	
> >>>>>>>> On 06/13/2014 10:28 PM, Ivan Shapovalov wrote:
> >>>>>>>>> Here is my "analysis" of what happens in reiser4 during a transaction's
> >>>>>>>>> lifetime wrt. block allocation and deallocation.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> THE EFFECTS (SEMANTICS) OF RELATED FUNCTIONS
> >>>>>>>>> reiser4_alloc_blocks_bitmap(): allocates in WORKING BITMAP
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>> reiser4_dealloc_blocks_bitmap(!BA_DEFER): deallocates from WORKING BITMAP
> >>>>>>>>> reiser4_dealloc_blocks_bitmap(BA_DEFER): stores to ->delete_set
> >>>>>>>> This is correct for the middle-level allocator (without suffix "bitmap").
> >>>>>>>> The low-level one frees blocks only in WORKING BITMAP.
> >>>>>>> Yes, this was a wording mistake. I've noticed it shortly after sending...
> >>>>>>>
> >>>>>>>>> reiser4_pre_commit_hook_bitmap(): allocates all relocated nodes in COMMIT BITMAP
> >>>>>>>> "Relocated" is bad term here: nodes with new data also get the flag
> >>>>>>>> JNODE_RELOC. So, I would rather say, that it applies freshly allocated
> >>>>>>>> nodes of the atom to COMMIT BITMAP.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>                                        deallocates ->delete_set from COMMIT BITMAP
> >>>>>>>> applies the atom's delete_set to COMMIT BITMAP
> >>>>>>> I've meant exactly this.
> >>>>>>>
> >>>>>>>>> reiser4_post_commit_hook(): deallocates ->delete_set using !BA_DEFER
> >>>>>>>>>                                                         (i. e. from WORKING BITMAP)
> >>>>>>>> applies the atom's delete_set to WORKING BITMAP.
> >>>>>>> ditto
> >>>>>>>
> >>>>>>>> I would also mention a function of the middle-level block allocator
> >>>>>>>> reiser4_alloc_blocks(): allocates blocks in WORKING  BITMAP.
> >>>>>>> Yes, sure.
> >>>>>>>
> >>>>>>>
> >>>>>>>> Note that the middle-level block allocator (block_alloc.c) actually
> >>>>>>>> manipulates
> >>>>>>>> with abstract space maps. Currently in reiser4 they are represented only by
> >>>>>>>> bitmaps (plugin/space/bitmap.c). We can also implement another
> >>>>>>>> representation -
> >>>>>>>> extent tree (like in XFS). I don't see any needs for now, though.
> >>>>>>> Extent trees seem to be interesting.. They claim that they are more efficient -
> >>>>>>> is the difference that huge?
> >>>>>>>
> >>>>>>>>> TIMELINE OF ALLOCATIONS FOR "USUAL" NODES, AND TIMELINE OF TRANSACTION COMMIT
> >>>>>>>>> - nodes are allocated using reiser4_alloc_blocks() and setting JNODE_RELOC,
> >>>>>>>>>        so WORKING BITMAP ensures that two nodes cannot get the same block;
> >>>>>>>>> - nodes are deallocated using reiser4_dealloc_blocks(BA_DEFER),
> >>>>>>>>>        so their deallocation is not immediately reflected in WORKING BITMAP;
> >>>>>>>>> (the relocate set is written here)
> >>>>>>>>> - reiser4_pre_commit_hook_bitmap() uses 1) JNODE_RELOC flag and 2) ->delete_set
> >>>>>>>>>        to convey effective bitmap changes into COMMIT BITMAP;
> >>>>>>>>> (the journal and overwrite set are written here)
> >>>>>>>>> - reiser4_post_commit_hook() uses ->delete_set to convey deallocations
> >>>>>>>>>        from step 2 to WORKING BITMAP.
> >>>>>>>>> (the discard happens here)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> TIMELINE OF ALLOCATIONS FOR WANDERED JOURNAL BLOCKS
> >>>>>>>>> - at commit time, blocks are allocated using reiser4_alloc_blocks(), so they
> >>>>>>>>>        are allocated in WORKING BITMAP and do not interfere with any "usual" blocks;
> >>>>>>>>> - after writing wandered blocks, they are deallocated using
> >>>>>>>>>        reiser4_dealloc_blocks(!BA_DEFER), i. e. from the WORKING BITMAP.
> >>>>>>>> So, the system of working and commit bitmaps plus the delete set seems
> >>>>>>>> to be redundant? I think this is because of performance reasons: block
> >>>>>>>> allocation is critical thing...
> >>>>>>> Seems like it is -- for performance and simplicity (== robustness).
> >>>>>>>
> >>>>>>>>> CONCLUSION
> >>>>>>>>> At possible transaction replay time, journal blocks are not allocated in any
> >>>>>>>>> of the bitmaps. However, because the journal is read and replayed before a
> >>>>>>>>> transaction has a chance to commit, this fact does not matter.
> >>>>>>>>> What matters is that wandered journal blocks never hit COMMIT BITMAP.
> >>>>>>>>>
> >>>>>>>>> So, if I've got all this correct (which I highly doubt), the disk space leak
> >>>>>>>>> (as you pointed it out) does not exist.
> >>>>>>>> It seems, you are right..
> >>>>>>>>
> >>>>>>>>> What exists is a rather different problem with my idea of "log every
> >>>>>>>>> deallocated block". Current implementation logs every block regardless of
> >>>>>>>>> BA_DEFER flag presence or absence, so non-wandered blocks are logged twice.
> >>>>>>>>>
> >>>>>>>>> We could just use ->delete_set, but we would lose wandered blocks then.
> >>>>>>>>> Or we could only log !BA_DEFER requests, which would do the right thing
> >>>>>>>>> (wandered blocks + deallocations from reiser4_post_commit_hook()), but
> >>>>>>>>> the reasoning behind this decision would not be obvious for a casual
> >>>>>>>>> code reader.
> >>>>>>>> I think that a good comment will save the situation..
> >>>>>>>>
> >>>>>>>>> Or we could log only wandered blocks (in addition to ->delete_set)
> >>>>>>>>> at discard time, but this is messy and requires us to merge the discard log
> >>>>>>>>> with ->delete_set at discard time.
> >>>>>>>> what is the difference with the previous "we could.."?
> >>>>>>> The previous option is to log all !BA_DEFER requests in addition to
> >>>>>>> ->delete_set, so those two block sets would be partially overlapping.
> >>>>>> Hmm.. Are they really overlapped?
> >>>>>>
> >>>>>> reiser4_dealloc_blocks() looks like the following:
> >>>>>> {
> >>>>>> if (flags & BA_DEFER)
> >>>>>>             log in the delete_set;
> >>>>>> else
> >>>>>>             deallocate in working bitmap
> >>>>>> ...
> >>>>>> }
> >>>>>>
> >>>>>> so (!BA_DEFER) requests and the delete_set look disjoint,
> >>>>>> or, I missed something?
> >>>>> You're right, I've misread the code. I was under impression that
> >>>>> reiser4_post_commit_hook() internally calls reiser4_dealloc_blocks(!BA_DEFER).
> >>>>>
> >>>>> So the second option (which I prefer) is to log all sa_dealloc_blocks() calls.
> >>>>> That is, effectively, ->delete_set plus wandered journal block deallocations.
> >>>> Yes, I think to do something like this:
> >>>>
> >>>> reiser4_dealloc_blocks()
> >>>> {
> >>>> if (flags & BA_DEFER) {
> >>>>            ...
> >>>>            } else {
> >>>>            ...
> >>>>            if (discard is turned on)
> >>>>                    /* we'll want to discard these blocks, which are not to
> >>>> be represented
> >>>>                        in the COMMIT SPACE MAP, so store them in a
> >>>> separate list */
> >>>>                    do {
> >>>>                            blocknr_set_add_extent(atom,
> >>>> &atom->delete_set_for_wander, &bsep, start, len);
> >>>>                            ...
> >>>>                    } while (ret == -E_REPEAT);
> >>>> sa_dealloc_blocks(reiser4_get_space_allocator(ctx->super), *start, *len);
> >>>>            ...
> >>>> }
> >>>>
> >>>> Then join the sets (->delete_set, and ->delete_set_for_wander) at
> >>>> discard time
> >>>> (should be pretty cheap operation). insert a check at every atoms
> >>>> fusion, that
> >>>> ->delete_set_for_wander of both atoms are empty).
> >>> Hm. IIRC, ultimately we wanted to use rb-trees for storing discard block sets,
> >>> but this means that we'll need to convert two blocknr_sets into an rb-tree
> >>> at discard time, which pretty defeats their purpose in our context... Or do you
> >>> say that blocknr_set itself should be made rb-tree?
> >> I didn't know about the delete sets. They are not optional and lists are
> >> better for them (as they don't require sorting). So I suggest to confine
> >> with lists for now, and may be to try the fast-merge trees in the future
> >> (if there is a great desire).
> > The discard algorithm still needs its input to be sorted, and blocknr_set is
> > inherently unordered.
> >
> > So there are two options:
> > - log into ->delete_set and ->delete_set_for_wander (as you've proposed),
> > then merge them into a sorted list at discard time;
> > - log into ->discard_set (just like it is now), just fix what to log.
> >
> > What do you think?
> 
> 
> Are the blocknr sets comfortable?
> Say, can we organize the discard process via the blocknr_set_iterator()?
> If yes, then let's do everything via the blocknr sets (i.e. let's 
> implement the
> first option). Add needed operations to blocknrset.c, the discard iteration
> actor to discard.c, etc.

No, they aren't, and this is a problem. As I said, they are not just unsorted,
they are unsortable as I understand them.

-- 
Ivan Shapovalov / intelfx /

Attachment: signature.asc
Description: This is a digitally signed message part.


[Index of Archives]     [Linux File System Development]     [Linux BTRFS]     [Linux NFS]     [Linux Filesystems]     [Ext4 Filesystem]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Resources]

  Powered by Linux