Thanks for the review so far, and sorry for not getting back to you
earlier. Just starting to get back into the swing of things.

On Tue, Sep 16, 2008 at 09:35:16PM +1000, Neil Brown wrote:
> On Monday September 15, npiggin@xxxxxxx wrote:
> > OK, vger doesn't seem to like my patch, so I'll have to give a url
> > to it, sorry.
>
> Thanks (a URL is lots better than nothing)!
>
> Some comments.
>
> 1/ I think you need to include a design document as a patch to
>    Documentation/something.  This would help me not to ask the same
>    question twice, such as:

Probably, or at least more documentation in the .c files. But the code
really had been in flux up until not long ago -- I switched from using
atomic bitops on the fsblock flags to having a lock to protect them. I
don't anticipate further changes on that scale, but a few things may
still be required, e.g. if I look at switching ext3 to fsblock...

But anyway, now that it is settling down I will make a point of adding
more documentation. Good point.

> 2/ Why do you allow pages in the same file to have either blocks or
>    buffers.  Surely it should be set-in-concrete on a
>    per-address-space (and probably per-filesystem) basis whether
>    buffer_heads or blocks or something else is used.

Yeah, this is a hack because I'm using the block device for both types
of filesystems...

> I know I've asked this before so I found the answer in my mail
> archives.
> Some filesystems store their metadata in the same address-space
> that block_device.c uses for reading/writing the device.  For this
> to work they must both use the same buffering strategy.
>
> My feeling is that the complexity that you add to switch between
> buffers and blocks on a per-page basis is too great a cost, and you
> should avoid this.
> My preferred approach would be to require the filesystem to use a
> separate address-space to store the metadata (even if it is still
> physically indexed).
> However I suspect that would cause breakage for things like tune2fs,
> as the caches would no longer be coherent.
> (Would that actually be a problem for anything other than the
> filesystem superblock?  In which case, just continue to use
> buffer_heads for the superblock and fsblock for the rest).
>
> The alternative would be for block_device.c to either use buffers
> or blocks depending on how the filesystem is using it.
> So on first open, default to using buffers.  When the filesystem
> bd_claims the device it can switch it to using blocks.  It already
> must potentially flush the whole cache if it is changing block
> size.  This is just a small extra step.  You would probably just
> replace the address_space_operations structure in something similar
> to set_blocksize, between sync_blockdev and kill_bdev.
>
> This might end up being more code, but it would remove some real
> ugliness from your current patch.

Definitely that is the plan, which Christoph also suggested a year ago.
It will make all of that much nicer. Coherency should not be a problem,
because for block devices we have the big-hammer coherency of flushing
on close, and I suspect online fs tools would use ioctls or syscalls
into the filesystem rather than raw bdev access.

The issue is that I just never spent enough time figuring out how to do
this. I ended up with code that "works", so there wasn't enough
pressure to change it :P Actually it is now the most ugly part of the
patchset, so I should really fix this too.

> At the very least, I think that the difference between using
> buffer_heads and fsblocks should be entirely encoded in the
> PageBlocks flag, and that PagePrivate should be left to mean
> exactly what it currently does: that the ->private field is in
> use by something.

It's even better than that. When fsblock filesystems use their own
linear address space for metadata, PageBlocks can go away completely.
PagePrivate, to the VM, just means that it must call into the aops in
some situations.
Then the aops themselves decide what to do... so if we have no "mixed"
buffer and fsblock aops, then there is no need to distinguish :)

> 3/ Why do blocks hold a refcount on the page.  This seems wrong.
>    The setting of PagePrivate (or PageBlocks) is the equivalent of a
>    refcount.  If it is set, then releasepage will be called when
>    the VM wants to drop the page, and invalidatepage will be called
>    when it insists on dropping the page.  No need for a refcount.

Yeah, well... this is just how it has been done. The VM expects a
refcount, but I agree it doesn't really need an extra one.

> I found a hint in the patch which suggests that this is needed for
> superpage support.  So when you truncate a file at the middle of a
> superpage, fsblock can still hold on to the pages that
> truncate_inode_pages wants to get rid of.  However as you didn't
> include that I cannot yet comment on whether that seems
> justifiable.

Yeah, that's one of the tricky places that needs a few more VM changes
to really get right.

> As I think more on this, I'm wondering if I've missed something
> (entirely possible).  If each block does indeed hold a refcount on
> the page, then freeing idle pages which happen to have blocks
> attached is not ever going to happen.  This would be ok if you
> dropped the fsblocks as soon as they weren't in use, but your
> commentary seems to suggest that is an option, not a certainty.
> Confused.

Yeah, it's just that we already expect PagePrivate to elevate the
refcount everywhere... actually if it didn't, I think the VM might not
reclaim those pages ;) Buffer heads do the same thing.
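To make the convention in point 3 concrete, here is a toy model -- all
struct and function names below are made up for illustration, this is
not the real page cache API. The point is just that setting the
private flag is paired with taking an extra page reference, and
clearing it drops that reference again, which is what both buffer
heads and fsblock currently do:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins (NOT kernel code) for struct page, PG_private and the
 * attach/detach of ->private data. */
struct toy_page {
	int refcount;		/* stands in for the page's refcount */
	unsigned flags;		/* bit 0 stands in for PG_private */
	void *private;		/* stands in for page->private */
};

#define TOY_PG_PRIVATE 0x1u

static void toy_attach_private(struct toy_page *page, void *data)
{
	page->refcount++;		/* the "extra" refcount Neil questions */
	page->flags |= TOY_PG_PRIVATE;
	page->private = data;
}

static void *toy_detach_private(struct toy_page *page)
{
	void *data = page->private;

	page->private = NULL;
	page->flags &= ~TOY_PG_PRIVATE;
	page->refcount--;		/* drop the reference again on release */
	return data;
}
```

So under this scheme an idle page with blocks attached always has one
more reference than an idle bare page, and only releasepage /
invalidatepage (which detach the private data) can bring it back down
-- which is exactly the behaviour being debated above.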
> 4/ In truncate_inode_pages_range, you have:
>
> -			if (page_index > end) {
> -				next = page_index;
> +			next = page_index+1;
> +			if (next-1 > end)
> 				break;
> -			}
>
> -			if (page_index > next)
> -				next = page_index;
> -			next++;
>
>    This reminds me of very similar code in invalidate_mapping_pages
>    which has the comment
>	/*
>	 * We really shouldn't be looking at the ->index of an
>	 * unlocked page.  But we're not allowed to lock these
>	 * pages.  So we rely upon nobody altering the ->index
>	 * of this (pinned-by-us) page.
>	 */
>    (inserted by a patch attributed to me, but I think the text is
>    from Andrew Morton).
>    The theory is that until you get the page lock, the ->mapping and
>    ->index of a page can change (I'm not sure how.  Maybe splice??).
>    Note that truncate_complete_page (which is called fairly shortly
>    after the patch included above) starts with
>	if (page->mapping != mapping)
>		return;
>
>    which seems to justify the need for caution.
>
>    Now, back to the change that you made:
>    The "before" code will only ever increase next based on the
>    (possibly random) value retrieved from ->index.  Your "after"
>    code always sets it explicitly.  It could be argued that this is
>    not as safe.

Well yeah, it's a hack, because the radix tree range operations are
not really good for the task at hand. Ideally they would obey ranges
and operate on an iterator, or at least return the next range to start
looking at for the next lookup.

Actually, index should not change, and mapping should only ever go to
NULL, I think. But people get paranoid about truncate and do funny
checks (which sometimes serve to confuse the issue).

But yeah, most of the changes to generic code need to be put into
separate patches, and in some cases (possibly this one) the change is
just a relic and no longer required. Don't worry too much about these
for the moment.

> I'm not really sure how important this all is, and whether ->index
> really can give a surprise value in this context.
> But I thought I would mention it, and suggest that if you really
> want to make this change, which (along with the reversal of
> trylock_page and testing PageWriteback) doesn't seem to be directly
> related to fsblock, then maybe it should be a separate patch so it
> could be reviewed more effectively.  Either way, it would be good to
> have the code in 'truncate' and 'invalidate' as similar as possible,
> including comments.

Agreed.

> 5/ Still in truncate_inode_pages_range: what is the point of making
>    any changes in there at all?  They seem to be just syntactic
>    rearrangement with no obvious gain.
>    I can see that you add a test for PageWriteback before
>    trylock_page, which could avoid taking a page lock in some cases,
>    so that might be a micro-optimisation.
>    I can also see that you have dropped a call to cond_resched(),
>    which probably needs some justification.
>    In any case, this should all be in a separate patch.

Yep.

> 6/ I would love to have more words explaining the new
>    address_space_operations.  We even have a file to contain them:
>    Documentation/filesystems/vfs.txt.

Yes, they would all need to be in individual patches, properly
documented etc.

> I'm guessing that the "any private data" is typically
> buffers/blocks holding metadata (e.g. indirect blocks)
>
> Hmmm.. looking further it seems that courtesy of struct mb_assoc,
> one fsblock of metadata can be shared by multiple address_spaces.
> I guess that is for block-allocation bitmaps etc.

Yes.

> I'm beginning to wonder if this belongs in the address_space rather
> than directly in the inode.  Part of my reasoning for this is that
> I didn't think that i_data should normally be used in generic code;
> i_mapping should be used instead.  But all accesses to private_list
> and assoc_mapping use i_data, suggesting that it is really part of
> the inode.
> I realise that you didn't make it this way, but if it is a bad
> design (which I'm beginning to suspect) then we should not
> perpetuate it by putting the 'release' and 'sync' operations in the
> address_space.

They should rather be inode_operations, I think.

Hmm... I don't know the coda code at all, but AFAIK it doesn't really
make sense to use both i_mapping and i_data at the same time if they
are different. At any rate, I guess we get away with it because
nothing that uses buffer heads switches the i_mapping?

> 7/ Those kmallocs in fsblock_init are a worry.  Surely it isn't hard
>    to come up with a scheme for static allocation.
>    And I think you should remove "if (i < 30)" and just leave it as
>    a simple 'else'.

Oh? I don't see such a problem, except that I suppose I should panic
if they fail (rather than oops strangely).

> 8/ Given that sector_t is unsigned, code like
> +	block->block_nr = -1;
>    (in init_blocks) bothers me.
> +	sector_t last_block_nr = (sector_t)-1ULL;
>    bothers me even more.  "-1" as an unsigned long long ??
>    I would much prefer you used (~0ULL).  But maybe that's just me.

I guess (sector_t)ULLONG_MAX ?

> 9/ I'm a bit confused by fsblock_writeout_data and its interaction
>    with writeout_block and the PageWriteback flag.  This is in the
>    subpage case.
>
>    fsblock_writeout_data seems to want to write out lots of blocks
>    in sequential order.
>    It will find a block, lock the page, and (if the block is still
>    dirty) start writeout using writeout_block.  This will set
>    PageWriteback on the page.
>
>    fsblock_writeout_data will often find the next block which is in
>    the same page, but as PageWriteback is now set, it will not try
>    to lock and write out that block.  So I can imagine it writing
>    out only the first block of each page.  Surely this cannot be
>    intended.  Am I missing something?

No... it's just been ignored :P It's not a trivial problem, but for
the trivial case at least (blocks linear within the page), I should
definitely be doing page-based writeout.
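On point 8's sentinel question, for what it's worth: assuming a 64-bit
unsigned sector_t (the real typedef depends on the kernel config; the
`toy_sector_t` name here is just for this sketch), all three spellings
under discussion produce the same all-ones value, so the argument is
purely about readability:

```c
#include <limits.h>

/* Sketch outside the kernel: with a 64-bit unsigned sector type, the
 * patch's cast of -1, Neil's preferred ~0, and ULLONG_MAX are all the
 * same bit pattern. */
typedef unsigned long long toy_sector_t;

static const toy_sector_t from_minus_one = (toy_sector_t)-1;        /* what the patch does */
static const toy_sector_t from_all_ones  = ~(toy_sector_t)0;        /* Neil's (~0ULL)      */
static const toy_sector_t from_max       = (toy_sector_t)ULLONG_MAX;/* suggested above     */
```

Conversion of -1 to an unsigned type is well defined in C (modulo
2^N), so none of the variants is more or less correct; ULLONG_MAX
arguably states the intent ("largest possible block number") most
directly.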
> 10/ list_add_tail(new, head) is better than list_add(new, head->prev)

OK.

> 11/ nit.  checkpatch.pl reports
>	total: 67 errors, 141 warnings, 7239 lines checked
>     I thought to run it because I noticed white-space damage in
>     blkdev_get_block.
>
> 12/ In 'invalidate_block' you have commented out lock_block, and
>     put in a comment "why lock?".
>
>     The reason to lock is to wait for pending writeout to complete.
>     If the writeout was triggered by the VFS/VM, then the page will
>     have PageWriteback set and we will already have waited for that
>     to clear.  However if the writeout was triggered internally by
>     the filesystem (such as for data ordering guarantees) then
>     PageWriteback will not have been set, but the block (not the
>     page) will be locked.
>
>     At least that is how I understand buffer_heads to work.  However
>     for fsblock you set PageWriteback whenever you write out any
>     block on the page.  As noted in point 9 above, I'm not convinced
>     that is the correct thing to do.

Yeah, fsblock keeps dirty and writeout coherent, and uses only the
page lock (rather than the page and buffer locks that buffer heads
use) for fsblock-internal locking. I have not found any cases where
the locking is needed, but I was doing it in case minix/xfs/ext2/etc
required it for some reason. But if I can have a clean break, I'll
just require filesystems to do their own locking rather than use the
lock here ;)

In fsblock I'm never going to add code with comments that look like
/* I think reiserfs does this */, /* Does this still happen? */ etc ;)

> And finally:
>
> 13/
> +	/*
> +	 * XXX: maybe use private alloc functions so fses can embed
> +	 * block into their fs-private block rather than using
> +	 * ->private? Maybe ->private is easier though...
> +	 */
>
>     Yes, that's just what I was going to say.
>
>     I find 'struct fsblock_meta' slightly confusing.  It extends
>     'struct fsblock' in two ways.
>     1/ It embeds fsblock in itself, just adding a 'data' pointer.
>     2/ It sometimes sets ->private to a 'struct mb_assoc'.
>
> So it looks like you "have a bob each way".  Both embedding and
> using ->private.

Yep. It's a bigger problem for the fsblock core, because I'd need to
do mindless duplication of a lot of function names if fsblock were not
in the first part of fsblock_meta. For filesystems, I think they would
know far more often whether a given path is using data or metadata
blocks, so I think the separation is still a reasonable idea.

> It seems to me that you have a core 'fsblock' subsystem, and a
> secondary fsblock_meta subsystem which would be useful for some
> filesystems and useless to other filesystems which handle metadata
> in a different way.
> It would be nice if these two were clearly delineated.

They somewhat are, but they share so many common helpers that they
seem pretty entangled. Filesystems should definitely be able to use
either the data or the metadata handling functions, though...

> It would also be nice if filesystems could extend fsblock in
> whatever way suited them.
> Using embedding seems to be the most common approach in Linux these
> days, and saves a pointer.  Using ->private has the advantage of
> allowing you to allocate little arrays of fsblocks in common code
> because you "know" how big they should be.  More importantly it
> allows 'for_each_block' to work fairly easily.
>
> Given that the fs would need to allocate its own 'private' info
> anyway, for_each_block is really the only benefit of using
> ->private.
> It shouldn't be too hard to use (say) 8 bits for fsblock to record
> the size of the full structure (much as you currently record the
> number of blocks-per-page) and then for_each_block can use that.
>
> You could even save a couple of bits by allocating one int plus
> N filesystem_block structures, and recording 'N' and 'sizeof
> filesystem_block' in the 'int'.

I have considered approaches like that...
It's not a clear win, given that in other cases the filesystem may not
always want its private data to be attached and allocated, or it may
need to attach different types of private data. But yes, there are
pros and cons to each.

Basically, though, there was no clear fundamental reason I could see
that makes one the best way to go, so I decided to use the existing
scheme so as not to make my life harder when converting filesystems ;)

> You could even put the spinlock bit in that int rather than
> (ab)using the lsb of page->private.

Well, page->private of an fsblock-using filesystem's pages belongs to
the fsblock subsystem, so it's not abuse. It's neat to have the lock
there because I can take it without having any fsblocks allocated.

> I think that will do for now.

Thanks, you make some very good points.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html