Re: [patch] fsblock preview

Neil Brown <neilb@xxxxxxx> · Tue, 16 Sep 2008 21:35:16 +1000

On Monday September 15, npiggin@xxxxxxx wrote:
> OK, vger doesn't seem to like my patch, so I'll have to give a url to it,
> sorry.

Thank (A URL is lots better than nothing)!

Some comments.

1/ I think you need to include a design document as a patch to
   Documentation/something.  This would help me not to ask the same
   question twice, such as:

2/ Why do you allow pages in the same file to have either blocks or
   buffers.   Surely it should be set-in-concrete on a per-address-space
   (and probably per-filesystem) basis whether buffer_heads or blocks
   or something else is used.

   I know I've asked this before so I found the answer in my mail
   archives.
   Some filesystems store their metadata in the same address-space
   that block_device.c uses for reading/writing the device.  For this
   to work they must both use the same buffering strategy.

   My feeling is that the complexity that you add to switch between
   buffers and block on a per-page basis is too great a cost, and you
   should avoid this.
   My preferred approach would be to require the filesystem to use a
   separate address-space to store the metadata (even if it is still
   physically indexed).  However I suspect that would cause breakage
   for things like tune2fs as the caches would no longer be coherent.
   (Would that actually be a problem for anything other than the
    filesystem superblock?  In which case, just continue to use
   buffer_heads for the superblock and fsblock for the rest).

   The alternative would be for block_device.c to either use buffers
   or blocks depending on how the filesystem is using it.
   So on first open, default to using buffers.  When the filesystem
   bd_claims the device it can switch it to using blocks.  It already
   must potentially flush the whole cache if it is changing block
   size.  This is just a small extra step.  You would probably just
   replace the address_space_operations structure in something similar
   to set_blocksize, between sync_blockdev and kill_bdev.

   This might end up being more code, but it would remove some real
   ugliness from your current patch.

   At the very least, I think that the difference between using
   buffer_heads and fsblocks should be entirely encoded in the
   PageBlocks flags, and that PagePrivate should be left to mean
   exactly what it currently does:  That the ->private field is in
   use by something.

3/ Why do blocks hold a refcount on the page.  This seems wrong.
   The setting of PagePrivate (or PageBlock) is the equivalent of a
   refcount.  If it is set, then releasepage will be called to when
   the VM wants to drop the page, and invalidatepage will be called
   when it insists on dropping the page.  No need for a refcount.

   I found a hint in the patch which suggests that this is needed for 
   superpage support.  So when you truncate a file at the middle of a
   superpage, fsblock can still hold on to the pages that
   truncate_inode_pages wants to get rid of.  However as you didn't
   include that I cannot yet comment on whether that seems
   justifiable.

   As I think more on this, I'm wondering if I've missed something
   (entirely possible).  If each block does indeed hold a refcount on
   the page, then freeing idle pages which happen to have blocks
   attached is not ever going to happen.  This would be ok if you
   dropped the fsblocks as soon as they weren't in used, but your
   commentary seems to suggest that is an option, not a certainty. 
   Confused.

4/ In truncate_inode_pages_range, you have:

-			if (page_index > end) {
-				next = page_index;
+			next = page_index+1;
+			if (next-1 > end)
 				break;
-			}

-			if (page_index > next)
-				next = page_index;
-			next++;

  This reminds me of very similar code in invalidate_mapping_pages
  which has the comment
			/*
			 * We really shouldn't be looking at the ->index of an
			 * unlocked page.  But we're not allowed to lock these
			 * pages.  So we rely upon nobody altering the ->index
			 * of this (pinned-by-us) page.
			 */
  (inserted by a patch attributed to me, but I think the text is from
   Andrew Morton).
  The theory is that until you get the page lock, the ->mapping and
  ->index of a page can change (I'm not sure how.  Maybe splice??).
  Note that truncate_complete_page (Which is called fairly shortly
  after the patch included above) starts with
	if (page->mapping != mapping)
		return;

  which seems to justify the need for caution.

  Now, back to the change that you made:
   The "before" code will only ever increase next based on the
   (possibly random) value retrieved from ->index.  Your "after" code
   always sets it explicitly.   It could be argued that this is not
   as safe.

  I'm not really sure how important this all is, and whether ->index
  really can give a surprise value in this context.  But I thought I
  would mention it, and suggest that if you really want to make this
  change which (along with the reversal of trylock_page and testing
  PageWriteback) doesn't seem to be directly related to fsblock, then
  maybe it should be a separate patch so it could be reviewed more
  effectively.  Either way, it would be good to have the code in
  'truncate' and 'invalidate' as similar as possible, including
  comments.

5/ Still in truncate_inode_pages_range: what is the point of making
   any changes in there at all?  They seem to be just syntactic
   rearrangement with no obvious gain.
   I can see that you add a test for PageWriteback before
   trylock_page, which could avoid taking a page lock in some cases so
   that might be a micro-optimisation.
   I can also see that you have dropped a called to cond_resched(),
   which probably needs some justification.
   In any case, this should all be in a separate patch.

6/ I would love to have more words explaining the new
   address_space_operations.  We even have a file to contain them:
     Documentation/filesystems/vfs.txt.

   I'm guessing that the "any private data" is typically
   buffers/blocks holding metadata (e.g. indirect blocks)

   Hmmm.. looking further is seems that courtesy of struct mb_assoc,
   one fsblock of metadata can be shared by multiple address_spaces.
   I guess that is for block-allocation bitmaps etc.

   I'm beginning to wonder if this belongs in the address_space rather
   than directly in the inode.  Part of my reasoning for this is that
   I didn't think that i_data should normally be used in generic code,
   i_mapping should be used instead.  But all accesses to private_list
   and assoc_mapping use i_data, suggesting that it is really part of
   the inode.

   I realise that you didn't make it this way, but if it is a bad
   design (which I'm beginning to suspect) then we should perpetuate
   it by putting the 'release' and 'sync' operations in the
   address_space.  They should rather be inode_operations I think.

7/ Those kmallocs in fsblock_init are a worry.  Surely is isn't hard
   to come up with a scheme for static allocation.
   And I think you should remove "if (i < 30)" and just leave it as
   a simple 'else'.

8/ Given that sector_t is unsigned, code like
+	block->block_nr = -1;
   (in init_blocks) bothers me.
+	sector_t last_block_nr = (sector_t)-1ULL;
   Bothers me even more. "-1" as an unsigned long long ??
   I would much prefer you used (~0ULL).  But maybe that's just me. 

9/ I'm a bit confused by fsblock_writeout_data and its interaction
   with writeout_block and the PageWriteback flag.  This is in the
   subpage case.

   fsblock_writeout_data seems to want to write out lots of blocks
   in sequential order.
   It will find a block, lock the page, and (if the block is still
   dirty) start writeout using writeout_block.  This will set
   PageWriteback on the page.

   fsblock_writeout_data will often find the next block which is in
   the same page, but as PageWriteback is now set, it will not try to
   lock and writeout that block.  So I can imagine it writing out
   only the first block of each page.  Surely this cannot be
   intended.  Am I missing something?

10/  list_add_tail(new, head) is better than list_add(new, head->prev)

11/ nit.  checkpatch.pl reports.
  total: 67 errors, 141 warnings, 7239 lines checke
  I thought to run it because I noticed white-space-damage in
  blkdev_get_block. 

12/ In 'invalidate_block' you have commented out lock_block, and 
   put in a comment "why lock?".

   The reason to lock is to wait for pending writeout to complete.
   If the writeout was triggered by the VFS/VM, then the page will
   have PageWriteback set and we will already have waited for that to
   clear.  However if the writeout was triggered internally by the
   filesystem (such as for data ordering guarantees) then
   PageWriteback will not have been set, but the block (not the page)
   will be locked.

   At least that is how I understand buffer_heads to work.  However
   for fsblock you set PageWriteback whenever you write out any block
   on the page.  As noted in point 9 above,  I'm not convinced that is
   the correct thing to do.

And finally:

13/ 
+	/*
+	 * XXX: maybe use private alloc funcions so fses can embed block into
+	 * their fs-private block rather than using ->private? Maybe ->private
+	 * is easier though...
+	 */

   Yes, that's just what I was going to say.

   I find 'struct fsblock_meta' slightly confusing.   It extends
   'struct fsblock' in two ways.
   1/ It embeds fsblock in itself, just adding a 'data' pointer.
   2/ It sometimes sets ->private to a 'struct mb_assoc'.

   So it looks like you "have a bob each way".  Both embedding and using
   ->private.

   It seems to me that you have a core 'fsblock' subsystem, and a
   secondary fsblock_meta subsystem which would be useful for some
   filesystems and useless to other filesystems which handle metadata
   in a different way.
   It would be nice if these two were clearly delineated.
   It would also be nice if filesystems could extend fsblock in what
   ever way suited them.  
   Using embedding seems to be the most common approach in Linux these
   days, and saves a pointer.  Using ->private has the advantage of
   allowing you to allocate little arrays of fsblocks in common code
   because you "know" how big they should be.  More importantly it
   allows 'for_each_block' to work fairly easily.

   Given that the fs would need to allocate it's own 'private' info
   anyway, for_each_block is really the only benefit of using
   ->private.
   It shouldn't be to hard to use (say) 8 bits for fsblock to record
   the size of the full structure (much as you currently record the
   number of blocks-per-page) and then for_each_block can use that.

   You could even save a couple of bits but allocating one int plus
   N filesystem_block structures, and recording 'N' and 'sizeof
   filesystem_block'  in the 'int'.  You could even put the spinlock
   bit in that int rather than (ab)using the lsb of page->private.

I think that will do for now.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html