On Monday September 15, npiggin@xxxxxxx wrote: > OK, vger doesn't seem to like my patch, so I'll have to give a url to it, > sorry. Thank (A URL is lots better than nothing)! Some comments. 1/ I think you need to include a design document as a patch to Documentation/something. This would help me not to ask the same question twice, such as: 2/ Why do you allow pages in the same file to have either blocks or buffers. Surely it should be set-in-concrete on a per-address-space (and probably per-filesystem) basis whether buffer_heads or blocks or something else is used. I know I've asked this before so I found the answer in my mail archives. Some filesystems store their metadata in the same address-space that block_device.c uses for reading/writing the device. For this to work they must both use the same buffering strategy. My feeling is that the complexity that you add to switch between buffers and block on a per-page basis is too great a cost, and you should avoid this. My preferred approach would be to require the filesystem to use a separate address-space to store the metadata (even if it is still physically indexed). However I suspect that would cause breakage for things like tune2fs as the caches would no longer be coherent. (Would that actually be a problem for anything other than the filesystem superblock? In which case, just continue to use buffer_heads for the superblock and fsblock for the rest). The alternative would be for block_device.c to either use buffers or blocks depending on how the filesystem is using it. So on first open, default to using buffers. When the filesystem bd_claims the device it can switch it to using blocks. It already must potentially flush the whole cache if it is changing block size. This is just a small extra step. You would probably just replace the address_space_operations structure in something similar to set_blocksize, between sync_blockdev and kill_bdev. This might end up being more code, but it would remove some real ugliness from your current patch. At the very least, I think that the difference between using buffer_heads and fsblocks should be entirely encoded in the PageBlocks flags, and that PagePrivate should be left to mean exactly what it currently does: That the ->private field is in use by something. 3/ Why do blocks hold a refcount on the page. This seems wrong. The setting of PagePrivate (or PageBlock) is the equivalent of a refcount. If it is set, then releasepage will be called to when the VM wants to drop the page, and invalidatepage will be called when it insists on dropping the page. No need for a refcount. I found a hint in the patch which suggests that this is needed for superpage support. So when you truncate a file at the middle of a superpage, fsblock can still hold on to the pages that truncate_inode_pages wants to get rid of. However as you didn't include that I cannot yet comment on whether that seems justifiable. As I think more on this, I'm wondering if I've missed something (entirely possible). If each block does indeed hold a refcount on the page, then freeing idle pages which happen to have blocks attached is not ever going to happen. This would be ok if you dropped the fsblocks as soon as they weren't in used, but your commentary seems to suggest that is an option, not a certainty. Confused. 4/ In truncate_inode_pages_range, you have: - if (page_index > end) { - next = page_index; + next = page_index+1; + if (next-1 > end) break; - } - if (page_index > next) - next = page_index; - next++; This reminds me of very similar code in invalidate_mapping_pages which has the comment /* * We really shouldn't be looking at the ->index of an * unlocked page. But we're not allowed to lock these * pages. So we rely upon nobody altering the ->index * of this (pinned-by-us) page. */ (inserted by a patch attributed to me, but I think the text is from Andrew Morton). The theory is that until you get the page lock, the ->mapping and ->index of a page can change (I'm not sure how. Maybe splice??). Note that truncate_complete_page (Which is called fairly shortly after the patch included above) starts with if (page->mapping != mapping) return; which seems to justify the need for caution. Now, back to the change that you made: The "before" code will only ever increase next based on the (possibly random) value retrieved from ->index. Your "after" code always sets it explicitly. It could be argued that this is not as safe. I'm not really sure how important this all is, and whether ->index really can give a surprise value in this context. But I thought I would mention it, and suggest that if you really want to make this change which (along with the reversal of trylock_page and testing PageWriteback) doesn't seem to be directly related to fsblock, then maybe it should be a separate patch so it could be reviewed more effectively. Either way, it would be good to have the code in 'truncate' and 'invalidate' as similar as possible, including comments. 5/ Still in truncate_inode_pages_range: what is the point of making any changes in there at all? They seem to be just syntactic rearrangement with no obvious gain. I can see that you add a test for PageWriteback before trylock_page, which could avoid taking a page lock in some cases so that might be a micro-optimisation. I can also see that you have dropped a called to cond_resched(), which probably needs some justification. In any case, this should all be in a separate patch. 6/ I would love to have more words explaining the new address_space_operations. We even have a file to contain them: Documentation/filesystems/vfs.txt. I'm guessing that the "any private data" is typically buffers/blocks holding metadata (e.g. indirect blocks) Hmmm.. looking further is seems that courtesy of struct mb_assoc, one fsblock of metadata can be shared by multiple address_spaces. I guess that is for block-allocation bitmaps etc. I'm beginning to wonder if this belongs in the address_space rather than directly in the inode. Part of my reasoning for this is that I didn't think that i_data should normally be used in generic code, i_mapping should be used instead. But all accesses to private_list and assoc_mapping use i_data, suggesting that it is really part of the inode. I realise that you didn't make it this way, but if it is a bad design (which I'm beginning to suspect) then we should perpetuate it by putting the 'release' and 'sync' operations in the address_space. They should rather be inode_operations I think. 7/ Those kmallocs in fsblock_init are a worry. Surely is isn't hard to come up with a scheme for static allocation. And I think you should remove "if (i < 30)" and just leave it as a simple 'else'. 8/ Given that sector_t is unsigned, code like + block->block_nr = -1; (in init_blocks) bothers me. + sector_t last_block_nr = (sector_t)-1ULL; Bothers me even more. "-1" as an unsigned long long ?? I would much prefer you used (~0ULL). But maybe that's just me. 9/ I'm a bit confused by fsblock_writeout_data and its interaction with writeout_block and the PageWriteback flag. This is in the subpage case. fsblock_writeout_data seems to want to write out lots of blocks in sequential order. It will find a block, lock the page, and (if the block is still dirty) start writeout using writeout_block. This will set PageWriteback on the page. fsblock_writeout_data will often find the next block which is in the same page, but as PageWriteback is now set, it will not try to lock and writeout that block. So I can imagine it writing out only the first block of each page. Surely this cannot be intended. Am I missing something? 10/ list_add_tail(new, head) is better than list_add(new, head->prev) 11/ nit. checkpatch.pl reports. total: 67 errors, 141 warnings, 7239 lines checke I thought to run it because I noticed white-space-damage in blkdev_get_block. 12/ In 'invalidate_block' you have commented out lock_block, and put in a comment "why lock?". The reason to lock is to wait for pending writeout to complete. If the writeout was triggered by the VFS/VM, then the page will have PageWriteback set and we will already have waited for that to clear. However if the writeout was triggered internally by the filesystem (such as for data ordering guarantees) then PageWriteback will not have been set, but the block (not the page) will be locked. At least that is how I understand buffer_heads to work. However for fsblock you set PageWriteback whenever you write out any block on the page. As noted in point 9 above, I'm not convinced that is the correct thing to do. And finally: 13/ + /* + * XXX: maybe use private alloc funcions so fses can embed block into + * their fs-private block rather than using ->private? Maybe ->private + * is easier though... + */ Yes, that's just what I was going to say. I find 'struct fsblock_meta' slightly confusing. It extends 'struct fsblock' in two ways. 1/ It embeds fsblock in itself, just adding a 'data' pointer. 2/ It sometimes sets ->private to a 'struct mb_assoc'. So it looks like you "have a bob each way". Both embedding and using ->private. It seems to me that you have a core 'fsblock' subsystem, and a secondary fsblock_meta subsystem which would be useful for some filesystems and useless to other filesystems which handle metadata in a different way. It would be nice if these two were clearly delineated. It would also be nice if filesystems could extend fsblock in what ever way suited them. Using embedding seems to be the most common approach in Linux these days, and saves a pointer. Using ->private has the advantage of allowing you to allocate little arrays of fsblocks in common code because you "know" how big they should be. More importantly it allows 'for_each_block' to work fairly easily. Given that the fs would need to allocate it's own 'private' info anyway, for_each_block is really the only benefit of using ->private. It shouldn't be to hard to use (say) 8 bits for fsblock to record the size of the full structure (much as you currently record the number of blocks-per-page) and then for_each_block can use that. You could even save a couple of bits but allocating one int plus N filesystem_block structures, and recording 'N' and 'sizeof filesystem_block' in the 'int'. You could even put the spinlock bit in that int rather than (ab)using the lsb of page->private. I think that will do for now. NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html