OK, vger doesn't seem to like my patch, so I'll have to give a URL to it,
sorry.

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fsblock/2.6.27-rc5/fsb-preview.patch

I've been doing some work on fsblock again lately, so in case anybody might
find it interesting, here is a "preview" patch. Basically it compiles and
runs OK for me here, under a few stress tests. I wouldn't say it is close to
bug free, and it needs a lot of bits and pieces polished up, like error
handling.

I've also just stripped out the large block size support in the patch I'm
mailing out... I have been developing with ext2 without large block size
support, so those paths have rotted a bit, and besides, they still really
need a few more changes to some VM paths.

Since I last posted fsblock, there have been some big changes:

- Using a per-block spinlock to protect most accesses now. This eliminates
  some races I had with dirtying vs cleaning, and with fsblock refcounting
  and reclaim.

- fsblock_no_cache aka "nobh" mode now works well due to the above. When
  /proc/sys/vm/fsblock_no_cache is 1, you never get fsblocks hanging around
  longer than they have to. You also would never be subject to the circular
  referencing "orphan" pages that buffer heads are subject to.

- RCU is gone. This is actually a good thing because in "nobh" mode, some
  workloads will rapidly allocate and free the structures, and that can be
  costly with RCU.

- struct fsblock has shrunk to 32 bytes on 64-bit. That is less than 1/3 the
  size of struct buffer_head, although absolute size doesn't matter so much
  now (because of no_cache mode). I even have an optional feature, "bdflush",
  that increases the size... although I do want to keep it within 64 bytes
  (one cacheline).

- Added an "intermediate" mode which provides a ->data pointer in struct
  fsblock_meta, and means it is trivial to transition filesystems to fsblock
  (although they would not be able to support superpage blocks).

- Added ext2 intermediate support.

- Had to modify the VM a little bit in order to close races with freeing a
  page's fsblock before it can be cleaned (or while it still has a chance to
  be dirtied via mmap). fsblock of course ensures that zero memory
  allocations are required in the writeout path.

- Lockless pagecache has been merged in mainline, which means the largest
  granularity of synchronisation anywhere in the fsblock core code is on a
  per-page basis (buffer heads use the per-inode private_lock). This is one
  of the reasons I am skeptical that keeping pagecache state in extents is
  better: it would be rather impressive if it could match the straight line
  speed or scalability of fsblock.

- However, I *have* always agreed that it makes sense to keep (some) block
  state in extents, because that is going to change much less frequently,
  and should be represented with fewer extents provided the filesystem
  layout is reasonable. So I've written a (very) basic extent cache for
  block mappings, which can be used by filesystems that don't have good
  in-memory block mapping structures themselves (like ext2, for example).
  No reclaim for this at present; I should just add a simple shrinker.

- bdflush... it's commented out so it won't build by default, but basically,
  because fsblock properly keeps block dirty state in sync with page dirty
  state, I can keep a sorted structure of dirty fsblocks per device, and do
  writeout based on that rather than this fragile walking over inodes that
  pdflush does. Of course it won't work with delayed allocation, so
  something would have to be figured out for that (perhaps allocate all
  outstanding blocks before each writeout pass). The thing I like about
  bdflush is that it can easily do nice submit ordering of inter-file as
  well as file/metadata blocks for writeout. I don't know if it will come to
  anything, but at least it is not tightly coupled with the core fsblock
  stuff. It's a bit hacky at the moment ;)

- Still not using a private bdev for fsblock filesystems...
  I never got around to figuring out how to do this. This means that
  sometimes funny things will happen with the block_dev device if pages and
  buffers try to use it. It mostly works OK but is a hack that I need to
  fix.

- Finally, for those not listening last time: I'm doing block sizes larger
  than page size (up to 16MB IIRC, but easily expandable to much higher)
  with fsblock using exactly the same data structures, although I haven't
  included that in the patch here.
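
To make the extent cache for block mappings mentioned above a bit more
concrete, here is a minimal userspace sketch of the general idea: a sorted
array of (file block, disk block, length) extents with binary-search lookup
and merging of contiguous mappings. This is not the code from the patch. The
names struct extent, struct extent_cache, extent_cache_lookup and
extent_cache_insert, the fixed-size array, and the absence of locking and
reclaim are all assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration; not the structures in the patch. */
struct extent {
	uint64_t file_start;	/* first file (logical) block covered */
	uint64_t disk_start;	/* disk block that file_start maps to */
	uint64_t len;		/* number of contiguous blocks */
};

#define EXTENT_CACHE_MAX 64

struct extent_cache {
	/* Kept sorted by file_start; extents never overlap. */
	struct extent ext[EXTENT_CACHE_MAX];
	size_t nr;
};

/* Index of the first extent whose file_start is greater than blk. */
static size_t extent_upper_bound(const struct extent_cache *c, uint64_t blk)
{
	size_t lo = 0, hi = c->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (c->ext[mid].file_start <= blk)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Returns 1 and stores the disk block in *disk on a hit, 0 on a miss. */
static int extent_cache_lookup(const struct extent_cache *c, uint64_t blk,
			       uint64_t *disk)
{
	size_t i = extent_upper_bound(c, blk);
	const struct extent *e;

	if (i == 0)
		return 0;
	e = &c->ext[i - 1];
	if (blk - e->file_start >= e->len)
		return 0;
	*disk = e->disk_start + (blk - e->file_start);
	return 1;
}

/*
 * Insert a mapping. Returns 0 on success, -1 if the new extent overlaps an
 * existing one or the cache is full. Only merging with the preceding extent
 * is attempted, to keep the sketch short.
 */
static int extent_cache_insert(struct extent_cache *c, uint64_t file_start,
			       uint64_t disk_start, uint64_t len)
{
	size_t i = extent_upper_bound(c, file_start);

	assert(len > 0);

	if (i > 0) {
		struct extent *p = &c->ext[i - 1];

		/* Overlaps the preceding extent? */
		if (file_start - p->file_start < p->len)
			return -1;
	}
	/* Overlaps the following extent? */
	if (i < c->nr && c->ext[i].file_start - file_start < len)
		return -1;

	if (i > 0) {
		struct extent *p = &c->ext[i - 1];

		/* Contiguous in both file and disk space: just extend. */
		if (p->file_start + p->len == file_start &&
		    p->disk_start + p->len == disk_start) {
			p->len += len;
			return 0;
		}
	}
	if (c->nr == EXTENT_CACHE_MAX)
		return -1;

	memmove(&c->ext[i + 1], &c->ext[i], (c->nr - i) * sizeof(c->ext[0]));
	c->ext[i] = (struct extent){ file_start, disk_start, len };
	c->nr++;
	return 0;
}
```

With a reasonable on-disk layout, a large sequentially allocated file
collapses to a handful of entries here, which is the argument for keeping
block mappings (as opposed to pagecache state) in extents. A real in-kernel
version would need locking and a shrinker, neither of which is shown.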