On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin <npiggin@xxxxxxx> wrote: > Hi, > > This is my proposal for a (hopefully) backwards compatible rd driver. > The motivation for me is not pressing, except that I have this code > sitting here that is either going to rot or get merged. I'm happy to > make myself maintainer of this code, but if any real block device > driver writer would like to, that would be fine by me ;) > > Comments? > > -- > > This is a rewrite of the ramdisk block device driver. > > The old one is really difficult because it effectively implements a block > device which serves data out of its own buffer cache. It relies on the dirty > bit being set, to pin its backing store in cache, however there are non > trivial paths which can clear the dirty bit (eg. try_to_free_buffers()), > which had recently lead to data corruption. And in general it is completely > wrong for a block device driver to do this. > > The new one is more like a regular block device driver. It has no idea > about vm/vfs stuff. It's backing store is similar to the buffer cache > (a simple radix-tree of pages), but it doesn't know anything about page > cache (the pages in the radix tree are not pagecache pages). > > There is one slight downside -- direct block device access and filesystem > metadata access goes through an extra copy and gets stored in RAM twice. > However, this downside is only slight, because the real buffercache of the > device is now reclaimable (because we're not playing crazy games with it), > so under memory intensive situations, footprint should effectively be the > same -- maybe even a slight advantage to the new driver because it can also > reclaim buffer heads. what about mmap(/dev/ram0)? > The fact that it now goes through all the regular vm/fs paths makes it > much more useful for testing, too. > > text data bss dec hex filename > 2837 849 384 4070 fe6 drivers/block/rd.o > 3528 371 12 3911 f47 drivers/block/brd.o > > Text is larger, but data and bss are smaller, making total size smaller. > > A few other nice things about it: > - Similar structure and layout to the new loop device handlinag. > - Dynamic ramdisk creation. > - Runtime flexible buffer head size (because it is no longer part of the > ramdisk code). > - Boot / load time flexible ramdisk size, which could easily be extended > to a per-ramdisk runtime changeable size (eg. with an ioctl). This ramdisk driver can use highmem whereas the existing one can't (yes?). That's a pretty major difference. Plus look at the revoltingness in rd.c's mapping_set_gfp_mask()s. > +++ linux-2.6/drivers/block/brd.c > @@ -0,0 +1,536 @@ > +/* > + * Ram backed block device driver. > + * > + * Copyright (C) 2007 Nick Piggin > + * Copyright (C) 2007 Novell Inc. > + * > + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright > + * of their respective owners. > + */ > + > +#include <linux/init.h> > +#include <linux/module.h> > +#include <linux/moduleparam.h> > +#include <linux/major.h> > +#include <linux/blkdev.h> > +#include <linux/bio.h> > +#include <linux/highmem.h> > +#include <linux/gfp.h> > +#include <linux/radix-tree.h> > +#include <linux/buffer_head.h> /* invalidate_bh_lrus() */ > + > +#include <asm/uaccess.h> > + > +#define SECTOR_SHIFT 9 That's our third definition of SECTOR_SHIFT. > +#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) > +#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT) > + > +/* > + * Each block ramdisk device has a radix_tree brd_pages of pages that stores > + * the pages containing the block device's contents. A brd page's ->index is > + * its offset in PAGE_SIZE units. This is similar to, but in no way connected > + * with, the kernel's pagecache or buffer cache (which sit above our block > + * device). > + */ > +struct brd_device { > + int brd_number; > + int brd_refcnt; > + loff_t brd_offset; > + loff_t brd_sizelimit; > + unsigned brd_blocksize; > + > + struct request_queue *brd_queue; > + struct gendisk *brd_disk; > + struct list_head brd_list; > + > + /* > + * Backing store of pages and lock to protect it. This is the contents > + * of the block device. > + */ > + spinlock_t brd_lock; > + struct radix_tree_root brd_pages; > +}; > + > +/* > + * Look up and return a brd's page for a given sector. > + */ > +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) > +{ > + unsigned long idx; Could use pgoff_t here if you think that's clearer. > + struct page *page; > + > + /* > + * The page lifetime is protected by the fact that we have opened the > + * device node -- brd pages will never be deleted under us, so we > + * don't need any further locking or refcounting. > + * > + * This is strictly true for the radix-tree nodes as well (ie. we > + * don't actually need the rcu_read_lock()), however that is not a > + * documented feature of the radix-tree API so it is better to be > + * safe here (we don't have total exclusion from radix tree updates > + * here, only deletes). > + */ > + rcu_read_lock(); > + idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */ > + page = radix_tree_lookup(&brd->brd_pages, idx); > + rcu_read_unlock(); > + > + BUG_ON(page && page->index != idx); > + > + return page; > +} > + > +/* > + * Look up and return a brd's page for a given sector. > + * If one does not exist, allocate an empty page, and insert that. Then > + * return it. > + */ > +static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) > +{ > + unsigned long idx; > + struct page *page; > + > + page = brd_lookup_page(brd, sector); > + if (page) > + return page; > + > + page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO); Why GFP_NOIO? Have you thought about __GFP_MOVABLE treatment here? Seems pretty desirable as this sucker can be large. > + if (!page) > + return NULL; > + > + if (radix_tree_preload(GFP_NOIO)) { > + __free_page(page); > + return NULL; > + } > + > + spin_lock(&brd->brd_lock); > + idx = sector >> PAGE_SECTORS_SHIFT; > + if (radix_tree_insert(&brd->brd_pages, idx, page)) { > + __free_page(page); > + page = radix_tree_lookup(&brd->brd_pages, idx); > + BUG_ON(!page); > + BUG_ON(page->index != idx); > + } else > + page->index = idx; > + spin_unlock(&brd->brd_lock); > + > + radix_tree_preload_end(); > + > + return page; > +} > + > +/* > + * Free all backing store pages and radix tree. This must only be called when > + * there are no other users of the device. > + */ > +#define FREE_BATCH 16 > +static void brd_free_pages(struct brd_device *brd) > +{ > + unsigned long pos = 0; > + struct page *pages[FREE_BATCH]; > + int nr_pages; > + > + do { > + int i; > + > + nr_pages = radix_tree_gang_lookup(&brd->brd_pages, > + (void **)pages, pos, FREE_BATCH); > + > + for (i = 0; i < nr_pages; i++) { > + void *ret; > + > + BUG_ON(pages[i]->index < pos); > + pos = pages[i]->index; > + ret = radix_tree_delete(&brd->brd_pages, pos); > + BUG_ON(!ret || ret != pages[i]); > + __free_page(pages[i]); > + } > + > + pos++; > + > + } while (nr_pages == FREE_BATCH); > +} I have vague memories that radix_tree_gang_lookup()'s "naive" implementation may return fewer items than you asked for even when there are more items remaining - when it hits certain boundaries. > +/* > + * copy_to_brd_setup must be called before copy_to_brd. It may sleep. > + */ > +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n) > +{ > + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; > + size_t copy; > + > + copy = min((unsigned long)n, PAGE_SIZE - offset); use min_t. Or, better, sort out the types. > + if (!brd_insert_page(brd, sector)) > + return -ENOMEM; > + if (copy < n) { > + sector += copy >> SECTOR_SHIFT; > + if (!brd_insert_page(brd, sector)) > + return -ENOMEM; > + } > + return 0; > +} > + > +/* > + * Copy n bytes from src to the brd starting at sector. Does not sleep. > + */ > +static void copy_to_brd(struct brd_device *brd, const void *src, > + sector_t sector, size_t n) > +{ > + struct page *page; > + void *dst; > + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; > + size_t copy; > + > + copy = min((unsigned long)n, PAGE_SIZE - offset); Ditto. > + page = brd_lookup_page(brd, sector); > + BUG_ON(!page); > + > + dst = kmap_atomic(page, KM_USER1); > + memcpy(dst + offset, src, copy); > + kunmap_atomic(dst, KM_USER1); Might need flush_dcache_page() if mmap gets sorted out. > + if (copy < n) { > + src += copy; > + sector += copy >> SECTOR_SHIFT; > + copy = n - copy; > + page = brd_lookup_page(brd, sector); > + BUG_ON(!page); > + > + dst = kmap_atomic(page, KM_USER1); > + memcpy(dst, src, copy); > + kunmap_atomic(dst, KM_USER1); > + } > +} > + > +/* > + * Copy n bytes to dst from the brd starting at sector. Does not sleep. > + */ > +static void copy_from_brd(void *dst, struct brd_device *brd, > + sector_t sector, size_t n) > +{ > + struct page *page; > + const void *src; > + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; > + size_t copy; > + > + copy = min((unsigned long)n, PAGE_SIZE - offset); tritto. > + page = brd_lookup_page(brd, sector); > + if (page) { > + src = kmap_atomic(page, KM_USER1); > + memcpy(dst, src + offset, copy); > + kunmap_atomic(src, KM_USER1); > + } else > + memset(dst, 0, copy); > + > + if (copy < n) { > + dst += copy; > + sector += copy >> SECTOR_SHIFT; > + copy = n - copy; > + page = brd_lookup_page(brd, sector); > + if (page) { > + src = kmap_atomic(page, KM_USER1); > + memcpy(dst, src, copy); > + kunmap_atomic(src, KM_USER1); > + } else > + memset(dst, 0, copy); > + } > +} > + > +/* > + * Process a single bvec of a bio. > + */ > +static int brd_do_bvec(struct brd_device *brd, struct page *page, > + unsigned int len, unsigned int off, int rw, > + sector_t sector) > +{ > + void *mem; > + int err = 0; > + > + if (rw != READ) { > + err = copy_to_brd_setup(brd, sector, len); > + if (err) > + goto out; > + } > + > + mem = kmap_atomic(page, KM_USER0); > + if (rw == READ) { > + copy_from_brd(mem + off, brd, sector, len); > + flush_dcache_page(page); hm, there's a flush_dcache_page(). I guess you've throught it through ;) > + } else > + copy_to_brd(brd, mem + off, sector, len); > + kunmap_atomic(mem, KM_USER0); > + > +out: > + return err; > +} > + > > ... > > +static int brd_ioctl(struct inode *inode, struct file *file, > + unsigned int cmd, unsigned long arg) > +{ > + int error; > + struct block_device *bdev = inode->i_bdev; > + struct brd_device *brd = bdev->bd_disk->private_data; > + > + if (cmd != BLKFLSBUF) > + return -ENOTTY; > + > + /* > + * ram device BLKFLSBUF has special semantics, we want to actually > + * release and destroy the ramdisk data. > + */ > + mutex_lock(&bdev->bd_mutex); > + error = -EBUSY; > + if (bdev->bd_openers <= 1) { > + /* > + * Invalidate the cache first, so it isn't written > + * back to the device. > + */ > + invalidate_bh_lrus(); > + truncate_inode_pages(bdev->bd_inode->i_mapping, 0); hm, some other thread can instantiate pagecache here. I guess it's always been like that and there's not a lot we can (or should) do about it. > + brd_free_pages(brd); > + error = 0; > + } > + mutex_unlock(&bdev->bd_mutex); > + > + return error; > +} > + > > ... > > --- linux-2.6.orig/fs/buffer.c > +++ linux-2.6/fs/buffer.c > @@ -1436,6 +1436,7 @@ void invalidate_bh_lrus(void) > { > on_each_cpu(invalidate_bh_lru, NULL, 1, 1); > } > +EXPORT_SYMBOL_GPL(invalidate_bh_lrus); Maybe create a new helper function which does invalidate_bh_lrus()+truncate_inode_pages(), call that from kill_bdev() and here, make invalidate_bh_lrus() static. That's a separate patch, I guess. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html