Re: [PATCH 02/35] Btrfs multi-device code

On Mon, 2009-01-12 at 22:23 -0800, Andrew Morton wrote:
> On Wed,  7 Jan 2009 22:56:52 -0500 Chris Mason <chris.mason@xxxxxxxxxx> wrote:
> 
> > All block and extent pointers in btrfs use a logical address space.  This file
> > is responsible for mapping that logical address to a physical block on disk.
> > 
> > It implements raid0, raid1 and raid10.  The block groups used by extent-tree.c
> > are allocated from each device in chunks of ~1GB in size.  This file maintains
> > the chunk tree and device allocation tree that are used to record the
> > areas of each disk that are in use.
> > 
> > Each super block has a bootstrapping section that is used to translate
> > the extents in the chunk tree.  The chunk tree is used to translate all
> > of the other extents in the filesystem to a physical block on disk.
> 
> Where would one go to find out about the user-visible management of
> these devices?

Hi Andrew,

Thanks for all the time you've spent here.

The progs are growing manpages, and the wiki has the user-visible side
documented.

http://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
http://btrfs.wiki.kernel.org/index.php/Category:Documentation

> > This file also provides async bio submission code.  The bios are often
> > sent down by async helper threads that are supposed to be working on
> > checksumming or other cpu intensive tasks.  We don't want those tasks
> > waiting for a free request, so a helper thread is used here that is willing
> > to actually wait on the drive.
> 
> Seems odd.  Why have a thread to wait on IO?  Why not just
> fire-and-forget, and then only block if some caller wishes to use the
> results of that IO?
> 

If there were a submit_bio that never waited for a request, I'd use it ;)
This extra thread pool just makes sure the checksumming threads don't
get stuck in get_request_wait().

Throttling happens when async things are added to the checksumming queue
instead.

> >
> > ...
> >
> > +struct map_lookup {
> > +	u64 type;
> > +	int io_align;
> > +	int io_width;
> > +	int stripe_len;
> > +	int sector_size;
> > +	int num_stripes;
> > +	int sub_stripes;
> > +	struct btrfs_bio_stripe stripes[];
> > +};
> 
> this structure looks interesting.

There are a few places that need to know all of the locations a logical
block might map to on the disk.  It mainly gets used by the error
handling code so that it can loop through all the copies of a block when
checksums don't match or we get io errors.

But yes, will commit.

> > +static int init_first_rw_device(struct btrfs_trans_handle *trans,
> > +				struct btrfs_root *root,
> > +				struct btrfs_device *device);
> > +static int btrfs_relocate_sys_chunks(struct btrfs_root *root);
> > +
> > +#define map_lookup_size(n) (sizeof(struct map_lookup) + \
> > +			    (sizeof(struct btrfs_bio_stripe) * (n)))
> > +
> > +static DEFINE_MUTEX(uuid_mutex);
> > +static LIST_HEAD(fs_uuids);
> > +
> > +void btrfs_lock_volumes(void)
> > +{
> > +	mutex_lock(&uuid_mutex);
> > +}
> > +
> > +void btrfs_unlock_volumes(void)
> > +{
> > +	mutex_unlock(&uuid_mutex);
> > +}
> > +
> > +static void lock_chunks(struct btrfs_root *root)
> > +{
> > +	mutex_lock(&root->fs_info->chunk_mutex);
> > +}
> > +
> > +static void unlock_chunks(struct btrfs_root *root)
> > +{
> > +	mutex_unlock(&root->fs_info->chunk_mutex);
> > +}
> > +
> > +static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
> > +{
> > +	struct btrfs_device *device;
> > +	WARN_ON(fs_devices->opened);
> > +	while (!list_empty(&fs_devices->devices)) {
> > +		device = list_entry(fs_devices->devices.next,
> > +				    struct btrfs_device, dev_list);
> > +		list_del(&device->dev_list);
> > +		kfree(device->name);
> > +		kfree(device);
> > +	}
> > +	kfree(fs_devices);
> > +}
> 
> Called under btrfs_lock_volumes(), one assumes/hopes.  A
> WARN_ON(!mutex_is_locked()) is a nice way of documenting such things.

Good point, thanks

> 
> > +int btrfs_cleanup_fs_uuids(void)
> > +{
> > +	struct btrfs_fs_devices *fs_devices;
> > +
> > +	while (!list_empty(&fs_uuids)) {
> > +		fs_devices = list_entry(fs_uuids.next,
> > +					struct btrfs_fs_devices, list);
> > +		list_del(&fs_devices->list);
> > +		free_fs_devices(fs_devices);
> > +	}
> > +	return 0;
> > +}
> 
> Ditto.
> 
> > +static noinline struct btrfs_device *__find_device(struct list_head *head,
> > +						   u64 devid, u8 *uuid)
> 
> It's not possible for the code reader to work out why all these
> noinlines are here (this reader, at least).

This is the "please don't make huge functions because we don't have
infinite stack size" noinline.  Eric Sandeen had a patch that gave that
a real name, I'll dig it up and switch to it.

> 
> > +{
> > +	struct btrfs_device *dev;
> > +	struct list_head *cur;
> > +
> > +	list_for_each(cur, head) {
> > +		dev = list_entry(cur, struct btrfs_device, dev_list);
> > +		if (dev->devid == devid &&
> > +		    (!uuid || !memcmp(dev->uuid, uuid, BTRFS_UUID_SIZE))) {
> > +			return dev;
> > +		}
> > +	}
> > +	return NULL;
> > +}
> > +
> > +static noinline struct btrfs_fs_devices *find_fsid(u8 *fsid)
> > +{
> > +	struct list_head *cur;
> > +	struct btrfs_fs_devices *fs_devices;
> > +
> > +	list_for_each(cur, &fs_uuids) {
> > +		fs_devices = list_entry(cur, struct btrfs_fs_devices, list);
> > +		if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) == 0)
> > +			return fs_devices;
> > +	}
> > +	return NULL;
> > +}
> 
> Called under btrfs_lock_volumes()...

Nod

> 
> > +/*
> > + * we try to collect pending bios for a device so we don't get a large
> > + * number of procs sending bios down to the same device.  This greatly
> > + * improves the schedulers ability to collect and merge the bios.
> 
> eh?  The block layer already implements limiting.
> 
> Why does _not_ sending bios to a device aid the IO scheduler's merging
> activity?
> 
> The comment needs help, I suspect.

run_scheduled_bios happens from the async worker thread.  The goal is to
have the fewest possible number of async threads calling submit_bio
because that generally turns nice sequential IO from the higher layers
into seeky IO.

> 
> > + * But, it also turns into a long list of bios to process and that is sure
> > + * to eventually make the worker thread block.
> 
> Why is that bad?  If there's a large amount of IO queued then blocking
> seems an appropriate thing to do.


Each btrfs device has a list of bios that are pending submission.
run_scheduled_bios is called one device at a time.  If
run_scheduled_bios notices that a given block device is congested, it
may requeue the work for the congested device and return.

This lets the async thread move on to the next device instead of getting
stuck on the congested device.

> 
> >  The solution here is to
> > + * make some progress and then put this work struct back at the end of
> > + * the list if the block device is congested.  This way, multiple devices
> > + * can make progress from a single worker thread.
> 
> hrm, OK.  I think.
> 

I'll add more comments there ;)

> > + */
> > +static noinline int run_scheduled_bios(struct btrfs_device *device)
> > +{
> > +	struct bio *pending;
> > +	struct backing_dev_info *bdi;
> > +	struct btrfs_fs_info *fs_info;
> > +	struct bio *tail;
> > +	struct bio *cur;
> > +	int again = 0;
> > +	unsigned long num_run = 0;
> > +	unsigned long limit;
> > +
> > +	bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
> 
> device->bdev->bd_inode_backing_dev_info is equivalent and shorter, I suspect.

Ok

>
> >
> 
> I forget why we added that.
> 
> > +	fs_info = device->dev_root->fs_info;
> > +	limit = btrfs_async_submit_limit(fs_info);
> > +	limit = limit * 2 / 3;
> 
> ??

There have been about 10 different schemes to throttle async requests.
I'm still working on the perfect one...

> > +
> > +	while (pending) {
> > +		cur = pending;
> > +		pending = pending->bi_next;
> > +		cur->bi_next = NULL;
> > +		atomic_dec(&fs_info->nr_async_bios);
> > +
> > +		if (atomic_read(&fs_info->nr_async_bios) < limit &&
> > +		    waitqueue_active(&fs_info->async_submit_wait))
> > +			wake_up(&fs_info->async_submit_wait);
> > +
> > +		BUG_ON(atomic_read(&cur->bi_cnt) == 0);
> > +		bio_get(cur);
> > +		submit_bio(cur->bi_rw, cur);
> > +		bio_put(cur);
> 
> What do the bio_get()/bio_put() do?  Looks wrong, or unneeded.

Not needed, will drop.  I consistently put them everywhere while hunting
for a memory corruption.

> 
> > +		num_run++;
> > +
> > +		/*
> > +		 * we made progress, there is more work to do and the bdi
> > +		 * is now congested.  Back off and let other work structs
> > +		 * run instead
> > +		 */
> > +		if (pending && bdi_write_congested(bdi) &&
> 
> Does this function handle only writes?  If it handles reads, why do we
> test only write congestion?  (please add comments if the questions
> aren't painfully stupid).

Only writes, but since that wasn't obvious it'll get more comments as
well.

> > +
> > +static noinline int device_list_add(const char *path,
> > +			   struct btrfs_super_block *disk_super,
> > +			   u64 devid, struct btrfs_fs_devices **fs_devices_ret)
> > +{
> > +	struct btrfs_device *device;
> > +	struct btrfs_fs_devices *fs_devices;
> > +	u64 found_transid = btrfs_super_generation(disk_super);
> > +
> > +	fs_devices = find_fsid(disk_super->fsid);
> > +	if (!fs_devices) {
> > +		fs_devices = kzalloc(sizeof(*fs_devices), GFP_NOFS);
> 
> Wondering why it's GFP_NOFS.
> 
> Presumably this function is supposed to be called under
> btrfs_lock_volumes(), and we also call that function on the memory
> reclaim path?

It's mostly because my fingers can't type GFP without NOFS.  That code
doesn't need the NOFS though.

> > +
> > +static struct btrfs_fs_devices *clone_fs_devices(struct
> btrfs_fs_devices *orig)
> 
> I wonder why this function exists.

It came with Yan Zheng's seed FS support, and he'll add comments to all
the related code.  Seed filesystems let you include one filesystem as a
readonly starting point to another FS.  Basically:

* Create the perfect virtual machine image on /dev/some_disk_in_the_san
* Use /dev/sda to hold all the local modifications to the image

Other variations on readonly media like CD images would work too.

This does work today, btrfstune was added to set a filesystem as a seed.

You just mount it and add a writable device to it and you get a new FS
on the writable device that references the seed.

> > +{
> > +	struct btrfs_fs_devices *fs_devices;
> > +	struct btrfs_device *device;
> > +	struct btrfs_device *orig_dev;
> > +
> > +	fs_devices = kzalloc(sizeof(*fs_devices), GFP_NOFS);
> > +	if (!fs_devices)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	INIT_LIST_HEAD(&fs_devices->devices);
> > +	INIT_LIST_HEAD(&fs_devices->alloc_list);
> > +	INIT_LIST_HEAD(&fs_devices->list);
> > +	fs_devices->latest_devid = orig->latest_devid;
> > +	fs_devices->latest_trans = orig->latest_trans;
> > +	memcpy(fs_devices->fsid, orig->fsid, sizeof(fs_devices->fsid));
> > +
> > +	list_for_each_entry(orig_dev, &orig->devices, dev_list) {
> > +		device = kzalloc(sizeof(*device), GFP_NOFS);
> > +		if (!device)
> > +			goto error;
> > +
> > +		device->name = kstrdup(orig_dev->name, GFP_NOFS);
> > +		if (!device->name)
> > +			goto error;
> 
> We leaked *device?

We surely did, thanks.

> 
> > +		device->devid = orig_dev->devid;
> > +		device->work.func = pending_bios_fn;
> > +		memcpy(device->uuid, orig_dev->uuid, sizeof(device->uuid));
> > +		device->barriers = 1;
> > +		spin_lock_init(&device->io_lock);
> > +		INIT_LIST_HEAD(&device->dev_list);
> > +		INIT_LIST_HEAD(&device->dev_alloc_list);
> > +
> > +		list_add(&device->dev_list, &fs_devices->devices);
> > +		device->fs_devices = fs_devices;
> > +		fs_devices->num_devices++;
> > +	}
> > +	return fs_devices;
> > +error:
> > +	free_fs_devices(fs_devices);
> > +	return ERR_PTR(-ENOMEM);
> > +}
> > +
> > +int btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
> 
> whatdoesthisdo

Userland asks the btrfs module to probe a device, and it searches for FS
uuids and magic.  Each device has a device UUID (my name is xxxx) and an
FS uuid (I live in FS yyyy).

The kernel maintains a mapping of FS uuid to a list of devices where
that uuid was found.

At mount time, it reads the superblocks of each device in the list and
finds the most recent copy of the btree that lists all the device UUIDs
in the FS.

So, we have two lists of devices.  One in the kernel came from a
userland initiated probe.  One list in the kernel came from FS metadata.
One of three states exist:

1) The two lists match up and the mount can proceed.
2) The kernel didn't open all the devices that belong in the FS.
Mounting can't proceed.
3) The kernel opened a device that wasn't actually in the FS.  It is
extra and should be closed.

This code closes the devices that were extra in step 3.  It might happen
if the device was removed from the FS or failed etc etc.

(will add comment)

> 
> > +{
> > +	struct list_head *tmp;
> > +	struct list_head *cur;
> > +	struct btrfs_device *device;
> > +
> > +	mutex_lock(&uuid_mutex);
> 
> btrfs_lock_volumes()?

We should pick one or the other.  btrfs_lock_volumes used to do much
more, I'd rather drop it and change calls to explicitly use the
uuid_mutex.

> 
> > +again:
> > +	list_for_each_safe(cur, tmp, &fs_devices->devices) {
> > +		device = list_entry(cur, struct btrfs_device, dev_list);
> > +		if (device->in_fs_metadata)
> > +			continue;
> > +
> > +		if (device->bdev) {
> > +			close_bdev_exclusive(device->bdev, device->mode);
> > +			device->bdev = NULL;
> > +			fs_devices->open_devices--;
> > +		}
> > +		if (device->writeable) {
> > +			list_del_init(&device->dev_alloc_list);
> > +			device->writeable = 0;
> > +			fs_devices->rw_devices--;
> > +		}
> 
> Why are we bothering to modify *device?  It's about to be freed.

Good point.

> 
> > +		list_del_init(&device->dev_list);
> > +		fs_devices->num_devices--;
> > +		kfree(device->name);
> > +		kfree(device);
> > +	}
> > +
> > +	if (fs_devices->seed) {
> > +		fs_devices = fs_devices->seed;
> > +		goto again;
> > +	}
> > +
> > +	mutex_unlock(&uuid_mutex);
> 
> ...
> 
> > +	return 0;
> > +}
> > +
> > +static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
> > +{
> > +	struct list_head *cur;
> > +	struct btrfs_device *device;
> > +
> > +	if (--fs_devices->opened > 0)
> > +		return 0;
> > +
> > +	list_for_each(cur, &fs_devices->devices) {
> > +		device = list_entry(cur, struct btrfs_device, dev_list);
> > +		if (device->bdev) {
> 
> This reader is wondering what btrfs_device.bdev==NULL means.  How would
> one go about determining this?

The device list includes devices opened (bdev != NULL) and devices the
metadata says should be there but that we haven't found (bdev == NULL).

(Will add comment)

> 
> > +			close_bdev_exclusive(device->bdev, device->mode);
> > +			fs_devices->open_devices--;
> > +		}
> > +		if (device->writeable) {
> > +			list_del_init(&device->dev_alloc_list);
> > +			fs_devices->rw_devices--;
> > +		}
> > +
> > +		device->bdev = NULL;
> > +		device->writeable = 0;
> > +		device->in_fs_metadata = 0;
> 
> What does in_fs_metadata mean?

in_fs_metadata means the filesystem metadata on disk actually references
the device UUID.

> 
> Perhaps those three fields should have type `bool'.

Yes, it's a bool.

> 
> > +	}
> > +	WARN_ON(fs_devices->open_devices);
> > +	WARN_ON(fs_devices->rw_devices);
> > +	fs_devices->opened = 0;
> > +	fs_devices->seeding = 0;
> 
> What does the `seeding' field mean?  It's a boolean, I assume..

Yes, seeding is a bool (see seed filesystem support above)

> 
> > +	return 0;
> > +}
> > +
> > +int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
> > +{
> > +	struct btrfs_fs_devices *seed_devices = NULL;
> > +	int ret;
> > +
> > +	mutex_lock(&uuid_mutex);
> 
> btrfs_lock_volumes()?  (Please check whole fs for consistency here)
> 
> Seems strange that a lock called "uuid_mutex" is protecting a pile of
> things which don't seem very uuid-related.
> 

The lists and mappings it protects are all keyed by FS or device
uuids...made sense at the time I suppose.  Will pick a better name.

> > +
> > +static int __btrfs_open_devices(struct btrfs_fs_devices
> *fs_devices,
> > +				fmode_t flags, void *holder)
> > +{
> > +	struct block_device *bdev;
> > +	struct list_head *head = &fs_devices->devices;
> > +	struct list_head *cur;
> > +	struct btrfs_device *device;
> > +	struct block_device *latest_bdev = NULL;
> > +	struct buffer_head *bh;
> > +	struct btrfs_super_block *disk_super;
> > +	u64 latest_devid = 0;
> > +	u64 latest_transid = 0;
> > +	u64 devid;
> > +	int seeding = 1;
> > +	int ret = 0;
> > +
> > +	list_for_each(cur, head) {
> > +		device = list_entry(cur, struct btrfs_device, dev_list);
> 
> list_for_each_entry()

ack
> 
> > +		if (device->bdev)
> > +			continue;
> > +		if (!device->name)
> > +			continue;
> 
> What's the algorithm for understanding the above four lines?

if (device->bdev) we've already opened it.
if (!device->name) the device hasn't been discovered yet: the entry in
the device list records that the device should be there, but it hasn't
been found.

> 
> > +		bdev = open_bdev_exclusive(device->name, flags, holder);
> > +		if (IS_ERR(bdev)) {
> > +			printk(KERN_INFO "open %s failed\n", device->name);
> > +			goto error;
> > +		}
> > +		set_blocksize(bdev, 4096);
> 
> hm.  What's the situation with btrfs blocksizes?

Basically we only need the blocksize to be set to a consistent value.
buffer_heads are only used for the super blocks.

This will be true even after we fixup the extent_buffer and extent_io
code to work with blocksize != pagesize, since the size of the super is
4096.

> > +
> > +error_brelse:
> > +		brelse(bh);
> 
> brelse() is only needed in situations where the pointer might be NULL.

put_bh doesn't give us the friendly warnings if we've messed up the
usage count, and I try to avoid the __ unless I really need the go
faster stripes.

> > +}
> > +	bh = btrfs_read_dev_super(bdev);
> > +	if (!bh) {
> > +		ret = -EIO;
> > +		goto error_close;
> > +	}
> > +	disk_super = (struct btrfs_super_block *)bh->b_data;
> 
> <looks at struct btrfs_super_block>
> 
> It needs work, I think.  This:
> 
> 	__le64 compat_ro_flags;
> 	__le64 incompat_flags;
> 	__le16 csum_type;
> 	u8 root_level;
> 	u8 chunk_root_level;
> 	u8 log_root_level;
> 	struct btrfs_dev_item dev_item;
> 
> doesn't look like it will successfully port between architectures. 
> Also `struct btrfs_dev_item' has three le32's wedged between le64's.
> 
> I suppose that the __packed will help in some cases (but not the
> btrfs_super_block, surely).  However it seems unwise/incautious/ugly to
> rely upon __packed in this manner.
> 
> 
> Please do s/__attribute__ ((__packed__))/__packed/g

Ok thanks.

> 
> > +	devid = le64_to_cpu(disk_super->dev_item.devid);
> > +	transid = btrfs_super_generation(disk_super);
> > +	if (disk_super->label[0])
> > +		printk(KERN_INFO "device label %s ", disk_super->label);
> > +	else {
> > +		/* FIXME, make a readl uuid parser */
> 
> what's that?

ack, that should read "make a real uuid parser".  In this case I just
want to print the first few bytes of the uuid.

> 
> > +		printk(KERN_INFO "device fsid %llx-%llx ",
> > +		       *(unsigned long long *)disk_super->fsid,
> > +		       *(unsigned long long *)(disk_super->fsid + 8));
> > +	}
> > +	printk(KERN_INFO "devid %llu transid %llu %s\n",
> > +	       (unsigned long long)devid, (unsigned long long)transid, path);
> > +	ret = device_list_add(path, disk_super, devid, fs_devices_ret);
> > +
> > +	brelse(bh);

> > +	/* FIXME use last free of some kind */
> > +
> > +	/* we don't want to overwrite the superblock on the drive,
> > +	 * so we make sure to start at an offset of at least 1MB
> > +	 */
> > +	search_start = max((u64)1024 * 1024, search_start);
> 
> max_t

ack

> 
> > +	if (root->fs_info->alloc_start + num_bytes <= device->total_bytes)
> > +		search_start = max(root->fs_info->alloc_start, search_start);
> > +
> > +	key.objectid = device->devid;
> > +	key.offset = search_start;
> > +	key.type = BTRFS_DEV_EXTENT_KEY;
> > +	ret = btrfs_search_slot(trans, root, &key, path, 0, 0);
> > +	if (ret < 0)
> > +		goto error;
> > +	ret = btrfs_previous_item(root, path, 0, key.type);
> > +	if (ret < 0)
> > +		goto error;
> > +	l = path->nodes[0];
> > +	btrfs_item_key_to_cpu(l, &key, path->slots[0]);
> > +	while (1) {
> > +		l = path->nodes[0];
> > +		slot = path->slots[0];
> > +		if (slot >= btrfs_header_nritems(l)) {
> > +			ret = btrfs_next_leaf(root, path);
> > +			if (ret == 0)
> > +				continue;
> > +			if (ret < 0)
> > +				goto error;
> > +no_more_items:
> > +			if (!start_found) {
> > +				if (search_start >= search_end) {
> > +					ret = -ENOSPC;
> 
> Hey.  Shouldn't that go BUG()?

Grin, this is the enospc we do handle.

> 
> > +					goto error;
> > +				}
> > +				*start = search_start;
> > +				start_found = 1;
> > +				goto check_pending;
> > +			}
> > +			*start = last_byte > search_start ?
> > +				last_byte : search_start;
> 
> max()?

ack

> > +check_pending:
> > +	/* we have to make sure we didn't find an extent that has already
> > +	 * been allocated by the map tree or the original allocation
> > +	 */
> > +	BUG_ON(*start < search_start);
> 
> Is it triggerable by on-disk corruption?

Yes, good point.  In general the corrupted FS checks are not yet done.

> <looks at btrfs_search_slot>
> 
>  * if ins_len > 0, nodes and leaves will be split as we walk down the
>  * tree.  if ins_len < 0, nodes will be merged as we walk down the tree (if
>  * possible)
> 
> The (ab)use of the sign of formal arg `ins_len' seems a bit nasty?

I'll make a real constant for the ins_len < 0, since the code uses -1
everywhere for that case.

> >
> > +static noinline int find_next_chunk(struct btrfs_root *root,
> > +				    u64 objectid, u64 *offset)
> > +{
> > +	struct btrfs_path *path;
> > +	int ret;
> > +	struct btrfs_key key;
> > +	struct btrfs_chunk *chunk;
> > +	struct btrfs_key found_key;
> > +
> > +	path = btrfs_alloc_path();
> > +	BUG_ON(!path);
> 
> Stuff like this will eventually have to be weeded out and fixed.  It at
> least needs a big FIXME the instant it is typed in.  Quietly slipping
> it in there in this manner is just creating more work for yourselves.

Andi has a big patch of fixes for many of these.

> > +	ptr = (unsigned long)btrfs_device_uuid(dev_item);
> > +	write_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
> > +	ptr = (unsigned long)btrfs_device_fsid(dev_item);
> > +	write_extent_buffer(leaf, root->fs_info->fsid, ptr, BTRFS_UUID_SIZE);
> > +	btrfs_mark_buffer_dirty(leaf);
> 
> I expected a function called btrfs_mark_buffer_dirty() to be a wrapper
> around mark_buffer_dirty.

It gets back to the name of extent_buffers I suppose...I kept the name
similar to mark_buffer_dirty because you expect to find something that
marks the extent_buffer for writeback.

But btrfs_set_extent_buffer_dirty() is fine too



> 
> > +	ret = 0;
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> >
> > ...
> >
> > +
> > +int btrfs_rm_device(struct btrfs_root *root, char *device_path)
> > +{
> > +	struct btrfs_device *device;
> > +	struct btrfs_device *next_device;
> > +	struct block_device *bdev;
> > +	struct buffer_head *bh = NULL;
> > +	struct btrfs_super_block *disk_super;
> > +	u64 all_avail;
> > +	u64 devid;
> > +	u64 num_devices;
> > +	u8 *dev_uuid;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&uuid_mutex);
> 
> btrfs_lock_volumes()
> 
> > +	mutex_lock(&root->fs_info->volume_mutex);
> 
> lock_chunks() (which needs renaming)

Ack to get rid of lock_chunks()

> 
> > +	all_avail = root->fs_info->avail_data_alloc_bits |
> > +		root->fs_info->avail_system_alloc_bits |
> > +		root->fs_info->avail_metadata_alloc_bits;
> > +
> > +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) &&
> > +	    root->fs_info->fs_devices->rw_devices <= 4) {
> > +		printk(KERN_ERR "btrfs: unable to go below four devices "
> > +		       "on raid10\n");
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
> > +	    root->fs_info->fs_devices->rw_devices <= 2) {
> > +		printk(KERN_ERR "btrfs: unable to go below two "
> > +		       "devices on raid1\n");
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if (strcmp(device_path, "missing") == 0) {
> 
> <wonders what interface all this is implementing>

Will comment to reflect this is how we remove a device that is found in
the FS metadata but not present in the system.  Picture a raid1 where
one disk has been ripped out or died.

> > +static int btrfs_prepare_sprout(struct btrfs_trans_handle *trans,
> > +				struct btrfs_root *root)
> > +{
> > +	struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
> > +	struct btrfs_fs_devices *old_devices;
> > +	struct btrfs_fs_devices *seed_devices;
> > +	struct btrfs_super_block *disk_super = &root->fs_info->super_copy;
> > +	struct btrfs_device *device;
> > +	u64 super_flags;
> > +
> > +	BUG_ON(!mutex_is_locked(&uuid_mutex));
> 
> Given the (regrettable?) existence of btrfs_lock_volumes() and
> btrfs_unlock_volumes(), it would make sense to add another regrettable
> wrapper for the above operation.

It makes sense to drop the wrappers.

...

> > +	list_for_each_entry(device, &seed_devices->devices, dev_list) {
> > +		device->fs_devices = seed_devices;
> > +	}
> 
> unneeded braces
> 
> <runs checkpatch>
> 
> <checkpatch misses this - regression?>
> 
> 
> #2001: FILE: fs/btrfs/volumes.c:1930:
> +               return calc_size * num_stripes;
>                                  ^
> 
> ERROR: space prohibited after that '*' (ctx:WxW)

checkpatch wants me to do calc_size *num_stripes, I think because it
thinks num_stripes is a pointer?

At any rate, I missed the braces because I was busy being grumpy about
the bogus * warning...will fix the braces ;)

> #2072: FILE: fs/btrfs/volumes.c:2001:
> +               max_chunk_size = calc_size * 2;
>                                            ^
> 
> ERROR: space prohibited after that '*' (ctx:WxW)
> #2089: FILE: fs/btrfs/volumes.c:2018:
> +       if (calc_size * num_stripes > max_chunk_size) {
> 	                      ^
> 	
> ERROR: space prohibited after that '*' (ctx:WxW)
> #2105: FILE: fs/btrfs/volumes.c:2034:
> +               min_free = calc_size * 2;
> 	                                     ^
> 
> hm, they're all wrong.	
> 

Maybe it's the u64 that confuses checkpatch?

> >
> > +
> > +int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
> > +{
...
> > +	bdev = open_bdev_exclusive(device_path, 0, root->fs_info->bdev_holder);
> > +	if (!bdev)
> > +		return -EIO;
> > +
> > +	if (root->fs_info->fs_devices->seeding) {
> > +		seeding_dev = 1;
> > +		down_write(&sb->s_umount);
> > +		mutex_lock(&uuid_mutex);
> > +	}
> > +
> > +	filemap_write_and_wait(bdev->bd_inode->i_mapping);
> 
> I'm surprised that this can be safely done under ->s_umount.

btrfs_init_new_device is only called by the add_dev ioctl.  That doesn't
happen with s_umount held.  Or do you mean we should be taking s_umount?

> > +
> > +	device->name = kstrdup(device_path, GFP_NOFS);
> > +	if (!device->name) {
> > +		kfree(device);
> > +		ret = -ENOMEM;
> > +		goto error;
> > +	}
> > +
> > +	ret = find_next_devid(root, &device->devid);
> > +	if (ret) {
> > +		kfree(device);
> 
> Leaks device->name

ack.

> 
> > +		goto error;
> > +	}
> > +
> > +	trans = btrfs_start_transaction(root, 1);
> > +	lock_chunks(root);
> > +
> > +	device->barriers = 1;
> > +	device->writeable = 1;
> > +	device->work.func = pending_bios_fn;
> 
> Should we be using INIT_WORK() here?  It has lockdep implications.
> (Multiple instances).

These are struct btrfs_work

> <attention span exceeded, sorry>

Thanks for all the comments!


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
