On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote:
> On 12/09/2012 07:04 AM, Liu Bo wrote:
> > On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
> > Hi Jim,
> >
> > Could you please apply the following patch to test if it works?
>
> Hi,
>
> So far, with your patch applied I've been unable to reproduce
> the recursive deadlock. Thanks a lot for this patch!
> This issue has been troubling me for a while.

Hi Jim,

Good news for us :)

> I've been trying to learn more about btrfs internals -
> if you have the time to answer a couple questions about
> your patch, I'd really appreciate it. See below.
>
> > (It's against 3.7-rc8.)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 3d3e2c1..100289b 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
> >
> >  	if (data)
> >  		flags = BTRFS_BLOCK_GROUP_DATA;
> > -	else if (root == root->fs_info->chunk_root)
> > +	else if (root == root->fs_info->chunk_root ||
> > +		 root == root->fs_info->dev_root)
> >  		flags = BTRFS_BLOCK_GROUP_SYSTEM;
> >  	else
> >  		flags = BTRFS_BLOCK_GROUP_METADATA;
> > @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
> >  		num_dev = 1;	/* DUP or single */
> >
> >  	/* metadata for updaing devices and chunk tree */
> > +	num_dev = num_dev << 1;
>
> AFAICS this is doubling the size of the reserve, which
> reduces the chance of a recursive do_chunk_alloc(), right?

Not quite. We hit the deadlock because updating the device tree also
uses the METADATA chunk, and that update can happen while we're
actually allocating a METADATA chunk. So the patch I sent you makes
device tree updates use the SYSTEM chunk instead, and for the SYSTEM
chunk we have code that checks whether it is large enough before
allocating a chunk (if not, we allocate a SYSTEM chunk first).

Here I double the size because the worst case of allocating a
DATA/METADATA chunk -may- result in:

1) adding a SYSTEM chunk, plus
2) adding a dev extent per chunk stripe, plus
3) updating each chunk stripe's bytes_used.

(See the P.S. below for a sketch of the arithmetic.)

> >  	return btrfs_calc_trans_metadata_size(root, num_dev + 1);
>
> btrfs_calc_trans_metadata_size(root, num_items) multiplies its
> num_items argument by another factor of three - do you know if
> there is some rationale behind that number, or is it perhaps
> also an empirically determined factor?

The height of btrfs's metadata btree is at most 8: leaves are at
level 0, and nodes go up to at most level 7. Each btree update may
result in COWing a node and its sibling nodes, which is where the
factor of three comes from.

> What I'm wondering about is that if the size of the reserve is
> empirically determined, will it need to be increased again
> later when machines are more capable, and can handle a higher
> load?
>
> Do you think it's feasible to modify the locking for
> do_chunk_alloc to allow it to recurse without deadlock?

Well, it could be, but IMO it'll bring us complexity and make the
code harder to maintain.

Any questions? Feel free to ask.

thanks,
liubo
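
P.S. In case it helps with reading the code, here is a sketch of what
get_system_chunk_thresh() looks like with the patch applied. I'm
writing it out against the 3.7-era sources from memory, so take it as
an illustration of the arithmetic rather than the exact code:

static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
{
	u64 num_dev;

	/*
	 * How many device items a new chunk can touch depends on the
	 * raid profile: striped profiles touch every writable device.
	 */
	if (type & (BTRFS_BLOCK_GROUP_RAID10 | BTRFS_BLOCK_GROUP_RAID0))
		num_dev = root->fs_info->fs_devices->rw_devices;
	else if (type & BTRFS_BLOCK_GROUP_RAID1)
		num_dev = 2;
	else
		num_dev = 1;	/* DUP or single */

	/*
	 * Metadata for updating devices and the chunk tree; doubled
	 * to cover the worst case above: a new SYSTEM chunk, a dev
	 * extent per stripe, and a bytes_used update per stripe.
	 */
	num_dev = num_dev << 1;
	return btrfs_calc_trans_metadata_size(root, num_dev + 1);
}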
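
And the helper it calls, where the factor of three comes from (again
a from-memory sketch of the 3.7-era fs/btrfs/ctree.h):

static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
						 unsigned num_items)
{
	/*
	 * Worst case per item: COW one leaf plus a node at each of
	 * the up-to-7 levels above it (BTRFS_MAX_LEVEL == 8), and the
	 * factor of 3 allows for COWing a node's siblings on splits.
	 */
	return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
		3 * num_items;
}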