On 12/11/2012 06:37 PM, Liu Bo wrote:
> On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote:
>> On 12/09/2012 07:04 AM, Liu Bo wrote:
>>> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
>>>
>>> Hi Jim,
>>>
>>> Could you please apply the following patch to test if it works?
>>
>> Hi,
>>
>> So far, with your patch applied I've been unable to reproduce
>> the recursive deadlock. Thanks a lot for this patch!
>> This issue has been troubling me for a while.
>
> Hi Jim,
>
> Good news for us :)
>
>> I've been trying to learn more about btrfs internals -
>> if you have the time to answer a couple of questions about
>> your patch, I'd really appreciate it.
>
> See below.
>
>>> (It's against 3.7-rc8.)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 3d3e2c1..100289b 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
>>>
>>>  	if (data)
>>>  		flags = BTRFS_BLOCK_GROUP_DATA;
>>> -	else if (root == root->fs_info->chunk_root)
>>> +	else if (root == root->fs_info->chunk_root ||
>>> +		 root == root->fs_info->dev_root)
>>>  		flags = BTRFS_BLOCK_GROUP_SYSTEM;
>>>  	else
>>>  		flags = BTRFS_BLOCK_GROUP_METADATA;
>>> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
>>>  		num_dev = 1;	/* DUP or single */
>>>
>>>  	/* metadata for updating devices and chunk tree */
>>> +	num_dev = num_dev << 1;
>>>  	return btrfs_calc_trans_metadata_size(root, num_dev + 1);
>>
>> AFAICS this is doubling the size of the reserve, which
>> reduces the chance of a recursive do_chunk_alloc(), right?
>
> Not quite. We hit the deadlock because updating the device tree also
> uses METADATA chunks, and that update can happen while we're in the
> middle of allocating a METADATA chunk. So the patch I sent you makes
> device tree updates use SYSTEM chunks instead; we already have code
> that checks whether the SYSTEM space is sufficient before allocating
> a chunk, and if it isn't, we allocate a SYSTEM chunk first.
>
> Here I double the size because the worst case of allocating a
> DATA/METADATA chunk -may- result in:
>
> 1) adding a SYSTEM chunk, plus
> 2) adding a dev extent per chunk stripe, plus
> 3) updating each chunk stripe's bytes_used.
>
>> btrfs_calc_trans_metadata_size(root, num_items) multiplies its
>> num_items argument by another factor of three - do you know if
>> there is some rationale behind that number, or is it perhaps
>> also an empirically determined factor?
>
> The height of a btrfs metadata btree is at most 8: the leaf is at
> level 0, and nodes go up to at most level 7.
>
> Each btree update may result in COWing a node and its sibling nodes;
> that is where the factor of three comes from.
>
>> What I'm wondering about is that if the size of the reserve is
>> empirically determined, will it need to be increased again
>> later when machines are more capable, and can handle a higher
>> load?
>>
>> Do you think it's feasible to modify the locking for
>> do_chunk_alloc to allow it to recurse without deadlock?
>
> Well, it could be done, but IMO it would add complexity and hurt
> maintainability.
>
> Any questions? Feel free to ask.

Your response was very helpful. Thanks a lot!

-- Jim

> thanks,
> liubo
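
For reference, the helper behind the factor-of-three question above is
btrfs_calc_trans_metadata_size() in fs/btrfs/ctree.h. In the 3.7 tree it
reads roughly as follows (a sketch from that era's sources; double-check
against your own tree before relying on the exact form):

static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
						 unsigned num_items)
{
	/*
	 * Worst case per item: a COW at every level of the btree.
	 * BTRFS_MAX_LEVEL is 8 (a leaf at level 0 plus up to seven
	 * levels of nodes), and the factor of 3 covers COWing a node
	 * together with its sibling nodes, as Liu Bo explains above.
	 */
	return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
		3 * num_items;
}

With the patch applied, get_system_chunk_thresh() therefore reserves
(leafsize + 7 * nodesize) * 3 * (2 * num_dev + 1) bytes of SYSTEM space
for the worst case sketched in Liu Bo's three-point list.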
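
The "code that checks whether the SYSTEM space is sufficient" that Liu Bo
mentions is check_system_chunk() in fs/btrfs/extent-tree.c, which
do_chunk_alloc() runs before committing to the real allocation. Simplified
from the 3.7 sources (debug output and error handling trimmed, so treat
this as a sketch rather than the exact code):

static void check_system_chunk(struct btrfs_trans_handle *trans,
			       struct btrfs_root *root, u64 type)
{
	struct btrfs_space_info *info;
	u64 left;
	u64 thresh;

	/* How much SYSTEM space is still unclaimed? */
	info = __find_space_info(root->fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
	spin_lock(&info->lock);
	left = info->total_bytes - info->bytes_used - info->bytes_pinned -
		info->bytes_reserved - info->bytes_readonly -
		info->bytes_may_use;
	spin_unlock(&info->lock);

	/* Compare against the worst-case need computed above. */
	thresh = get_system_chunk_thresh(root, type);
	if (left < thresh) {
		u64 flags;

		/* Top up SYSTEM space before the real chunk allocation. */
		flags = btrfs_get_alloc_profile(root->fs_info->chunk_root, 0);
		btrfs_alloc_chunk(trans, root, flags);
	}
}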