Yan,

I'll use this trick next time around. I did dump the kernel stacks for
my process: 4 threads were blocked in SYS_newfstat (with the MDS request
further up the stack).

I ended up restarting the MDS after a few hours of trying to track it
down. The hang resolved itself after that.

This machine is running a pretty recent kernel (3.12 + merged testing
ceph); the other machines are running a slightly older 3.12-rc.

I have seen this issue before on random nodes, infrequently, for months
(maybe once every week or two).

Thanks again,
- Milosz

On Tue, Nov 19, 2013 at 8:18 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Wed, Nov 20, 2013 at 12:05 AM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>> Yan and Sage,
>>
>> I've run into this issue again on my test cluster. The client hangs
>> all requests for a particular inode. I did a dump cache to see what's
>> going on, but I don't understand enough to be able to read this line
>> well.
>>
>> Can you guys help me read this, so I can further track down and
>> hopefully fix this issue?
>>
>> [inode 10000346eed [2,head]
>> /petabucket/beta/17511b3d12466609785b6a0e34597431721d177240371c0a1a4e347a1605381b/advertiser_id.dict
>> auth v214 ap=5+0 dirtyparent s=925 n(v0 b925 1=1+0) (ifile sync->mix)
>> (iversion lock) cr={59947=0-4194304@1}
>> caps={59947=pAsLsXsFr/pAsxXsxFxwb@26,60001=pAsLsXsFr/-@1,60655=pAsLsXsFr/pAsLsXsFscr/pFscr@36}
>> | ptrwaiter=0 request=4 lock=1 caps=1 dirtyparent=1 dirty=1 waiter=1
>> authpin=1 0x17dd6b70]
>>
>> root@bnode-16a1ed7d:~# cat
>> /sys/kernel/debug/ceph/e23a1bfc-8328-46bf-bc59-1209df3f5434.client60655/mdsc
>> 15659 mds0 getattr #10000346eed
>> 15679 mds0 getattr #10000346eed
>> 15710 mds0 getattr #10000346eed
>> 15922 mds0 getattr #10000346eed
>
> Which kernel are you using? Are there any blocked processes
> (echo w > /proc/sysrq-trigger) on client.60655? The 3.12 kernel
> contains a few fixes for similar hangs.
>
> Regards
> Yan, Zheng
>
>>
>> On Wed, Nov 6, 2013 at 10:01 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>> On Wed, Nov 6, 2013 at 9:41 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>> Sage,
>>>>
>>>> I think the incrementing version counter is, on the whole, a neater
>>>> solution than using size and mtime. If nothing else it's more explicit
>>>> in the read cache version. With what you suggested, plus additional
>>>> changes to the open code (where the cookie gets created), the
>>>> write-through scenario should be correct.
>>>>
>>>> Sadly, my understanding of the MDS protocol is still not great, so
>>>> when doing this in the first place I erred on the side of using what
>>>> was already in place.
>>>>
>>>> An unrelated question: is there a debug hook in the kclient (or the
>>>> MDS, for that matter) to dump the current file inodes (names) with
>>>> their issued caps and the hosts they are issued to? This would be
>>>> very helpful for debugging, since from time to time I see one of the
>>>> clients get stuck in getattr (via the mdsc debug log).
>>>>
>>>
>>> "ceph mds tell \* dumpcache" dumps the MDS cache to a file. The dump
>>> file contains caps information.
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>>> Thanks,
>>>> - Milosz
>>>>
>>>> On Tue, Nov 5, 2013 at 6:56 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>>>> On Tue, 5 Nov 2013, Milosz Tanski wrote:
>>>>>> Li,
>>>>>>
>>>>>> First, sorry for the late reply on this.
>>>>>>
>>>>>> Currently fscache is only supported for files that are open in
>>>>>> read-only mode. I was originally going to let fscache cache in the
>>>>>> write path as well, as long as the file was opened with O_LAZY. I
>>>>>> abandoned that idea. When a user opens the file with O_LAZY we can
>>>>>> cache things locally on the assumption that the user will take care
>>>>>> of the synchronization in some other manner. Because there is no way
>>>>>> of invalidating a subset of the pages in an object cached by
>>>>>> fscache, there is no way we can make O_LAZY work well.
>>>>>>
>>>>>> The ceph_readpage_to_fscache() in writepage has no effect and it
>>>>>> should be removed. ceph_readpage_to_fscache() calls cache_valid() to
>>>>>> see if it should perform the page save, and since the file can't have
>>>>>> a CACHE cap at the point in time it doesn't do it.
>>>>>
>>>>> (Hmm, Dusting off my understanding of fscache and reading
>>>>> fs/ceph/cache.c; watch out!) It looks like cache_valid is
>>>>>
>>>>> static inline int cache_valid(struct ceph_inode_info *ci)
>>>>> {
>>>>>         return ((ceph_caps_issued(ci) & CEPH_CAP_FILE_CACHE) &&
>>>>>                 (ci->i_fscache_gen == ci->i_rdcache_gen));
>>>>> }
>>>>>
>>>>> and in the FILE_EXCL case, the MDS will issue CACHE|BUFFER caps. But I
>>>>> think the aux key (size+mtime) will prevent any use of the cache as soon
>>>>> as the first write happens and mtime changes, right?
>>>>>
>>>>> I think that in order to make this work, we need to fix/create a
>>>>> file_version (or something similar) field in the (mds) inode_t to have
>>>>> some useful value. I.e., increment it any time
>>>>>
>>>>> - a different client/writer comes along
>>>>> - a file is modified by the mds (e.g., truncated or recovered)
>>>>>
>>>>> but allow it to otherwise remain the same as long as only a single client
>>>>> is working with the file exclusively. This will be more precise than the
>>>>> (size, mtime) check that is currently used, and would remain valid when a
>>>>> single client opens the same file for exclusive read/write multiple times
>>>>> but there are no other intervening changes.
>>>>>
>>>>> Milosz, if that were in place, is there any reason not to wire up
>>>>> writepage and allow the fscache to be used write-through?
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> - Milosz
>>>>>>
>>>>>> On Thu, Oct 31, 2013 at 11:56 PM, Li Wang <liwang@xxxxxxxxxxxxxxx> wrote:
>>>>>> > Currently, the pages in fscache only are updated in writepage() path,
>>>>>> > add the process in writepages().
>>>>>> >
>>>>>> > Signed-off-by: Min Chen <minchen@xxxxxxxxxxxxxxx>
>>>>>> > Signed-off-by: Li Wang <liwang@xxxxxxxxxxxxxxx>
>>>>>> > Signed-off-by: Yunchuan Wen <yunchuanwen@xxxxxxxxxxxxxxx>
>>>>>> > ---
>>>>>> >  fs/ceph/addr.c |    8 +++++---
>>>>>> >  1 file changed, 5 insertions(+), 3 deletions(-)
>>>>>> >
>>>>>> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>>>>>> > index 6df8bd4..cc57911 100644
>>>>>> > --- a/fs/ceph/addr.c
>>>>>> > +++ b/fs/ceph/addr.c
>>>>>> > @@ -746,7 +746,7 @@ retry:
>>>>>> >
>>>>>> >         while (!done && index <= end) {
>>>>>> >                 int num_ops = do_sync ? 2 : 1;
>>>>>> > -               unsigned i;
>>>>>> > +               unsigned i, j;
>>>>>> >                 int first;
>>>>>> >                 pgoff_t next;
>>>>>> >                 int pvec_pages, locked_pages;
>>>>>> > @@ -894,7 +894,6 @@ get_more_pages:
>>>>>> >                         if (!locked_pages)
>>>>>> >                                 goto release_pvec_pages;
>>>>>> >                 if (i) {
>>>>>> > -                       int j;
>>>>>> >                         BUG_ON(!locked_pages || first < 0);
>>>>>> >
>>>>>> >                         if (pvec_pages && i == pvec_pages &&
>>>>>> > @@ -924,7 +923,10 @@ get_more_pages:
>>>>>> >
>>>>>> >                 osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0,
>>>>>> >                                                         !!pool, false);
>>>>>> > -
>>>>>> > +               for(j = 0; j < locked_pages; j++) {
>>>>>> > +                       struct page *page = pages[j];
>>>>>> > +                       ceph_readpage_to_fscache(inode, page);
>>>>>> > +               }
>>>>>> >                 pages = NULL;   /* request message now owns the pages array */
>>>>>> >                 pool = NULL;
>>>>>> >
>>>>>> > --
>>>>>> > 1.7.9.5
>>>>>> >
>>>>>> > --
>>>>>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Milosz Tanski
>>>>>> CTO
>>>>>> 10 East 53rd Street, 37th floor
>>>>>> New York, NY 10022
>>>>>>
>>>>>> p: 646-253-9055
>>>>>> e: milosz@xxxxxxxxx
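For anyone chasing the same kind of hang: the mdsc paste earlier in this thread can be summarized per inode with a few lines of script, which makes it easier to spot one inode collecting stuck requests. This is only a rough sketch. It assumes every line has the shape "<tid> mds<rank> <op> #<ino>" as in that paste (other kernels or other ops may print more fields), and the function name pending_by_inode is made up for illustration.

```python
from collections import defaultdict


def pending_by_inode(mdsc_text):
    """Group in-flight MDS requests by inode number (hex string, no '#')."""
    pending = defaultdict(list)
    for line in mdsc_text.splitlines():
        fields = line.split()
        # Assumed layout: <tid> mds<rank> <op> #<ino>. Skip anything else.
        if len(fields) < 4:
            continue
        if not fields[0].isdigit() or not fields[3].startswith("#"):
            continue
        tid, mds, op, ino = int(fields[0]), fields[1], fields[2], fields[3][1:]
        pending[ino].append((tid, mds, op))
    return dict(pending)


# Sample input copied from the mdsc output quoted in this thread.
sample = """\
15659 mds0 getattr #10000346eed
15679 mds0 getattr #10000346eed
15710 mds0 getattr #10000346eed
15922 mds0 getattr #10000346eed
"""

for ino, reqs in pending_by_inode(sample).items():
    ops = ", ".join(op for _, _, op in reqs)
    print(f"inode {ino}: {len(reqs)} pending ({ops})")
```

On the paste from this thread it reports four pending getattr requests on inode 10000346eed, which lines up with request=4 in the dumpcache entry for the same inode.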