On Fri, Jan 31, 2020 at 6:32 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Fri, Jan 31, 2020 at 4:57 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > Hi Ilya,
> >
> > On Fri, Jan 31, 2020 at 11:33 AM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 31, 2020 at 11:06 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We are quite regularly (a couple of times per week) seeing:
> > > >
> > > > HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
> > > > MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
> > > >     mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162
> > > > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > > >     mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs
> > > >
> > > > This is being caused by osdc ops stuck in a kernel client, e.g.:
> > > >
> > > > 10:57:18 root hpc-be028 /root
> > > > → cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc
> > > > REQUESTS 9 homeless 0
> > > > 46559317  osd243  3.ee6ffcdb  3.cdb  [243,501,92]/243   [243,501,92]/243   e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057  0x400014  1  read
> > > > 46559322  osd243  3.ee6ffcdb  3.cdb  [243,501,92]/243   [243,501,92]/243   e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057  0x400014  1  read
> > > > 46559323  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559341  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559342  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559345  osd243  3.969cc573  3.573  [243,330,226]/243  [243,330,226]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056  0x400014  1  read
> > > > 46559621  osd243  3.6313e8ef  3.8ef  [243,330,521]/243  [243,330,521]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a  0x400014  1  read
> > > > 46559629  osd243  3.b280c852  3.852  [243,113,539]/243  [243,113,539]/243  e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f  0x400014  1  read
> > > > 46559928  osd243  3.1ee7bab4  3.ab4  [243,332,94]/243   [243,332,94]/243   e678697  fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f  0x400024  1  write
> > > > LINGER REQUESTS
> > > > BACKOFFS
> > > >
> > > > We can unblock those requests by doing `ceph osd down osd.243` (or
> > > > restarting osd.243).
> > > >
> > > > This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64.
> > > >
> > > > Is there a better way to debug this?
> > >
> > > Hi Dan,
> > >
> > > I assume that these ops don't show up as slow requests on the OSD side?
> > > How long did you see it stuck for before intervening?
> >
> > That's correct -- the osd had no active ops (ceph daemon.... ops).
> >
> > The late release slow req was stuck for 4129s before we intervened.
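For anyone hitting the same warning, this is roughly the cross-check we run before
kicking the OSD -- only a sketch, assuming debugfs is mounted on the client and you
can reach the admin socket on the OSD host; the fsid/client id and osd.243 below are
simply the ones from the example above:

  # On the client: list the stuck objecter requests and the OSD they target.
  cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc

  # On the host running osd.243: check whether the OSD thinks it has those ops
  # at all (in our case it reports nothing in flight).
  ceph daemon osd.243 ops
  ceph daemon osd.243 dump_historic_ops

  # Workaround: mark the OSD down so the client resends the stuck requests.
  ceph osd down osd.243
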
> >
> > > Do you happen to have "debug ms = 1" logs from osd243?
> >
> > Nope, but I can try to get it afterwards next time. (Though you need
> > it at the moment the ops get stuck, not only from the moment we notice
> > the stuck ops, right?)
>
> Yes, starting before the moment the ops get stuck and ending after you
> kick the OSD.
>
> > > Do you have PG autoscaler enabled? Any PG splits and/or merges at the time?
> >
> > Not on the cephfs_(meta)data pools (though on the 30th I increased
> > those pool sizes from 2 to 3). Also on the 30th I did some PG
> > merging on an unrelated test pool.
> > In any case, we have seen this type of lockup in the past without those
> > pool changes (also with the mimic MDS, until we upgraded to nautilus).
>
> The MDS is out of the picture here. This issue is between the kernel
> client and the OSD.
>
> > Looking back further in the client's kernel log we see a page alloc
> > failure on the 30th:
> >
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: kworker/1:36: page allocation failure: order:5, mode:0x104050
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: CPU: 1 PID: 78445 Comm: kworker/1:36 Kdump: loaded Tainted: P
> > Jan 30 16:16:35 hpc-be028.cern.ch kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
>
> Can you share the stack trace? That's a 128k allocation, so worth
> taking a look.
>

Pasted here: https://pastebin.com/neyah54k

The same node had a lockup again last night, and we've included the page
alloc failure and a resulting dump of osdmap "corruption".

-- Dan

> Thanks,
>
> Ilya
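P.S. A rough sketch of what we plan to run ahead of the next occurrence, so the
"debug ms = 1" log covers the window before the ops get stuck (standard nautilus
CLI assumed; osd.243 is the OSD from the example above):

  # Pre-arm verbose messenger logging on the suspect OSD (this is chatty,
  # so only leave it on while waiting for a recurrence).
  ceph config set osd.243 debug_ms 1
  # or, without touching the stored config:
  ceph tell osd.243 injectargs '--debug_ms 1'

  # Double-check whether the autoscaler could be splitting/merging any pools.
  ceph osd pool autoscale-status

  # Once a hang (and the OSD kick) has been captured in the log, revert.
  ceph config rm osd.243 debug_ms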