> -----Original Message-----
> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> Sent: 07 July 2017 11:32
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Kernel mounted RBD's hanging
>
> On Fri, Jul 7, 2017 at 12:10 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Managed to catch another one, osd.75 again; not sure if that is an
> > indication of anything or just a coincidence. osd.75 is one of 8 OSDs
> > in a cache tier, so all IO will be funnelled through them.
> >
> > Also found this in the log of osd.75 at the same time, but the client IP
> > is not the same as the node which experienced the hang.
>
> Can you bump debug_ms and debug_osd to 30 on osd75? I doubt it's an
> issue with that particular OSD, but if it goes down the same way again, I'd
> have something to look at. Make sure logrotate is configured and working
> before doing that though... ;)
>
> Thanks,
>
>                 Ilya

So, osd.75 was a coincidence; several other hangs have had outstanding requests to other OSDs, which is why I haven't been able to get the debug logs from the relevant OSD during a hang yet. I think the crc problem may now be fixed, though, by upgrading all clients to 4.11.1+.

Here is a series of osdc dumps taken every minute during one of the hangs, this time with a different target OSD. The osdc dumps on another node show IO being processed normally whilst this node hangs, so the cluster is definitely handling IO fine during the hang. And as I am using cache tiering with proxying, all IO will be going through just 8 OSDs. The host has 3 RBDs mounted, and all 3 hang.
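For reference, dumps like the ones below can be captured with something along these lines, and Ilya's suggested debug bump can be applied at runtime with injectargs. This is only a minimal sketch: it assumes debugfs is mounted on the client, root access, and the log path is just an example; the directory under /sys/kernel/debug/ceph/ is named after the cluster fsid and client id.

  # Bump debug on the suspect OSD at runtime (reverts when the OSD restarts):
  ceph tell osd.75 injectargs '--debug_ms 30 --debug_osd 30'

  # Dump the kernel client's in-flight requests once a minute with a timestamp:
  while true; do
      date
      cat /sys/kernel/debug/ceph/*/osdc
      sleep 60
  done >> /var/log/osdc-dumps.log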
Latest hang:

Sat 8 Jul 18:49:01 BST 2017
REQUESTS 4 homeless 0
174662831 osd25 17.77737285 [25,74,14]/25 [25,74,14]/25 rbd_data.15d8670238e1f29.00000000000cf9f8 0x400024 1 0'0 set-alloc-hint,write
174662863 osd25 17.7b91a345 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.000000000002571c 0x400024 1 0'0 set-alloc-hint,write
174662887 osd25 17.6c2eaa93 [25,75,14]/25 [25,75,14]/25 rbd_data.158f204238e1f29.0000000000000008 0x400024 1 0'0 set-alloc-hint,write
174662925 osd25 17.32271445 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.0000000000000001 0x400024 1 0'0 set-alloc-hint,write
LINGER REQUESTS
18446462598732840990 osd74 17.145baa0f [74,72,14]/74 [74,72,14]/74 rbd_header.158f204238e1f29 0x20 8 WC/0
18446462598732840991 osd74 17.7b4e2a06 [74,72,25]/74 [74,72,25]/74 rbd_header.1555406238e1f29 0x20 9 WC/0
18446462598732840992 osd74 17.eea94d58 [74,73,25]/74 [74,73,25]/74 rbd_header.15d8670238e1f29 0x20 8 WC/0

Sat 8 Jul 18:50:01 BST 2017
REQUESTS 5 homeless 0
174662831 osd25 17.77737285 [25,74,14]/25 [25,74,14]/25 rbd_data.15d8670238e1f29.00000000000cf9f8 0x400024 1 0'0 set-alloc-hint,write
174662863 osd25 17.7b91a345 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.000000000002571c 0x400024 1 0'0 set-alloc-hint,write
174662887 osd25 17.6c2eaa93 [25,75,14]/25 [25,75,14]/25 rbd_data.158f204238e1f29.0000000000000008 0x400024 1 0'0 set-alloc-hint,write
174662925 osd25 17.32271445 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.0000000000000001 0x400024 1 0'0 set-alloc-hint,write
174663129 osd25 17.32271445 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.0000000000000001 0x400024 1 0'0 set-alloc-hint,write
LINGER REQUESTS
18446462598732840990 osd74 17.145baa0f [74,72,14]/74 [74,72,14]/74 rbd_header.158f204238e1f29 0x20 8 WC/0
18446462598732840991 osd74 17.7b4e2a06 [74,72,25]/74 [74,72,25]/74 rbd_header.1555406238e1f29 0x20 9 WC/0
18446462598732840992 osd74 17.eea94d58 [74,73,25]/74 [74,73,25]/74 rbd_header.15d8670238e1f29 0x20 8 WC/0

Sat 8 Jul 18:51:01 BST 2017
REQUESTS 5 homeless 0
174662831 osd25 17.77737285 [25,74,14]/25 [25,74,14]/25 rbd_data.15d8670238e1f29.00000000000cf9f8 0x400024 1 0'0 set-alloc-hint,write
174662863 osd25 17.7b91a345 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.000000000002571c 0x400024 1 0'0 set-alloc-hint,write
174662887 osd25 17.6c2eaa93 [25,75,14]/25 [25,75,14]/25 rbd_data.158f204238e1f29.0000000000000008 0x400024 1 0'0 set-alloc-hint,write
174662925 osd25 17.32271445 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.0000000000000001 0x400024 1 0'0 set-alloc-hint,write
174663129 osd25 17.32271445 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.0000000000000001 0x400024 1 0'0 set-alloc-hint,write
LINGER REQUESTS
18446462598732840990 osd74 17.145baa0f [74,72,14]/74 [74,72,14]/74 rbd_header.158f204238e1f29 0x20 8 WC/0
18446462598732840991 osd74 17.7b4e2a06 [74,72,25]/74 [74,72,25]/74 rbd_header.1555406238e1f29 0x20 9 WC/0
18446462598732840992 osd74 17.eea94d58 [74,73,25]/74 [74,73,25]/74 rbd_header.15d8670238e1f29 0x20 8 WC/0

Sat 8 Jul 18:52:01 BST 2017
REQUESTS 6 homeless 0
174662831 osd25 17.77737285 [25,74,14]/25 [25,74,14]/25 rbd_data.15d8670238e1f29.00000000000cf9f8 0x400024 1 0'0 set-alloc-hint,write
174662863 osd25 17.7b91a345 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.000000000002571c 0x400024 1 0'0 set-alloc-hint,write
174662887 osd25 17.6c2eaa93 [25,75,14]/25 [25,75,14]/25 rbd_data.158f204238e1f29.0000000000000008 0x400024 1 0'0 set-alloc-hint,write
174662925 osd25 17.32271445 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.0000000000000001 0x400024 1 0'0 set-alloc-hint,write
174663129 osd25 17.32271445 [25,74,14]/25 [25,74,14]/25 rbd_data.1555406238e1f29.0000000000000001 0x400024 1 0'0 set-alloc-hint,write
174664149 osd25 17.b148df13 [25,75,14]/25 [25,75,14]/25 rbd_data.158f204238e1f29.0000000000091205 0x400024 1 0'0 set-alloc-hint,write
LINGER REQUESTS
18446462598732840990 osd74 17.145baa0f [74,72,14]/74 [74,72,14]/74 rbd_header.158f204238e1f29 0x20 8 WC/0
18446462598732840991 osd74 17.7b4e2a06 [74,72,25]/74 [74,72,25]/74 rbd_header.1555406238e1f29 0x20 9 WC/0
18446462598732840992 osd74 17.eea94d58 [74,73,25]/74 [74,73,25]/74 rbd_header.15d8670238e1f29 0x20 8 WC/0

And it continues on identically until 19:03.

I realize at this stage these reports are probably not revealing much more information, so I will report back if I can gather any further information from the OSDs. The problem does seem to be related to load, or at least to the number of RBDs mounted; the host that only has 2 RBDs mounted hardly experiences this problem at all.

Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com