On Fri, Jul 7, 2017 at 12:10 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Managed to catch another one, osd.75 again, not sure if that is an indication of anything or just a co-incidence. osd.75 is one of 8 OSD's in a cache tier, so all IO will be funnelled through them.
>
>
> cat /sys/kernel/debug/ceph/d027d580-d69d-48f4-9d28-9b1650b57cce.client31443905/osdc
> REQUESTS 13 homeless 0
> 130947221 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> 130947226 osd75 17.4f47f0c3 [75,14,72]/75 [75,14,72]/75 rbd_data.1555406238e1f29.000000000007c8a9 0x400024 1 0'0 set-alloc-hint,write
> 130947231 osd75 17.a184a1cc [75,72,3]/75 [75,72,3]/75 rbd_data.15d8670238e1f29.0000000000064054 0x400024 1 0'0 set-alloc-hint,write
> 130947274 osd75 17.4d83ed0c [75,72,3]/75 [75,72,3]/75 rbd_data.1555406238e1f29.000000000007ccc1 0x400024 1 0'0 set-alloc-hint,write
> 130947349 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> 130947421 osd75 17.32207383 [75,14,72]/75 [75,14,72]/75 rbd_data.15d8670238e1f29.0000000000000000 0x400024 1 0'0 set-alloc-hint,write
> 130947472 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> 130947474 osd75 17.32207383 [75,14,72]/75 [75,14,72]/75 rbd_data.15d8670238e1f29.0000000000000000 0x400024 1 0'0 set-alloc-hint,write
> 130947689 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> 130947740 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> 130947783 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> 130947826 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> 130947868 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
> LINGER REQUESTS
> 18446462598732840990 osd74 17.145baa0f [74,72,14]/74 [74,72,14]/74 rbd_header.158f204238e1f29 0x20 0 WC/0
> 18446462598732840991 osd74 17.7b4e2a06 [74,72,25]/74 [74,72,25]/74 rbd_header.1555406238e1f29 0x20 0 WC/0
> 18446462598732840992 osd74 17.eea94d58 [74,73,25]/74 [74,73,25]/74 rbd_header.15d8670238e1f29 0x20 0 WC/0
>
> Also found this in the log of osd.75 at the same time, but the client IP is not the same as the node which experienced the hang.

Can you bump debug_ms and debug_osd to 30 on osd75? I doubt it's an issue with that particular OSD, but if it goes down the same way again, I'd have something to look at. Make sure logrotate is configured and working before doing that though... ;)

Thanks,

Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
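
For anyone comparing several of these osdc dumps, a rough Python sketch like the one below can tally the in-flight requests per OSD and per object, which makes it easier to see whether the stuck writes keep landing on the same object. It is only an illustration: the field positions (tid, OSD, PG, up set, acting set, object name) are assumed from the dump quoted above rather than from any documented format, and `summarize_osdc` and `SAMPLE` are made-up names. The sample is a short excerpt of the quoted dump, so its line count does not match the `REQUESTS 13` header.

```python
from collections import Counter

# Excerpt of the osdc dump quoted above (not the full 13 requests).
SAMPLE = """\
REQUESTS 13 homeless 0
130947221 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
130947226 osd75 17.4f47f0c3 [75,14,72]/75 [75,14,72]/75 rbd_data.1555406238e1f29.000000000007c8a9 0x400024 1 0'0 set-alloc-hint,write
130947349 osd75 17.dbb45597 [75,73,25]/75 [75,73,25]/75 rbd_data.158f204238e1f29.0000000000080171 0x400024 1 0'0 set-alloc-hint,write
130947421 osd75 17.32207383 [75,14,72]/75 [75,14,72]/75 rbd_data.15d8670238e1f29.0000000000000000 0x400024 1 0'0 set-alloc-hint,write
LINGER REQUESTS
18446462598732840990 osd74 17.145baa0f [74,72,14]/74 [74,72,14]/74 rbd_header.158f204238e1f29 0x20 0 WC/0
"""


def summarize_osdc(text):
    """Count in-flight requests per OSD and per object name.

    Only the REQUESTS section is counted; parsing stops at LINGER REQUESTS.
    Field positions are an assumption based on the dump in this thread.
    """
    per_osd = Counter()
    per_object = Counter()
    in_requests = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("REQUESTS"):
            in_requests = True
            continue
        if line.startswith("LINGER REQUESTS"):
            break
        fields = line.split()
        # A request line starts with a numeric tid and has at least 6 fields.
        if not in_requests or len(fields) < 6 or not fields[0].isdigit():
            continue
        per_osd[fields[1]] += 1      # e.g. "osd75"
        per_object[fields[5]] += 1   # e.g. "rbd_data.<id>.<offset>"
    return per_osd, per_object


if __name__ == "__main__":
    per_osd, per_object = summarize_osdc(SAMPLE)
    for osd, count in per_osd.most_common():
        print(osd, count)
    for obj, count in per_object.most_common():
        print(obj, count)
```

To run it against a live client, the text would come from the osdc file under /sys/kernel/debug/ceph/ instead of `SAMPLE`, which normally needs root and a mounted debugfs.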