Re: Kernel mounted RBD's hanging

> -----Original Message-----
> From: Ilya Dryomov [mailto:idryomov@xxxxxxxxx]
> Sent: 07 July 2017 11:32
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Kernel mounted RBD's hanging
> 
> On Fri, Jul 7, 2017 at 12:10 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Managed to catch another one, osd.75 again, not sure if that is an
> > indication of anything or just a co-incidence. osd.75 is one of 8 OSD's in a
> > cache tier, so all IO will be funnelled through them.
> >
> >
> >
> > Also found this in the log of osd.75 at the same time, but the client IP is not
> > the same as the node which experienced the hang.
> 
> Can you bump debug_ms and debug_osd to 30 on osd75?  I doubt it's an
> issue with that particular OSD, but if it goes down the same way again, I'd
> have something to look at.  Make sure logrotate is configured and working
> before doing that though... ;)
> 
> Thanks,
> 
>                 Ilya

So osd.75 was a coincidence: several other hangs have had outstanding requests to other OSDs. Because the target OSD varies from hang to hang, I haven't yet been able to capture OSD debug logs during one. I do think the CRC problem may now be fixed, though, by upgrading all clients to 4.11.1+.

Here is a series of osdc dumps, taken every minute during one of the hangs, this time with a different target OSD. The osdc dumps on another node show IO being processed normally over the same period, so the cluster itself is definitely handling IO fine while this node hangs. As I am using cache tiering with proxying, all IO will be going through just 8 OSDs. The hung host has 3 RBDs mounted, and all 3 hang.
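
For anyone wanting to gather the same kind of data, here is a minimal sketch of a once-a-minute collector. It assumes debugfs is mounted at /sys/kernel/debug and that it runs as root; the function names, the timestamp format, and the log path in the usage note are illustrative choices, not the exact tooling used for the dumps in this mail.

```python
import glob
import time
from datetime import datetime
from pathlib import Path

# Assumed location of the kernel client's osdc files (one per client instance).
DEFAULT_PATTERN = "/sys/kernel/debug/ceph/*/osdc"


def snapshot_osdc(pattern=DEFAULT_PATTERN):
    """Return a timestamp line followed by the contents of every osdc file."""
    parts = [datetime.now().strftime("%a %d %b %H:%M:%S %Y")]
    for path in sorted(glob.glob(pattern)):
        try:
            parts.append(Path(path).read_text().rstrip("\n"))
        except OSError as exc:
            # Keep going if one client instance disappears mid-run.
            parts.append(f"# could not read {path}: {exc}")
    return "\n".join(parts) + "\n"


def collect(logfile, interval=60, count=None, pattern=DEFAULT_PATTERN):
    """Append a snapshot to logfile every `interval` seconds.

    Runs forever when count is None, otherwise takes `count` snapshots.
    """
    taken = 0
    with open(logfile, "a") as log:
        while count is None or taken < count:
            log.write(snapshot_osdc(pattern))
            log.flush()  # make sure the data is on disk if the host hangs
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)
```

Calling `collect("/var/tmp/osdc-dumps.log")` would then append one dump per minute until interrupted.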

Latest hang:
Sat  8 Jul 18:49:01 BST 2017
REQUESTS 4 homeless 0
174662831       osd25   17.77737285     [25,74,14]/25   [25,74,14]/25   rbd_data.15d8670238e1f29.00000000000cf9f8       0x400024        1       0'0     set-alloc-hint,write
174662863       osd25   17.7b91a345     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.000000000002571c       0x400024        1       0'0     set-alloc-hint,write
174662887       osd25   17.6c2eaa93     [25,75,14]/25   [25,75,14]/25   rbd_data.158f204238e1f29.0000000000000008       0x400024        1       0'0     set-alloc-hint,write
174662925       osd25   17.32271445     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.0000000000000001       0x400024        1       0'0     set-alloc-hint,write
LINGER REQUESTS
18446462598732840990    osd74   17.145baa0f     [74,72,14]/74   [74,72,14]/74   rbd_header.158f204238e1f29      0x20    8       WC/0
18446462598732840991    osd74   17.7b4e2a06     [74,72,25]/74   [74,72,25]/74   rbd_header.1555406238e1f29      0x20    9       WC/0
18446462598732840992    osd74   17.eea94d58     [74,73,25]/74   [74,73,25]/74   rbd_header.15d8670238e1f29      0x20    8       WC/0
Sat  8 Jul 18:50:01 BST 2017
REQUESTS 5 homeless 0
174662831       osd25   17.77737285     [25,74,14]/25   [25,74,14]/25   rbd_data.15d8670238e1f29.00000000000cf9f8       0x400024        1       0'0     set-alloc-hint,write
174662863       osd25   17.7b91a345     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.000000000002571c       0x400024        1       0'0     set-alloc-hint,write
174662887       osd25   17.6c2eaa93     [25,75,14]/25   [25,75,14]/25   rbd_data.158f204238e1f29.0000000000000008       0x400024        1       0'0     set-alloc-hint,write
174662925       osd25   17.32271445     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.0000000000000001       0x400024        1       0'0     set-alloc-hint,write
174663129       osd25   17.32271445     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.0000000000000001       0x400024        1       0'0     set-alloc-hint,write
LINGER REQUESTS
18446462598732840990    osd74   17.145baa0f     [74,72,14]/74   [74,72,14]/74   rbd_header.158f204238e1f29      0x20    8       WC/0
18446462598732840991    osd74   17.7b4e2a06     [74,72,25]/74   [74,72,25]/74   rbd_header.1555406238e1f29      0x20    9       WC/0
18446462598732840992    osd74   17.eea94d58     [74,73,25]/74   [74,73,25]/74   rbd_header.15d8670238e1f29      0x20    8       WC/0
Sat  8 Jul 18:51:01 BST 2017
REQUESTS 5 homeless 0
174662831       osd25   17.77737285     [25,74,14]/25   [25,74,14]/25   rbd_data.15d8670238e1f29.00000000000cf9f8       0x400024        1       0'0     set-alloc-hint,write
174662863       osd25   17.7b91a345     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.000000000002571c       0x400024        1       0'0     set-alloc-hint,write
174662887       osd25   17.6c2eaa93     [25,75,14]/25   [25,75,14]/25   rbd_data.158f204238e1f29.0000000000000008       0x400024        1       0'0     set-alloc-hint,write
174662925       osd25   17.32271445     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.0000000000000001       0x400024        1       0'0     set-alloc-hint,write
174663129       osd25   17.32271445     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.0000000000000001       0x400024        1       0'0     set-alloc-hint,write
LINGER REQUESTS
18446462598732840990    osd74   17.145baa0f     [74,72,14]/74   [74,72,14]/74   rbd_header.158f204238e1f29      0x20    8       WC/0
18446462598732840991    osd74   17.7b4e2a06     [74,72,25]/74   [74,72,25]/74   rbd_header.1555406238e1f29      0x20    9       WC/0
18446462598732840992    osd74   17.eea94d58     [74,73,25]/74   [74,73,25]/74   rbd_header.15d8670238e1f29      0x20    8       WC/0
Sat  8 Jul 18:52:01 BST 2017
REQUESTS 6 homeless 0
174662831       osd25   17.77737285     [25,74,14]/25   [25,74,14]/25   rbd_data.15d8670238e1f29.00000000000cf9f8       0x400024        1       0'0     set-alloc-hint,write
174662863       osd25   17.7b91a345     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.000000000002571c       0x400024        1       0'0     set-alloc-hint,write
174662887       osd25   17.6c2eaa93     [25,75,14]/25   [25,75,14]/25   rbd_data.158f204238e1f29.0000000000000008       0x400024        1       0'0     set-alloc-hint,write
174662925       osd25   17.32271445     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.0000000000000001       0x400024        1       0'0     set-alloc-hint,write
174663129       osd25   17.32271445     [25,74,14]/25   [25,74,14]/25   rbd_data.1555406238e1f29.0000000000000001       0x400024        1       0'0     set-alloc-hint,write
174664149       osd25   17.b148df13     [25,75,14]/25   [25,75,14]/25   rbd_data.158f204238e1f29.0000000000091205       0x400024        1       0'0     set-alloc-hint,write
LINGER REQUESTS
18446462598732840990    osd74   17.145baa0f     [74,72,14]/74   [74,72,14]/74   rbd_header.158f204238e1f29      0x20    8       WC/0
18446462598732840991    osd74   17.7b4e2a06     [74,72,25]/74   [74,72,25]/74   rbd_header.1555406238e1f29      0x20    9       WC/0
18446462598732840992    osd74   17.eea94d58     [74,73,25]/74   [74,73,25]/74   rbd_header.15d8670238e1f29      0x20    8       WC/0

The dumps continue identically until 19:03.
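
To pick out the stuck requests from a long run of dumps like the above without reading them by eye, something along these lines should work. It is a rough sketch that assumes the log has exactly the layout shown here (a timestamp line, a REQUESTS section, then other upper-case section headers such as LINGER REQUESTS); `parse_osdc_log` and `stuck_requests` are hypothetical helper names, not part of any Ceph tool.

```python
from collections import Counter


def parse_osdc_log(text):
    """Split a log of timestamped osdc dumps into a list of snapshots.

    Each snapshot is {"time": str or None, "requests": {tid: (osd, object)}}.
    """
    snapshots = []
    section = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0] == "REQUESTS":
            section = "requests"
            if not snapshots:
                # Dump without a leading timestamp line.
                snapshots.append({"time": None, "requests": {}})
        elif fields[0].isupper():
            # Any other upper-case header, e.g. "LINGER REQUESTS".
            section = "other"
        elif fields[0].isdigit() and section is not None:
            if section == "requests" and len(fields) >= 6:
                # Columns: tid, osd, pgid, up set, acting set, object, ...
                snapshots[-1]["requests"][int(fields[0])] = (fields[1], fields[5])
        else:
            # Anything else is taken to be the timestamp starting a new dump.
            snapshots.append({"time": " ".join(fields), "requests": {}})
            section = None
    return snapshots


def stuck_requests(snapshots, min_snapshots=2):
    """Return {tid: (osd, object, times_seen)} for tids seen in at least
    min_snapshots dumps, i.e. requests that are not completing."""
    seen = Counter()
    info = {}
    for snap in snapshots:
        for tid, details in snap["requests"].items():
            seen[tid] += 1
            info[tid] = details
    return {tid: info[tid] + (n,) for tid, n in seen.items() if n >= min_snapshots}
```

Grouping the result by its first element (the OSD) shows at a glance whether all stuck requests in a given hang target the same OSD, as they do for osd25 in the dumps above.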

I realize that at this stage these reports are probably not revealing much more, so I will report back once I can gather further information from the OSDs. The problem does seem to be related to load, or at least to the number of RBDs mounted: the host that has only 2 RBDs mounted hardly experiences this problem at all.

Nick
