Hi Jeff,

I believe these are normal; they are just idle connections to the OSDs being timed out because no traffic has flowed over them recently. They are probably a symptom rather than a cause.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jeff Epstein
> Sent: 23 April 2015 15:19
> To: Lionel Bouton; Christian Balzer
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: long blocking with writes on rbds
>
> The appearance of these socket closed messages seems to coincide with the slowdown symptoms. What is the cause?
>
> 2015-04-23T14:08:47.111838+00:00 i-65062482 kernel: [ 4229.485489] libceph: osd1 192.168.160.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:09:06.961823+00:00 i-65062482 kernel: [ 4249.332547] libceph: osd2 192.168.96.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:09:09.701819+00:00 i-65062482 kernel: [ 4252.070594] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:09:10.381817+00:00 i-65062482 kernel: [ 4252.755400] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:09:14.831817+00:00 i-65062482 kernel: [ 4257.200257] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:13:57.061877+00:00 i-65062482 kernel: [ 4539.431624] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:13:57.541842+00:00 i-65062482 kernel: [ 4539.913284] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:13:59.801822+00:00 i-65062482 kernel: [ 4542.177187] libceph: osd3 192.168.0.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:14:11.361819+00:00 i-65062482 kernel: [ 4553.733566] libceph: osd4 192.168.64.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:14:47.871829+00:00 i-65062482 kernel: [ 4590.242136] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:14:47.991826+00:00 i-65062482 kernel: [ 4590.364078] libceph: osd2 192.168.96.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:15:00.081817+00:00 i-65062482 kernel: [ 4602.452980] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
> 2015-04-23T14:16:21.301820+00:00 i-65062482 kernel: [ 4683.671614] libceph: osd5 192.168.128.4:6800 socket closed (con state OPEN)
>
> Jeff
>
> On 04/23/2015 12:26 AM, Jeff Epstein wrote:
> >
> >>>> Do you have some idea how I can diagnose this problem?
> >>>
> >>> I'd look at the ceph -s output while you have these stuck processes, to see if there's any unusual activity (scrub/deep scrub/recovery/backfills/...). Is it correlated in any way with rbd removal (i.e. write blocking doesn't appear unless you removed at least one rbd within, say, the hour before the write performance problems)?
> >>
> >> I'm not familiar with Amazon VMs. If you map the rbds using the kernel driver to local block devices, do you have control over the kernel you run? (I've seen reports of various problems with older kernels, and you probably want the latest possible.)
> >
> > ceph status shows nothing unusual. However, on the problematic node, we typically see entries in ps like this:
> >
> > 1468 12329 root  D  0.0 mkfs.ext4 wait_on_page_bit
> > 1468 12332 root  D  0.0 mkfs.ext4 wait_on_buffer
> >
> > Notice the "D" blocking state. Here, mkfs is stuck in these wait functions for long periods of time.
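To see exactly where those mkfs processes are blocked, it may be worth dumping their kernel stacks rather than going by the single symbol ps shows. A rough sketch along these lines (only an illustration, assuming a standard Linux /proc layout; reading /proc/<pid>/stack needs root and a kernel built with CONFIG_STACKTRACE) lists every task in the D state together with its kernel stack:

#!/usr/bin/env python
# Sketch: find tasks in uninterruptible sleep (state "D") and, where
# possible, print their kernel stacks from /proc/<pid>/stack.
# Assumes a standard Linux /proc; run as root to read the stack files.
import os

def task_state(pid):
    # The third field of /proc/<pid>/stat is the state; the comm field
    # is wrapped in parentheses and may contain spaces, so split after ')'.
    with open("/proc/%s/stat" % pid) as f:
        data = f.read()
    comm = data[data.index("(") + 1:data.rindex(")")]
    state = data[data.rindex(")") + 2:].split()[0]
    return comm, state

def main():
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            comm, state = task_state(pid)
        except (IOError, OSError):
            continue  # process exited while we were looking
        if state != "D":
            continue
        print("%s (%s) is in uninterruptible sleep" % (pid, comm))
        try:
            with open("/proc/%s/stack" % pid) as f:
                print(f.read())
        except (IOError, OSError):
            print("  (kernel stack unavailable; need root and CONFIG_STACKTRACE)")

if __name__ == "__main__":
    main()

If the stacks consistently end in wait_on_page_bit / wait_on_buffer, the tasks are simply waiting for I/O against the rbd device to complete, which again points at the writes stalling below the filesystem rather than at mkfs itself.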
> > (Also, we are formatting the RBDs as ext4 even though the OSDs are xfs; I assume this shouldn't be a problem?)
> >
> > We're on kernel 3.18.4pl2, which is pretty recent. Still, an outdated kernel driver isn't out of the question; if anyone has any concrete information, I'd be grateful.
> >
> > Jeff

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com