Wido,

Sorry we lost track of this last week — we were all distracted by FAST 12! :)

So it looks like they're both on the same map and osd.4 is sending pings to
osd.19, but osd.19 is just ignoring them? Or do you really have debug_os on
and not debug_osd? :)
-Greg

On Wed, Feb 15, 2012 at 6:12 AM, Wido den Hollander <wido@xxxxxxxxx> wrote:
> Hi,
>
> On 08/10/2011 12:30 PM, Wido den Hollander wrote:
>>
>> Hi,
>>
>> I tried reproducing this today with the new heartbeat code, but couldn't.
>>
>> The new heartbeat code seems to work much better under load; I haven't
>> seen any bouncing like I saw before.
>
> I'm still observing this bouncing. After a couple of days of running my
> cluster, the OSDs started reporting each other as down again:
>
> [INF] osd.19 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6809/14435 failed (by
> osd.4 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6800/15389)
>
> [INF] osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6806/15035 failed (by
> osd.25 [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:6803/14344)
>
> [INF] osd.22 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6806/14527 failed (by
> osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/14188)
>
> It's the same pattern again: a couple of OSDs don't like each other anymore
> and start marking each other down.
>
> My ceph.conf: http://zooi.widodh.nl/ceph/ceph.conf
>
> I'm taking osd.19 and osd.4 here as an example.
>
> Some logs on Pastebin: http://pastebin.com/EYUHbPua
>
> All systems are idle, so this can't be an issue of limited CPU cycles.
>
> As this seems to keep coming back, what steps should we take next?
>
> debug_ms and os are on 20.
>
> What I noticed is that osd.4 never says "osd.19 already has epoch 19", but
> I guess that is because it thinks osd.19 is down?
>
> On osd.4 I never see a connection coming in from osd.19. I'm trying to
> figure out which of the two is wrong here.
>
> Any suggestions?
>
> Wido
>
>> Wido
>>
>> On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
>>>
>>> Hi,
>>>
>>> I'm seeing something weird which I don't want to report as an issue in
>>> the tracker yet.
>>>
>>> I suffered a pretty big crash which caused me to lose 15 of my 40 OSDs.
>>> Some died because my WD Green 2TB disks failed, others because btrfs
>>> didn't survive the hard reboot I had to give them.
>>>
>>> When watching 'ceph -w' I noticed this:
>>>
>>> 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
>>> active+clean, 10 peering, 1 crashed+peering, 14 down+peering,
>>> 7 active+clean+degraded, 7624 crashed+down+peering,
>>> 8 down+degraded+peering, 9 crashed+down+degraded+peering;
>>> 4417 GB data, 13121 GB used, 32504 GB / 46575 GB avail;
>>> 8255/3391107 degraded (0.243%)
>>> 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up
>>> {0=dga=up:replay(laggy or crashed)}
>>> 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
>>> 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
>>> [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
>>> 2011-08-05 13:32:03.225964   mon e1: 1 mons at
>>> {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
>>> 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
>>>
>>> Here you see that osd7 is reporting that osd37 is down and osd5 is
>>> reporting that osd19 is down.
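As an aside on the debug settings mentioned in this thread (Greg's debug_osd
vs. debug_os question and Wido's "debug_ms and os are on 20"): the usual way
to turn these up is via the "debug osd" and "debug ms" options. A rough
sketch; the values here are only examples and are not taken from Wido's
ceph.conf:

    [osd]
            debug osd = 20
            debug ms = 20

or injected at runtime, in the same style as the commands quoted further
down in the thread:

    $ ceph osd injectargs 4 '--debug-osd 20'
    $ ceph osd injectargs 4 '--debug-ms 20'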
>>>
>>> These messages keep being repeated, over and over. My 'debug ms' is at
>>> 1, so I took a look at the logs:
>>>
>>> Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
>>> Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
>>> Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
>>> Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
>>>
>>> osd5 keeps claiming that osd19 is down, but when I check the logs I see
>>> that osd19 is running and exchanging heartbeat messages with the other
>>> OSDs in the cluster.
>>>
>>> What seems odd to me is that there are no log messages about osd7
>>> claiming that osd37 is down, although it keeps being repeated by
>>> 'ceph -w'.
>>>
>>> $ ceph osd injectargs 7 '--debug-ms 1'
>>> $ ceph osd injectargs 37 '--debug-ms 1'
>>>
>>> I made sure that was injected, but still, no log messages.
>>>
>>> Going through the logs a bit further (after setting the logging to 20)
>>> I saw osd.19 reporting:
>>>
>>> Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740 >>
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
>>> pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
>>> off 0
>>>
>>> It seems that osd19 is communicating with osd5, but osd5 does not seem
>>> to communicate with osd19?
>>>
>>> The log also confirms that:
>>>
>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
>>> 0x6948000 con 0x1cd7710
>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
>>> osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
>>> v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 >>
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
>>> cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
>>>
>>> Now, to prevent (I tried...) my cluster from bouncing around, I placed
>>> this in the OSD section:
>>>
>>> osd min down reporters = 3
>>> osd min down reports = 5
>>>
>>> Since only osd5 and osd7 are sending out these reports, neither osd19
>>> nor osd37 gets marked down, but what is going on here?
>>>
>>> I haven't restarted the OSDs yet.
>>>
>>> I noticed that the OSD heartbeat process is currently being revised,
>>> but I thought it might be helpful to report this.
>>>
>>> Wido
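For context, a minimal sketch of the heartbeat and failure-reporting settings
involved here. Only the two "min down" values are taken from Wido's mail; the
grace value is an assumed default, though it does match the roughly 20-second
gap between the syslog timestamps and the cutoff values in the heartbeat_check
lines above:

    [osd]
            # an OSD reports a peer as failed once it has gone this many
            # seconds without seeing a heartbeat reply; the "cutoff" printed
            # by heartbeat_check is roughly now minus this grace period
            osd heartbeat grace = 20

            # the thresholds Wido added: the monitor should only mark an OSD
            # down after enough failure reports from enough distinct
            # reporters, which is why a single reporter (osd5 for osd19,
            # osd7 for osd37) is not enough to get them marked down
            osd min down reporters = 3
            osd min down reports = 5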
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html