On Fri, 13 Sep 2013, Dominik Mostowiec wrote:
> Hi,
> I have ntpd installed on the servers, and the time seems to be ok.
>
> I have a strange log entry:
> 2013-09-12 07:34:40.238659 7fd63ac3e700 -1
> mon.4@3(peon).paxos(auth active c 581328..581348) lease_expire
> from mon.0 10.177.64.4:6789/0 is 0.075434 seconds in the past; mons are laggy
> or clocks are too skewed
>
> But the value 0.075434 is small.
>
> In ceph.conf I have:
> mon allowed clock drift = 2

This is too high; it should stay under a second. 50ms is the default, by
the way, but something like 500ms ought to be fine.

sage

>
> Sometimes the mons report:
> 2013-09-13 00:11:14.556410 7fd63ac3e700 0 log [INF] : mon.4 calling
> new monitor election
> 2013-09-13 00:11:14.557306 7fd6317b8700 0 -- 10.177.64.7:6789/0 >>
> 10.177.64.5:6789/0 pipe(0xdbc2000 sd=18 :6789 s=0 pgs=0 cs=0
> l=0).accept connect_seq 112 vs existing 112 state connecting
> 2013-09-13 00:11:14.557374 7fd638525700 0 -- 10.177.64.7:6789/0 >>
> 10.177.64.9:6789/0 pipe(0x14766c80 sd=24 :6789 s=0 pgs=0 cs=0
> l=0).accept connect_seq 112 vs existing 112 state connecting
> 2013-09-13 00:11:14.557398 7fd638c2c700 0 -- 10.177.64.7:6789/0 >>
> 10.177.64.6:6789/0 pipe(0x14766a00 sd=23 :6789 s=0 pgs=0 cs=0
> l=0).accept connect_seq 126 vs existing 126 state connecting
> 2013-09-13 00:11:16.467636 7fd631ebf700 0 -- 10.177.64.7:6789/0 >>
> 10.177.64.8:6789/0 pipe(0x7038c80 sd=20 :6789 s=0 pgs=0 cs=0
> l=0).accept connect_seq 122 vs existing 122 state connecting
> 2013-09-13 00:11:21.553559 7fd63ac3e700 0 log [INF] : mon.4 calling
> new monitor election
>
> --
> Dominik
>
> 2013/9/13 Joao Eduardo Luis <joao.luis@xxxxxxxxxxx>:
> > On 09/13/2013 03:38 AM, Sage Weil wrote:
> >>
> >> On Thu, 12 Sep 2013, Dominik Mostowiec wrote:
> >>>
> >>> Hi,
> >>> Today I had some issues with the ceph cluster.
> >>> After a new mon election many OSDs were marked failed.
> >>> Some time later the OSDs booted and, I think, recovered, because many
> >>> slow requests appeared.
> >>> The cluster came back after about 20 minutes.
> >>
> >> Was there some other event that triggered the mon election? There's not
> >> much here to go on except that several elections were called and by
> >> different monitors, which suggests something was not quite right.
> >
> > My best guess would be clock skews, as they tend to be annoying like that.
> >
> > Setting 'debug mon = 10' on the monitors should provide more insight though.
> >
> > -Joao
> >
> >>
> >> sage
> >>
> >>>
> >>> cluster:
> >>> ceph version 0.56.6
> >>> 6 servers x 26 osd
> >>>
> >>> 2013-09-12 07:11:40.920384 mon.1 10.177.64.5:6789/0 353 : [INF] mon.3
> >>> calling new monitor election
> >>> 2013-09-12 07:12:40.992532 mon.3 10.177.64.7:6789/0 364 : [INF] mon.4
> >>> calling new monitor election
> >>> 2013-09-12 07:12:41.024954 mon.4 10.177.64.8:6789/0 360 : [INF] mon.2
> >>> calling new monitor election
> >>> 2013-09-12 07:13:02.782203 mon.2 10.177.64.6:6789/0 336 : [INF] mon.1
> >>> calling new monitor election
> >>> 2013-09-12 07:13:02.783778 mon.3 10.177.64.7:6789/0 366 : [INF] mon.4
> >>> calling new monitor election
> >>> 2013-09-12 07:13:10.852842 mon.3 10.177.64.7:6789/0 367 : [INF] mon.4
> >>> calling new monitor election
> >>> 2013-09-12 16:17:09.484277 mon.4 10.177.64.8:6789/0 363 : [INF] mon.2
> >>> calling new monitor election
> >>> 2013-09-12 16:17:09.497337 mon.3 10.177.64.7:6789/0 368 : [INF] mon.4
> >>> calling new monitor election
> >>> 2013-09-12 16:17:09.523787 mon.0 10.177.64.4:6789/0 4369021 : [INF]
> >>> mon.0 calling new monitor election
> >>> 2013-09-12 16:17:14.525282 mon.0 10.177.64.4:6789/0 4369022 : [INF]
> >>> mon.0@0 won leader election with quorum 0,1,2,3,4
> >>> ...
> >>> 2013-09-12 16:17:14.689555 mon.0 10.177.64.4:6789/0 4369027 : [DBG]
> >>> osd.130 10.177.64.9:6801/1401 reported failed by osd.121
> >>> 10.177.64.7:6909/29496
> >>> 2013-09-12 16:17:14.689584 mon.0 10.177.64.4:6789/0 4369028 : [DBG]
> >>> osd.131 10.177.64.9:6810/2435 reported failed by osd.121
> >>> 10.177.64.7:6909/29496
> >>> 2013-09-12 16:17:14.689600 mon.0 10.177.64.4:6789/0 4369029 : [DBG]
> >>> osd.132 10.177.64.9:6846/2885 reported failed by osd.121
> >>> 10.177.64.7:6909/29496
> >>> 2013-09-12 16:17:14.689615 mon.0 10.177.64.4:6789/0 4369030 : [DBG]
> >>> osd.134 10.177.64.9:6855/3223 reported failed by osd.121
> >>> 10.177.64.7:6909/29496
> >>> 2013-09-12 16:17:14.689630 mon.0 10.177.64.4:6789/0 4369031 : [DBG]
> >>> osd.136 10.177.64.9:6865/3559 reported failed by osd.121
> >>> 10.177.64.7:6909/29496
> >>> 2013-09-12 16:17:14.689645 mon.0 10.177.64.4:6789/0 4369032 : [DBG]
> >>> osd.141 10.177.64.9:6904/4259 reported failed by osd.121
> >>> 10.177.64.7:6909/29496
> >>>
> >>> --
> >>> Regards
> >>> Dominik
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> > --
> > Joao Eduardo Luis
> > Software Engineer | http://inktank.com | http://ceph.com
>
> --
> Regards
> Dominik
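
As a rough aid for the question of which monitors called elections and when, the log excerpts quoted in this thread can be tallied with a small script. This is only a sketch: it assumes each log entry sits on a single line (unwrapped, unlike the quoted mail text), and the names `ELECTION_RE` and `count_elections` are made up for illustration, not part of Ceph. The sample lines are copied from the logs above.

```python
import re
from collections import Counter

# Matches both log shapes seen in the thread, e.g.
#   ... 364 : [INF] mon.4 calling new monitor election      (cluster log)
#   ... 0 log [INF] : mon.4 calling new monitor election    (daemon log)
ELECTION_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2})\.\d+ "
    r".*\[INF\]\s*:?\s*(?P<caller>mon\.\S+) calling new monitor election"
)


def count_elections(lines):
    """Count election calls per (minute, calling monitor)."""
    counts = Counter()
    for line in lines:
        m = ELECTION_RE.search(line)
        if m:
            minute = f"{m['date']} {m['time'][:5]}"  # truncate to HH:MM
            counts[(minute, m['caller'])] += 1
    return counts


# Entries taken from the thread, joined back onto single lines.
sample = [
    "2013-09-12 07:12:40.992532 mon.3 10.177.64.7:6789/0 364 : [INF] mon.4 calling new monitor election",
    "2013-09-12 07:13:02.782203 mon.2 10.177.64.6:6789/0 336 : [INF] mon.1 calling new monitor election",
    "2013-09-12 07:13:02.783778 mon.3 10.177.64.7:6789/0 366 : [INF] mon.4 calling new monitor election",
    "2013-09-12 07:13:10.852842 mon.3 10.177.64.7:6789/0 367 : [INF] mon.4 calling new monitor election",
    "2013-09-12 16:17:14.525282 mon.0 10.177.64.4:6789/0 4369022 : [INF] mon.0@0 won leader election with quorum 0,1,2,3,4",
    "2013-09-13 00:11:14.556410 7fd63ac3e700 0 log [INF] : mon.4 calling new monitor election",
]

if __name__ == "__main__":
    for (minute, caller), n in sorted(count_elections(sample).items()):
        print(minute, caller, n)
```

Several calls from the same monitor within a minute or two, as with mon.4 here, would be the kind of pattern worth correlating with the clock-skew warnings and with 'debug mon = 10' output.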