On Thu, 12 Jan 2017 13:59:12 -0800 Samuel Just wrote:

> That would work.
> -Sam

Having seen similar behavior in the past, I have made it a habit to
manually shut down services before a reboot. This is not limited to
Ceph, and these race conditions have definitely gotten worse with
systemd in general.

Christian

> On Thu, Jan 12, 2017 at 1:40 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > On Thu, Jan 12, 2017 at 1:37 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> >> Oh, this is basically working as intended. What happened is that the
> >> mon died before the pending map was actually committed. The OSD has a
> >> timeout (5s) after which it stops trying to mark itself down and just
> >> dies (so that OSDs don't hang when killed). It took a bit longer than
> >> 5s for the remaining 2 mons to form a new quorum, so they never got
> >> the MOSDMarkMeDown message and we had to do it the slow way. I would
> >> prefer this behavior to changing the mon shutdown process or making
> >> the OSDs wait longer, so I think that's it. If you want to avoid
> >> disruption with colocated mons and osds, stop the osds first
> >
> > We can probably make our systemd scripts do this automatically? Or at
> > least, there's a Ceph super-task thingy and I bet we can order the
> > shutdown so it waits to kill the monitor until all the OSD processes
> > have ended.
> >
> >> and then
> >> reboot.
> >
> >> -Sam
> >>
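A minimal sketch of the shutdown order Sam suggests, assuming the stock
ceph-osd.target/ceph-mon.target units of a systemd-managed Jewel
install (unit names can differ between distributions):

    # Stop the OSDs while the mon quorum is still intact, so their
    # MOSDMarkMeDown messages can actually be committed, then reboot.
    systemctl stop ceph-osd.target
    systemctl stop ceph-mon.target
    reboot

The automatic ordering Greg proposes could look something like the
following hypothetical drop-in (file name invented, untested sketch).
systemd stops units in the reverse of their After= ordering, so this
makes every OSD instance stop before the mon target during shutdown:

    # /etc/systemd/system/ceph-osd@.service.d/stop-before-mon.conf
    # Start OSDs after the mons; therefore stop them before the mons.
    [Unit]
    After=ceph-mon.target

Run "systemctl daemon-reload" afterwards so the drop-in takes effect.
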
> >> On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke <ulembke@xxxxxxxxxxxx> wrote:
> >>> Hi Sam,
> >>>
> >>> the web frontend of an external ceph-dash was interrupted until the
> >>> node was up again. The reboot took approx. 5 min.
> >>>
> >>> But the "ceph -w" output showed some I/O much sooner than that. I
> >>> will look at the output again tomorrow and create a ticket.
> >>>
> >>> Thanks
> >>>
> >>> Udo
> >>>
> >>> On 12.01.2017 20:02, Samuel Just wrote:
> >>>> How long did it take for the cluster to recover?
> >>>> -Sam
> >>>>
> >>>> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >>>>> On Thu, Jan 12, 2017 at 2:03 AM, <ulembke@xxxxxxxxxxxx> wrote:
> >>>>>> Hi all,
> >>>>>> I had just rebooted all 3 nodes (one after the other) of a small
> >>>>>> Proxmox-VE ceph cluster. All nodes are mons and have two OSDs.
> >>>>>> During the reboot of one node, ceph was stuck longer than normal,
> >>>>>> so I looked at the "ceph -w" output to find the reason.
> >>>>>>
> >>>>>> This is not the reason, but I wonder why "osd marked itself down"
> >>>>>> is not recognised by the mons:
> >>>>>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
> >>>>>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
> >>>>>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
> >>>>>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
> >>>>>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
> >>>>>> quorum 0,2
> >>>>>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down,
> >>>>>> quorum 0,2 0,2
> >>>>>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
> >>>>>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
> >>>>>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail;
> >>>>>> 239 kB/s wr, 15 op/s
> >>>>>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
> >>>>>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
> >>>>>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail;
> >>>>>> 73018 B/s wr, 12 op/s
> >>>>>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail;
> >>>>>> 59 B/s rd, 135 kB/s wr, 15 op/s
> >>>>>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail;
> >>>>>> 184 B/s rd, 189 kB/s wr, 7 op/s
> >>>>>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> >>>>>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> >>>>>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed
> >>>>>> (2 reporters from different host after 21.222945 >= grace 20.388836)
> >>>>>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed
> >>>>>> (2 reporters from different host after 21.222970 >= grace 20.388836)
> >>>>>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> >>>>>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
> >>>>>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
> >>>>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> >>>>>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
> >>>>>>
> >>>>>> Why trust the mon and not the osd? In this case the osdmap would
> >>>>>> have been right approx. 26 seconds earlier (the pgmap at
> >>>>>> 10:18:27.953410 is wrong).
> >>>>>>
> >>>>>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
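The "reporters ... >= grace" lines above are the mon's normal OSD
failure detection kicking in once the MOSDMarkMeDown path was lost. To
inspect the knobs involved, something like the following should work on
the hosts running the respective daemons, assuming admin-socket access
and Jewel-era option names (adjust the daemon IDs to your cluster):

    # Distinct failure reporters the mon needs before marking an OSD
    # down (the "2 reporters from different host" in the log above):
    ceph daemon mon.0 config get mon_osd_min_down_reporters

    # Base heartbeat grace; the ~20.4 s "grace" in the log is this
    # value adjusted by the laggy-heartbeat estimator:
    ceph daemon osd.4 config get osd_heartbeat_grace
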
> >>>>> That's not what anybody intended to have happen. It's possible the
> >>>>> simultaneous loss of a monitor and the OSDs is triggering a case
> >>>>> that's not behaving correctly. Can you create a ticket at
> >>>>> tracker.ceph.com with your logs, the steps you took, and the
> >>>>> symptoms you observed?
> >>>>> -Greg

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com