Hi all,
I had just rebooted all 3 nodes (one after the other) of a small Proxmox-VE
Ceph cluster. All nodes are mons and have two OSDs each.
During the reboot of one node, Ceph was stuck for longer than normal, so I
looked at the "ceph -w" output to find the reason.
That was not the reason, but I am wondering why "osd marked itself down" is
not recognised by the mons:
2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with quorum 0,2
2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2 0,2
2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s wr, 15 op/s
2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s wr, 12 op/s
2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s rd, 135 kB/s wr, 15 op/s
2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s rd, 189 kB/s wr, 7 op/s
2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2 reporters from different host after 21.222945 >= grace 20.388836)
2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2 reporters from different host after 21.222970 >= grace 20.388836)
2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
Why is the mons' own failure detection trusted rather than the OSD's
"marked itself down" message? In this case the osdmap would have been
correct approx. 26 seconds earlier (the pgmap at 10:18:27.953410 is wrong).
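
For reference, the settings that seem to be involved here are the heartbeat
grace and the required number of failure reporters (the log message mentions
both). A quick sketch of how I checked them via the admin socket - assuming
the Jewel defaults and that mon.0 is the mon running locally on that node:

  # grace period before a non-responding OSD may be reported as failed
  ceph daemon mon.0 config get osd_heartbeat_grace
  # how many reporters, and from which CRUSH subtree level, the mon requires
  ceph daemon mon.0 config get mon_osd_min_down_reporters
  ceph daemon mon.0 config get mon_osd_reporter_subtree_level

Here everything is at the defaults, so the "2 reporters from different host
>= grace" behaviour matches those values.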
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
regards
Udo