On 04/12/2016 04:27 PM, Eric Hall wrote:
On 4/12/16 9:53 AM, Joao Eduardo Luis wrote:
So this looks like the monitors didn't remove version 1, but this may
just be a red herring.
What matters, really, is the values in 'first_committed' and
'last_committed'. If either first or last_committed happens to be '1',
then there may be a bug somewhere in the code, but I doubt that. This
seems just an artefact.
So, it would be nice if you could provide the value of both
'osdmap:first_committed' and 'osdmap:last_committed'.
mon1:
(osdmap, last_committed)
0000 : 01 00 00 00 00 00 00 00 : ........
(osdmap, fist_committed) does not exist
mon2:
(osdmap, last_committed)
0000 : 01 00 00 00 00 00 00 00 : ........
(osdmap, fist_committed) does not exist
mon3:
(osdmap, last_committed)
0000 : 01 00 00 00 00 00 00 00 : ........
(osdmap, first_committed)
0000 : b8 94 00 00 00 00 00 00
Wow! This is unexpected, but fits the assertion just fine.
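As a sanity check on the dumps above, the values can be read as little-endian 64-bit integers. This is an assumption about how the store encodes versions, though it agrees with the 38072 lower bound of the osdmap intervals. A minimal Python sketch; decode_version is a hypothetical helper, not part of any Ceph tool:

```python
import struct


def decode_version(hex_bytes: str) -> int:
    """Decode a space-separated dump of 8 hex bytes as a little-endian uint64."""
    raw = bytes.fromhex(hex_bytes.replace(" ", ""))
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes, got {len(raw)}")
    (value,) = struct.unpack("<Q", raw)
    return value


# last_committed as dumped on all three monitors
print(decode_version("01 00 00 00 00 00 00 00"))  # -> 1
# first_committed as dumped on mon3
print(decode_version("b8 94 00 00 00 00 00 00"))  # -> 38072
```

Read this way, last_committed is 1 on every monitor while first_committed on mon3 is 38072, i.e. last_committed is lower than first_committed.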
The solution, I think, will be rewriting first_committed and
last_committed on all monitors - except on mon1.
As per your previous email, in which you listed the osdmap version
intervals for each monitor, it seems like mon1 contains incremental
versions [38072..38630] and full versions [38072..38456] - i.e., there's
a bunch of full versions missing from 38456 to 38630.
The other two monitors do not seem afflicted by this gap.
This will not necessarily be a problem as long as osdmap:full_latest
contains the version of the latest full map in the monitor's store. If
by any chance osdmap:full_latest contains a lower version than the
lowest full map version available, or a greater version than the highest
full map version, then problems will ensue.
That said, I would advise performing the following on a copy of your monitors
(injecting a custom monmap to make it run solo[1]), so that any
still-running osds are not affected by any eventual side effects. Once
you are sure no assertions have been hit and the monitor is running
fine, feel free to apply these to your monitors.
1. set osdmap:first_committed to 38072
2. set osdmap:last_committed to 38630
3. set osdmap:full_latest to whatever is the latest full_XXXXX version
on the monitor.
3.1. this means 38630 on mon2 and mon3 - but 38456 on mon1
Setting versions should be as simple as
ceph-kvstore-tool ${MONDATA}/store.db set osdmap ${KEY} ver ${VER}
with ${KEY} being either first_committed, last_committed or full_latest
and ${VER} being the appropriate value.
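To make the per-monitor differences explicit, here is a rough Python sketch that only builds and prints the command lines in the form given above; it does not run them. The store paths (/tmp/monN-copy) and the helper names are hypothetical, and the range check simply restates the full_latest condition described earlier:

```python
def check_full_latest(full_latest: int, lowest_full: int, highest_full: int) -> bool:
    """True if full_latest lies within the full map versions present in the store."""
    return lowest_full <= full_latest <= highest_full


def plan_commands(mondata: str, highest_full: int,
                  first: int = 38072, last: int = 38630) -> list:
    """Build the three 'set' invocations for one monitor store copy."""
    values = {
        "first_committed": first,
        "last_committed": last,
        "full_latest": highest_full,
    }
    return [
        f"ceph-kvstore-tool {mondata}/store.db set osdmap {key} ver {ver}"
        for key, ver in values.items()
    ]


# Highest full map version per monitor, taken from step 3.1.
# The paths are placeholders for copies of each monitor's data directory.
monitors = {
    "/tmp/mon1-copy": 38456,
    "/tmp/mon2-copy": 38630,
    "/tmp/mon3-copy": 38630,
}

for mondata, highest_full in monitors.items():
    for cmd in plan_commands(mondata, highest_full):
        print(cmd)

# mon1 holds full maps [38072..38456], so 38456 is fine but 38630 would not be.
print(check_full_latest(38456, 38072, 38456))
print(check_full_latest(38630, 38072, 38456))
```

Printing the commands first makes it easy to review them before touching even the copied stores.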
Hope this helps.
-Joao
[1] This assert is only triggered once a quorum is formed, which means
you'll either have to have all the monitors running, or force the
quorum to consist of just one single monitor.
Furthermore, the code is asserting on a basic check in
OSDMonitor::update_from_paxos(), which is definitely not expected to fail.
It would also be nice if you could point us to a mon log with
'--debug-mon 20' from start to hitting the assertion. Feel free to send
it directly to me if you don't want it sitting on the internet.
Here is from mon2 (cephsecurestore2 IRL), which starts and dies with the
assert:
http://www.isis.vanderbilt.edu/mon2.log
Here is from mon3 (cephsecurestore3 IRL), which starts and runs, but
can't form quorum and never gives up on mon1 and mon2. Removing mon1
and mon2 from mon3's monmap via monmap extract/rm/inject results in the same
FAILED assert as the others:
http://www.isis.vanderbilt.edu/mon3.log
My thought was that if I could resolve the last_committed problem on
mon3, then it might have a chance sans mon1 and mon2.
Thank you,
--
Eric
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com