Re: another assertion failure in monitor

This whole thing started with migrating from 0.56.7 to 0.72.2. First, we
started seeing failed assertions of (version == pg_map.version) in
PGMonitor.cc:273, but on one monitor (d) only. I attempted to resync the
failing monitor (d) with --force-sync from (c). (d) started to work, but
(c) started to fail with the (version == pg_map.version) assertion. So, I
tried re-syncing (c) from (d) with --force-resync. That's when (c)
started to fail with this particular (ret==0) assertion. I don't really
think the resync actually worked at all at that point.
Based on this, my guess is that you managed to bork the mon stores of both 'c' and 'd'.  See, when you force a sync you're basically telling the monitor to delete its store's contents and sync from somebody else.  If 'c' had a broken store after the conversion, that would have been propagated to 'd'.  Once you forced the sync of 'c', then the problem would have been propagated from 'd' to 'c'.

Well, nothing suggested that (c) was having any problems, besides being lonely. That's why I asked (d) to re-sync from it, expecting that it would rebuild the monitor store on (d), which was failing. Apparently (c) wasn't any good either, but that wasn't obvious.
 



I didn't find a way to fix this quickly enough, so I restored the mon
directories from backup and started again. The (version ==
pg_map.version) assertion came back: my backup was taken before I tried
the force-resync, but not before the migration started (it was stupid of
me not to have backed up before the migration). That's the point when I
tried all kinds of crazy stuff for a while.

After some poking around, what I ended up doing was simply removing the
'store.db' directory from the monitor fs and starting the monitors.
That just re-initiated the migration, and this time it was done in the
absence of client requests, and one monitor at a time.

And in a case like this, I would think this was a smart choice, allowing the monitors to reconvert the store from the old plain, file-based format to the new store.db format.  Given it worked, my guess is that the source of all your issues was an improperly converted monitor store -- but, once again, without the logs we can't ever be sure. :(
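
For anyone wanting to repeat that step, here is a minimal sketch of the "get store.db out of the way" part. It is not a Ceph tool: the function name, the backup naming scheme, and the idea of renaming instead of deleting are all assumptions made for illustration (the poster above removed the directory outright). It only touches the filesystem and should be run with the monitor daemon stopped, one monitor at a time.

```python
import time
from pathlib import Path


def set_aside_store_db(mon_data_dir, suffix=None):
    """Rename <mon_data_dir>/store.db to a backup name (hypothetical helper).

    Returns the backup path, or None if no store.db directory exists.
    Renaming keeps the old store around in case the reconversion fails.
    """
    store = Path(mon_data_dir) / "store.db"
    if not store.is_dir():
        return None  # nothing to move; the monitor has no converted store

    # Default to a timestamp so repeated runs do not collide.
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    backup = store.with_name(f"store.db.bak-{suffix}")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")

    store.rename(backup)
    return backup
```

The argument would be the monitor's data directory, wherever your deployment keeps it; the path is deliberately not hard-coded here. Starting the monitor afterwards is what triggered the reconversion in the case described above.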

Well, at this point I'm simply glad it worked. The situation was "OMG, the deployment is upside down", and things get lost easily :)

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
