Re: mon upgrades and leveldb->rocksdb conversion

> > On Mon, 24 Sep 2018, Sage Weil wrote:
> >> For the arch linux upgrade issue, the working theory is currently that 
> >> there was some subtle problem with moving from arch's leveldb to 
> >> the static rocksdb in mimic that made the mons discard a bunch of recent 
> >> updates, warping them back in time.

Well, it looks like this theory is wrong: the MonitorDBStore behavior wrt 
the leveldb->rocksdb opens has not changed since before luminous, and it 
does not do any implicit conversion: if the kv_backend file is missing, 
it is assumed to mean "leveldb" (and has been since 675f6d9d880).  So... 
I'm not really sure what caused Goktug's mons to warp back in time.  :(
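
For reference, a quick way to check which backend a given mon currently 
thinks it is using -- this assumes the default data dir layout and the 
cluster name "ceph", so adjust the path for your deployment:

  # prints "rocksdb" or "leveldb"; if the file is absent, leveldb is assumed
  cat /var/lib/ceph/mon/ceph-$(hostname -s)/kv_backend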

Setting that aside, though, we still do have the question of how to 
transition users from leveldb mons to rocksdb mons:

On Thu, 11 Oct 2018, Joao Eduardo Luis wrote:
> On 10/10/2018 05:21 PM, Sage Weil wrote:
> > Bumping this message w/ a new subject since it got lost in the other 
> > thread...
> 
> So, in the context of the leveldb->rocksdb conversion, this has come up at
> least a couple of times due to some downstream interest in migrating
> clusters from leveldb to rocksdb. I must say that the option I
> personally like the most is a live migration while the monitor is in
> quorum. I also think this should be mandated by the administrator,
> rather than having ceph-mon unilaterally decide "it's time".
> 
> I understand that this adds complexity to the process, but I do think it
> makes more sense to just create a new rocksdb store side-by-side with
> the existing leveldb store, and
> 
>  1. new values go into rocksdb; and
>  2. old values are asynchronously moved to rocksdb.
> 
> We will reach a point where leveldb can be closed and not opened again,
> at which point we instruct the administrator (or the tooling) to clean up
> the existing store.
> 
> Now, besides the added complexity, this could also require more disk
> space. I don't think it will be a lot, given we will be trimming from
> leveldb as well, but I'm assuming it will not be nothing.
> 
> If this is an upgrade solution that seems reasonable, I'm happy to work
> on it soon.

I'm worried the above is a lot of complexity and opportunity for bugs 
(and work to implement) for not a lot of gain.  What if we instead make 
ceph-monstore-tool have a 'convert' function that will do a conversion 
offline?  The admin can take each mon down in turn, convert it, and bring 
it back up.  Provisioning tools could automate this process.
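
A rough sketch of what that per-mon workflow could look like, assuming 
systemd-managed mons and the default data dir; note that the 'convert' 
subcommand is only being proposed here and does not exist yet:

  systemctl stop ceph-mon@a
  # hypothetical subcommand: rewrite the leveldb store as rocksdb in place
  ceph-monstore-tool /var/lib/ceph/mon/ceph-a convert
  systemctl start ceph-mon@a
  ceph quorum_status      # confirm mon.a is back in quorum before moving on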

This will require ~2x the disk space for the conversion.  OTOH, if space 
is tight, the user can also just blow away the mon entirely, recreate it, 
and let the normal sync bring it back into quorum...

sage



> 
>   -Joao
> 
> > 
> >> Possible ways to mitigate this:
> >>
> >> 1- Do not silently upgrade from leveldb to rocksdb.  We haven't seen these 
> >> problems before, but if they are possible, better safe than sorry.  It 
> >> means any upgraded clusters would be stuck on leveldb forever unless/until 
> >> some manual change is made by the operator.  :/
> >>
> >> 2- Update the upgrade documentation to only restart a single mon at a 
> >> time, and make sure it fully syncs and reforms quorum before doing the 
> >> next monitor.  This should have meant that even if a mon warped back in 
> >> time, when it joins quorum it would catch back up.  The cluster in 
> >> question must have restarted all mons at the same time.
> >>
> >> 3- Do an explicit conversion on upgrade (open with leveldb, rewrite into 
> >> rocksdb) instead of relying on the transparent upgrade from 
> >> leveldb->rocksdb working.  This is a new code path that would get 
> >> exercised (and more code to write), but it would also sidestep the issue 
> >> with ubuntu's leveldb library using the annoying nonstandard .ldb file 
> >> extensions, which prevent rocksdb upgrades.
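
For option 2 above, a minimal sketch of the one-mon-at-a-time restart, 
assuming systemd-managed mons named a, b and c; the quorum check is 
deliberately crude:

  for id in a b c; do
      systemctl restart ceph-mon@"$id"
      # wait until this mon shows up in quorum_names again before touching the next
      until ceph quorum_status -f json | grep -o '"quorum_names":[^]]*]' | grep -q "\"$id\""; do
          sleep 5
      done
  done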
> >>
> >> Other thoughts?
> >> sage
> >>
> >>
> >>
> >> On Mon, 24 Sep 2018, by morphin wrote:
> >>
> >>> Hey Sage!
> >>>
> >>> Thanks for the great support! You have saved me from a lot of trouble :)
> >>>
> >>> The failure reason was that the monitors' epoch number was much
> >>> different from the OSDs'. We have rebuilt the store.db with
> >>> ceph-objectstore-tool. I will post details later.
> >>>
> >>> And the reason for this problem seems to be my distro, Arch Linux. As far
> >>> as we can tell, the way Arch's packages are built might have caused this
> >>> problem.
> >>>
> >>> Thanks again Sage! Ceph saved my day!
> >>> Sage Weil <sage@xxxxxxxxxxxx> wrote on Mon, 24 Sep 2018 at 04:43:
> >>>>
> >>>> Some of the mons only have debug_ms=1 and not debug_mon=20, so I still
> >>>> can't find an instance where it has logged the mon processing an osd_boot
> >>>> message. Can you set debug_mon=20 and debug_ms=1 on all mons, restart all
> >>>> mons, and then restart several OSDs, so that we capture one?
> >>>>
> >>>> Thanks
> >>>> s
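
One way to do what Sage asks above, so that the debug settings survive the 
mon restarts: add them to the [mon] section of ceph.conf on each mon host, 
e.g.

  [mon]
      debug mon = 20
      debug ms = 1

then restart each ceph-mon (e.g. systemctl restart ceph-mon@<id>, assuming 
systemd-managed daemons) and restart a few OSDs.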
> >>>>
> >>>> On Mon, 24 Sep 2018, by morphin wrote:
> >>>>
> >>>>> Hi Sage,
> >>>>>
> >>>>> Thanks for your help!
> >>>>>
> >>>>> I am really desperate here. It is less than 6 hours before the day starts here.
> >>>>> 3mons is the attached log of all the mons (all three).
> >>>>> SEKUARK1 is the log of the mon after restarting osd.0.
> >>>>>
> >>>>> Hope this helps!
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> ceph -w https://paste.ubuntu.com/p/HkscWdbWWW/
> >>>>> ceph osd dump  https://paste.ubuntu.com/p/rXwxZRyNXC/
> >>>>>
> >>>>> MON's + OSD log:
> >>>>> https://www.dropbox.com/sh/g0o2eaw5zh2lccf/AADCz_ClkTl7UCHjVwRIYYiKa?dl=0
> >>>>>
> >>>>>
> >>>>>
> >>>>> Sage Weil <sage@xxxxxxxxxxxx> wrote on Mon, 24 Sep 2018 at 00:29:
> >>>>>>
> >>>>>> It looks like the OSD is sending the boot message but the mon is not
> >>>>>> marking it up.  Can you attach the output of 'ceph osd dump'?  Also, can
> >>>>>> you restart an OSD after the mon debug levels are turned up and then
> >>>>>> attach the mon logs?  I don't see it processing any osd_boot messages.
> >>>>>>
> >>>>>> (And add debug ms = 1 on the mons)
> >>>>>>
> >>>>>> Thanks!
> >>>>>> s
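
A runtime alternative for raising those mon debug levels without editing 
ceph.conf (it does not persist across a daemon restart), plus restarting 
one OSD to capture the boot exchange; the osd id is just an example:

  ceph tell mon.* injectargs '--debug-mon 20 --debug-ms 1'
  systemctl restart ceph-osd@8     # assuming systemd-managed OSDs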
> >>>>>>
> >>>>>> On Sun, 23 Sep 2018, by morphin wrote:
> >>>>>>
> >>>>>>> I collected more logs for you.
> >>>>>>> I started 2 OSDs (osd8 in DC A, and osd156 in DC B) with -debug-osd=20:
> >>>>>>>
> >>>>>>> OSD8: https://www.dropbox.com/s/5e01f5odtsq3iqi/ceph-osd.8.log?dl=0
> >>>>>>> OSD156: https://www.dropbox.com/s/ox7or2uizyiwdo7/ceph-osd.156.log?dl=0
> >>>>>>>
> >>>>>>> ceph osd stat: 168 osds: 0 up, 168 in; epoch: e37506
> >>>>>>> ceph -w https://paste.ubuntu.com/p/pRhPKvjqJK/
> >>>>>>>
> >>>>>>> by morphin <morphinwithyou@xxxxxxxxx> wrote on Sun, 23 Sep 2018 at 17:25:
> >>>>>>>>
> >>>>>>>> I tried, but I couldn't find anything clear from that.
> >>>>>>>>
> >>>>>>>> OSD: https://paste.ubuntu.com/p/P79fHxTv2G/
> >>>>>>>> MON: https://paste.ubuntu.com/p/yRnG9DwWpq/
> >>>>>>>> David Conisbee <davidconisbee@xxxxxxxxx> wrote on Sun, 23 Sep 2018 at 16:54:
> >>>>>>>>>
> >>>>>>>>> Have you tried debug osd = 5/5 in your ceph.conf to get more logging?
> >>>>>>>>>
> >>>>>>>>> On Sun, 23 Sep 2018, 11:41 morph in, <morphinwithyou@xxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hello again.
> >>>>>>>>>>
> >>>>>>>>>> I'm sending a 2nd mail because my problem is very urgent. I'd be very
> >>>>>>>>>> grateful if somebody could help.
> >>>>>>>>>>
> >>>>>>>>>> After the Luminous to Mimic upgrade, when I try to start an OSD, it gets
> >>>>>>>>>> stuck at "booting". (I edited the hostnames, so don't worry if they're
> >>>>>>>>>> not identical.)
> >>>>>>>>>>
> >>>>>>>>>> OSD log: https://paste.ubuntu.com/p/hFhc2dkSqb/
> >>>>>>>>>> MON log: https://paste.ubuntu.com/p/F85mYwvP4C/
> >>>>>>>>>> MGR log: https://paste.ubuntu.com/p/jYQ5kJstnH/
> >>>>>>>>>> CEPH.conf https://paste.ubuntu.com/p/qDwjzdsmGK/
> >>>>>>>>>> Telnet OSD to MON: https://paste.ubuntu.com/p/fbn9hTWv8q/
> >>>>>>>>>>
> >>>>>>>>>> I upgraded the system in this order:
> >>>>>>>>>>
> >>>>>>>>>> 1- Stop MDS -> OSDs -> MGR -> MONs -> servers
> >>>>>>>>>> 2- Upgrade OS image from 4.14.30-1-lts to 4.14.70-1-lts (Ceph, kernel, etc.)
> >>>>>>>>>> 3- Reboot servers and restore backups.
> >>>>>>>>>> 4- Start mons; check was ok.
> >>>>>>>>>> 5- Start mgrs; check was ok.
> >>>>>>>>>> 6- Check versions: https://paste.ubuntu.com/p/bxqF9wgDMn/
> >>>>>>>>>> 7- Start OSDs; all the OSDs got stuck at "booting":
> >>>>>>>>>> https://paste.ubuntu.com/p/NY6SP2MBmd/
> >>>>>>>>>> 8- I did not start the MDS.
> >>>>>>>>>>
> >>>>>>>>>> The above procedure was tested on my test servers: I upgraded 3 test
> >>>>>>>>>> servers in this order, and when I started the OSDs they came up pretty
> >>>>>>>>>> fast without problems, and cluster health was OK. However, in my PROD
> >>>>>>>>>> cluster the OSDs do start after the upgrade, but they get stuck at
> >>>>>>>>>> booting status. The only differences in PROD are the network and the
> >>>>>>>>>> number of OSDs.
> >>>>>>>>>>
> >>>>>>>>>> I need a debugging method for the OSDs, because the OSDs give no clue
> >>>>>>>>>> as to what I should do!
> >>>>>>>>>> As you can see, my mons & mgr are working properly, but the OSDs are not.
> >>>>>>>>>> I think this is because they can't talk to the MONs somehow.
> >>>>>>>>>> I tried marking all the OSDs "down" and restarting all OSDs, but
> >>>>>>>>>> nothing changed. I checked network communication between the OSDs and
> >>>>>>>>>> MONs and it seems fine.  I'm using 10G LACP with jumbo frames for the
> >>>>>>>>>> cluster network and 10G LACP for the public network, and it was working
> >>>>>>>>>> very well before the upgrade.
> >>>>>>>>>>
> >>>>>>>>>> I have checked everything I know. My last option is to downgrade, and I
> >>>>>>>>>> don't know whether that would solve my problem or not.
> >>>>>>>>>> My time is limited. I have a large amount of data in the data pool, and
> >>>>>>>>>> it needs to be ready on Monday.
> >>>>>>>>>>
> >>>>>>>>>> Please help me if you can.
> >>>>>>>>>>
> >>>>>>>>>> Best Regards.
> >>>>>>>>>> morph in <morphinwithyou@xxxxxxxxx> wrote on Sun, 23 Sep 2018 at 01:43:
> >>>>>>>>>>>
> >>>>>>>>>>> Hello. I upgraded my system from Luminous to Mimic.
> >>>>>>>>>>> I have 168 OSDs in my system. I'm using RAID1 NVMe for journals, and my pool was healthy before the upgrade.
> >>>>>>>>>>> I don't upgrade my system with update tools like apt or pacman; I'm using images, so all my OSes are the same, and the upgrade was done in maintenance mode with the cluster closed. I tested this upgrade 3 times on a test cluster of 2 servers with 12 OSDs.
> >>>>>>>>>>> After the upgrade on my prod cluster I see the OSDs are still at the booting stage,
> >>>>>>>>>>> and it was very fast before Mimic whenever I rebooted my cluster.
> >>>>>>>>>>> I followed the Mimic upgrade wiki step by step.
> >>>>>>>>>>> ceph -s : https://paste.ubuntu.com/p/p2spVmqvJZ/
> >>>>>>>>>>> an osd log: https://paste.ubuntu.com/p/PBG66qdHXc/
> >>>>>>>>>>> ceph daemon status https://paste.ubuntu.com/p/y7cVspr9cN/
> >>>>>>>>>>> 1- Why the hell does "ceph -s" show that if the OSDs are booting? It's so stupid and scary. And I didn't even start any MDS.
> >>>>>>>>>>> 2- Why does booting take so long? Is it because of the Mimic upgrade or something else?
> >>>>>>>>>>> 3- Will waiting for the OSDs to boot solve my problem, or should I do something?
> >>>>>>>>>>>
> >>>>>>>>>>> -----------------------------
> >>>>>>>>>>> ceph mon feature ls
> >>>>>>>>>>> all features
> >>>>>>>>>>>         supported: [kraken,luminous,mimic,osdmap-prune]
> >>>>>>>>>>>         persistent: [kraken,luminous,mimic,osdmap-prune]
> >>>>>>>>>>> on current monmap (epoch 10)
> >>>>>>>>>>>         persistent: [kraken,luminous,mimic,osdmap-prune]
> >>>>>>>>>>>         required: [kraken,luminous,mimic,osdmap-prune]
> >>>>>>>>>>>
> >>>>>>>>>>> ------------------------
> >>>>>>>>>>> ceph osd versions
> >>>>>>>>>>> {
> >>>>>>>>>>>     "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)": 50
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> For now I'm leaving my cluster in this state. I will be back 8 hours later. I need a running system by Monday morning.
> >>>>>>>>>>> Help me please.
> >>>>>>> _______________________________________________
> >>>>>>> Ceph-community mailing list
> >>>>>>> Ceph-community@xxxxxxxxxxxxxx
> >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> 
> 
