filestore to bluestore: osdmap epoch problem and is the documentation correct?


 



Dear *,

Has anybody been successful in migrating Filestore OSDs to Bluestore OSDs while keeping the OSD number? There have been a number of messages on the list reporting problems, and my experience is the same. (Removing the existing OSD and creating a new one does work for me.)

I'm working on a Ceph 12.2.2 cluster and tried to follow http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd - this basically says:

1. destroy old OSD
2. zap the disk
3. prepare the new OSD
4. activate the new OSD
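
As a sketch, the documented procedure boils down to the following commands (OSD id 999, /dev/sdzz and the OSD fsid are placeholders, not values from the docs):

```shell
# 1. destroy the old OSD (keeps the id and CRUSH position)
ceph osd destroy 999 --yes-i-really-mean-it
# 2. zap the disk
ceph-volume lvm zap /dev/sdzz
# 3. prepare the new OSD, reusing the old id
ceph-volume lvm prepare --bluestore --osd-id 999 --data /dev/sdzz
# 4. activate the new OSD (<osd-fsid> as reported by "ceph-volume lvm list")
ceph-volume lvm activate 999 <osd-fsid>
```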

I never got step 4 to complete. The closest I got was by doing the following steps (assuming OSD ID "999" on /dev/sdzz):

1. Stop the old OSD via systemd (osd-node # systemctl stop ceph-osd@999.service)

2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)

3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's volume group

3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)

4. destroy the old OSD (osd-node # ceph osd destroy 999 --yes-i-really-mean-it)

5. create a new OSD entry (osd-node # ceph osd new $(cat /var/lib/ceph/osd/ceph-999/fsid) 999)

6. add the OSD secret to Ceph authentication (osd-node # ceph auth add osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-999/keyring)

7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore --osd-id 999 --data /dev/sdzz)
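
For reference, here are the same steps condensed into one sequence (OSD id 999 and /dev/sdzz are the placeholders from above; this is what I actually ran, not an endorsed procedure):

```shell
# stop and unmount the old Filestore OSD
systemctl stop ceph-osd@999.service
umount /var/lib/ceph/osd/ceph-999
# zap the block device (clean up any leftover LVM volume group first, if present)
ceph-volume lvm zap /dev/sdzz
# destroy the old OSD entry, then recreate it with the same id
ceph osd destroy 999 --yes-i-really-mean-it
ceph osd new $(cat /var/lib/ceph/osd/ceph-999/fsid) 999
# re-add the OSD key to Ceph authentication
ceph auth add osd.999 mgr 'allow profile osd' osd 'allow *' \
    mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-999/keyring
# prepare the new Bluestore OSD, reusing the id, then start it
ceph-volume lvm prepare --bluestore --osd-id 999 --data /dev/sdzz
systemctl start ceph-osd@999.service
```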

but ceph-osd keeps complaining "osdmap says I am destroyed, exiting" on "osd-node # systemctl start ceph-osd@999.service".

At first I felt I was hitting http://tracker.ceph.com/issues/21023 (BlueStore-OSDs marked as destroyed in OSD-map after v12.1.1 to v12.1.4 upgrade). But I was already using the "ceph osd new" command, which didn't help.

Some hours of sleep later I matched the issued commands to the osdmap changes and the ceph-osd log messages, which revealed something strange:

- after issuing "ceph osd destroy", the osdmap lists the OSD as "autoout,destroyed,exists" (no surprise here)
- after issuing "ceph osd new", the osdmap lists the OSD as "autoout,exists,new"
- starting ceph-osd after "ceph osd new" still reports "osdmap says I am destroyed, exiting"
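
The state flags above can be read straight from the current osdmap (OSD id is the placeholder again):

```shell
# print the current osdmap entry for osd.999; the state flags
# ("autoout,destroyed,exists" or "autoout,exists,new") are part of the line
ceph osd dump | grep '^osd.999 '
```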

I can see in the ceph-osd log that it refers to an *old* osdmap epoch, roughly 45 minutes old by that time.

This got me curious and I dug through the OSD log file, checking the epoch numbers during start-up:
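
The epochs from the log can be cross-checked against the cluster's map history, e.g. (epoch 110587 taken from my log below; the commands are standard, but this is just how I went about it):

```shell
# show the current osdmap epoch
ceph osd stat
# fetch a specific historical epoch and inspect it;
# osdmaptool prints the "modified" timestamp and the per-OSD states
ceph osd getmap 110587 -o /tmp/osdmap.110587
osdmaptool --print /tmp/osdmap.110587 | head -n 20
```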

I took some detours, so there are more than two failed starts in the OSD log file ;) :

--- cut here ---
# first of multiple attempts, before "ceph auth add ..."
# no actual epoch referenced, as login failed due to missing auth
2018-01-10 00:00:02.173983 7f5cf1c89d00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for clients
2018-01-10 00:00:02.173990 7f5cf1c89d00  0 osd.999 0 crush map has features 288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:00:02.173994 7f5cf1c89d00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for osds
2018-01-10 00:00:02.174046 7f5cf1c89d00  0 osd.999 0 load_pgs
2018-01-10 00:00:02.174051 7f5cf1c89d00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:00:02.174055 7f5cf1c89d00  0 osd.999 0 using weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors {default=true}
2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init authentication failed: (1) Operation not permitted

# after "ceph auth ..."
# note the different epochs below? BTW, 110587 is the current epoch at that time and osd.999 is marked destroyed there
# 109892: much too old to offer any details
# 110587: modified 2018-01-09 23:43:13.202381

2018-01-10 00:08:00.945507 7fc55905bd00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for clients
2018-01-10 00:08:00.945514 7fc55905bd00  0 osd.999 0 crush map has features 288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:08:00.945521 7fc55905bd00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for osds
2018-01-10 00:08:00.945588 7fc55905bd00  0 osd.999 0 load_pgs
2018-01-10 00:08:00.945594 7fc55905bd00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:08:00.945599 7fc55905bd00  0 osd.999 0 using weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:08:00.946544 7fc55905bd00 -1 osd.999 0 log_to_monitors {default=true}
2018-01-10 00:08:00.951720 7fc55905bd00  0 osd.999 0 done with init, starting boot process
2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for initial osdmap
2018-01-10 00:08:00.970644 7fc546614700  0 osd.999 109892 crush map has features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:08:00.970653 7fc546614700  0 osd.999 109892 crush map has features 288232610642264064 was 288232575208792577, adjusting msgr requires for mons
2018-01-10 00:08:00.970660 7fc546614700  0 osd.999 109892 crush map has features 1008808551021559808, adjusting msgr requires for osds
2018-01-10 00:08:01.349602 7fc546614700 -1 osd.999 110587 osdmap says I am destroyed, exiting

# another try
# it is now using epoch 110587 for everything. But that one is off by one at that time already:
# 110587: modified 2018-01-09 23:43:13.202381
# 110588: modified 2018-01-10 00:12:55.271913

# but both 110587 and 110588 have osd.999 as "destroyed", so never mind.
2018-01-10 00:13:04.332026 7f408d5a4d00  0 osd.999 110587 crush map has features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:13:04.332037 7f408d5a4d00  0 osd.999 110587 crush map has features 288232610642264064 was 8705, adjusting msgr requires for mons
2018-01-10 00:13:04.332043 7f408d5a4d00  0 osd.999 110587 crush map has features 1008808551021559808, adjusting msgr requires for osds
2018-01-10 00:13:04.332092 7f408d5a4d00  0 osd.999 110587 load_pgs
2018-01-10 00:13:04.332096 7f408d5a4d00  0 osd.999 110587 load_pgs opened 0 pgs
2018-01-10 00:13:04.332100 7f408d5a4d00  0 osd.999 110587 using weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:13:04.332990 7f408d5a4d00 -1 osd.999 110587 log_to_monitors {default=true}
2018-01-10 00:13:06.026628 7f408d5a4d00  0 osd.999 110587 done with init, starting boot process
2018-01-10 00:13:06.027627 7f4075352700 -1 osd.999 110587 osdmap says I am destroyed, exiting

# the attempt after using "ceph osd new", which created epoch 110591 as the first with osd.999 as autoout,exists,new
# But ceph-osd still uses 110587.
# 110587: modified 2018-01-09 23:43:13.202381
# 110591: modified 2018-01-10 00:30:44.850078

2018-01-10 00:31:15.453871 7f1c57c58d00  0 osd.999 110587 crush map has features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:31:15.453882 7f1c57c58d00  0 osd.999 110587 crush map has features 288232610642264064 was 8705, adjusting msgr requires for mons
2018-01-10 00:31:15.453887 7f1c57c58d00  0 osd.999 110587 crush map has features 1008808551021559808, adjusting msgr requires for osds
2018-01-10 00:31:15.453940 7f1c57c58d00  0 osd.999 110587 load_pgs
2018-01-10 00:31:15.453945 7f1c57c58d00  0 osd.999 110587 load_pgs opened 0 pgs
2018-01-10 00:31:15.453952 7f1c57c58d00  0 osd.999 110587 using weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:31:15.454862 7f1c57c58d00 -1 osd.999 110587 log_to_monitors {default=true}
2018-01-10 00:31:15.520533 7f1c57c58d00  0 osd.999 110587 done with init, starting boot process
2018-01-10 00:31:15.521278 7f1c40207700 -1 osd.999 110587 osdmap says I am destroyed, exiting
--- cut here ---


So why is ceph-osd referring to an old osdmap, while newer ones had been available for some time already?

And am I right to believe that *if* ceph-osd had checked the then-current osdmap, it would have started successfully (once I had done the "ceph osd new" that's not mentioned in the docs)?

Is the documented procedure (from the "master" HTML docs) correct, or should the "ceph auth" and "ceph osd new" steps get added?

Regards,
Jens

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


