Re: filestore to bluestore: osdmap epoch problem and is the documentation correct?

Reed Dier <reed.dier@xxxxxxxxxxx> · Thu, 11 Jan 2018 14:24:52 -0600

Thank you for documenting your progress and peril on the ML.

Luckily I only have 24x 8TB HDDs and 50x 1.92TB SSDs to migrate over to bluestore.

8 nodes, 4 chassis (failure domain), 3 drives per node for the HDDs, so I’m able to do about 3 at a time (1 node) for rip/replace.

Definitely taking it slow and steady, and the SSDs will move quickly for backfills as well.
I’m seeing about 1TB/6hr on backfills without much performance hit on everything else. At about 5TB average utilization on each 8TB disk, that is roughly 30 hours per host; times 8 hosts, that is about 10 days, so a couple of weeks is a safe amount of headroom.
This write performance certainly seems better on bluestore than filestore, so that likely helps as well.
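
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the rate, utilization, and host count are the figures quoted in this message, and the assumption that the three HDDs in a host backfill in parallel (so a host takes as long as one disk) is an interpretation, not something stated outright.

```python
# Figures quoted in the message above.
HOURS_PER_TB = 6.0       # about 1 TB of backfill per 6 hours
USED_TB_PER_DISK = 5.0   # about 5 TB average utilization per 8 TB disk
HOSTS = 8                # one host (3 HDDs) rebuilt at a time


def hours_per_host(used_tb=USED_TB_PER_DISK, hours_per_tb=HOURS_PER_TB):
    """Time to refill one host.

    Assumption: the disks in a host backfill in parallel at the observed
    per-disk rate, so the host takes as long as a single disk.
    """
    return used_tb * hours_per_tb


def total_days(hosts=HOSTS):
    """Total time when hosts are converted strictly one after another."""
    return hosts * hours_per_host() / 24.0


print(hours_per_host())  # 30.0
print(total_days())      # 10.0
```

If the disks in a host turned out to backfill one after another instead, the per-host figure would be three times larger, which is why padding the estimate seems prudent.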

I expect I can refill an SSD OSD in about an hour or two, and will likely stagger those out.
But with such a small number of OSDs currently, I’m taking the by-hand approach rather than scripting it, so as to avoid similar pitfalls.

Reed 

On Jan 11, 2018, at 12:38 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:

I hear you on time. I have 350 x 6TB drives to convert. I recently posted about a disaster I created automating my migration. Good luck

On Jan 11, 2018 12:22 PM, "Reed Dier" <reed.dier@xxxxxxxxxxx> wrote:
I am in the process of migrating my OSDs to bluestore finally and thought I would give you some input on how I am approaching it.
Some of the saga you can find in another ML thread here: https://www.spinics.net/lists/ceph-users/msg41802.html

With my first OSD I was cautious: I marked the OSD out without taking it down, allowing it to move data off.
Some background on my cluster: this OSD is an 8TB spinner with an NVMe partition that was previously used for journaling in filestore and is intended to be used for block.db in bluestore.

Then I downed it, flushed the journal, destroyed it, zapped it with ceph-volume, set the norecover and norebalance flags, ran ceph osd crush remove osd.$ID, ceph auth del osd.$ID, and ceph osd rm osd.$ID, and used ceph-volume locally to create the new LVM target. Then I unset the norecover and norebalance flags and it backfilled like normal.

I initially ran into issues with specifying --osd-id causing my OSDs to fail to start, but after removing that I was able to get it to fill in the gap of the OSD I had just removed.
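
As a rough sketch of the cautious sequence described above, the following builds the command list as a dry run and only prints it. Nothing here executes against a cluster. The exact command spellings are an assumption about what was typed, the function name `rebuild_plan` and the device paths `/dev/sdX` and `/dev/nvme0n1p1` are hypothetical, and the order simply mirrors the order in which the steps are listed in the message. No explicit OSD id is passed to ceph-volume, matching the work-around mentioned above.

```python
def rebuild_plan(osd_id, data_dev, db_dev=None):
    """Return the ordered commands (argv lists) for one cautious rebuild.

    Dry run only: the caller decides whether and how to run anything.
    """
    osd = f"osd.{osd_id}"
    # Let ceph-volume pick the id itself; no explicit OSD id is passed.
    create = ["ceph-volume", "lvm", "create", "--bluestore", "--data", data_dev]
    if db_dev is not None:
        create += ["--block.db", db_dev]
    return [
        ["ceph", "osd", "out", str(osd_id)],
        # Wait here until the data has finished moving off the OSD.
        ["systemctl", "stop", f"ceph-osd@{osd_id}"],
        ["ceph-osd", "-i", str(osd_id), "--flush-journal"],
        ["ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it"],
        ["ceph-volume", "lvm", "zap", data_dev],
        ["ceph", "osd", "set", "norecover"],
        ["ceph", "osd", "set", "norebalance"],
        ["ceph", "osd", "crush", "remove", osd],
        ["ceph", "auth", "del", osd],
        ["ceph", "osd", "rm", osd],
        create,
        ["ceph", "osd", "unset", "norecover"],
        ["ceph", "osd", "unset", "norebalance"],
    ]


# Hypothetical OSD id and device paths, for illustration only.
for argv in rebuild_plan(12, "/dev/sdX", "/dev/nvme0n1p1"):
    print(" ".join(argv))
```

Keeping the plan as data rather than running it directly makes it easy to review each step by hand, which fits the by-hand approach taken in this thread.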

I’m now doing quicker, more destructive migrations in an attempt to reduce data movement.
This way I don’t read from the OSD I’m replacing, write to other OSDs temporarily, read back from the temporary OSDs, and write back to the ‘new’ OSD.
I’m just reading from replica and writing to ‘new’ OSD.
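
To put a number on the saving, here is a minimal sketch counting how much data is written per replaced OSD under each approach. The function name `tb_written` and the 5 TB input are illustrative assumptions; the model ignores any rebalancing Ceph might do beyond the two copies described above.

```python
def tb_written(used_tb, drain_first):
    """Data written to rebuild one OSD holding `used_tb` of data.

    drain_first=True : old OSD -> temporary peers, then peers -> new OSD.
    drain_first=False: surviving replicas -> new OSD only.
    """
    copies = 2 if drain_first else 1
    return copies * used_tb


print(tb_written(5.0, drain_first=True))   # 10.0
print(tb_written(5.0, drain_first=False))  # 5.0
```

The trade-off is that the in-place approach runs with reduced redundancy for the affected placement groups until the backfill completes.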

So I’m setting the norecover and norebalance flags, downing the OSD (but not marking it out; it stays in, and I also have the noout flag set), destroying and zapping it, recreating it using ceph-volume, and unsetting the flags, at which point it starts backfilling.
For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time to offload one and then backfill it again from the others. I trust my disks enough to backfill from the other disks, and it’s going well. I’m also seeing very good write performance when backfilling compared to previous drive replacements in filestore, so that’s very promising.

Reed

On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen <jmozdzen@xxxxxx> wrote:

Hi Alfredo,

thank you for your comments:

Quoting Alfredo Deza <adeza@xxxxxxxxxx>:
On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen <jmozdzen@xxxxxx> wrote:
Dear *,

has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
keeping the OSD number? There have been a number of messages on the list,
reporting problems, and my experience is the same. (Removing the existing
OSD and creating a new one does work for me.)

I'm working on a Ceph 12.2.2 cluster and tried following
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
- this basically says

1. destroy old OSD
2. zap the disk
3. prepare the new OSD
4. activate the new OSD

I never got step 4 to complete. The closest I got was by doing the following
steps (assuming OSD ID "999" on /dev/sdzz):

1. Stop the old OSD via systemd (osd-node # systemctl stop ceph-osd@999.service)

2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)

3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
volume group

3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)

4. destroy the old OSD (osd-node # ceph osd destroy 999 --yes-i-really-mean-it)

5. create a new OSD entry (osd-node # ceph osd new $(cat /var/lib/ceph/osd/ceph-999/fsid) 999)

Steps 5 and 6 are problematic if you are going to be trying ceph-volume
later on, which takes care of doing this for you.

6. add the OSD secret to Ceph authentication (osd-node # ceph auth add osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-999/keyring)

At first I tried to follow the documented steps (without my steps 5 and 6), which did not work for me. The documented approach failed with "init authentication failed: (1) Operation not permitted", because ceph-volume did not actually add the auth entry for me.

But even after manually adding the authentication, the "ceph-volume" approach failed, as the OSD was still marked "destroyed" in the osdmap epoch as used by ceph-osd (see the commented messages from ceph-osd.999.log below).

7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore --osd-id 999 --data /dev/sdzz)

You are going to hit a bug in ceph-volume that is preventing you from
specifying the osd id directly if the ID has been destroyed.

See http://tracker.ceph.com/issues/22642

If I read that bug description correctly, you're confirming why I needed step #6 above (manually adding the OSD auth entry). But even if ceph-volume had added it, the ceph-osd.log entries suggest that starting the OSD would still have failed, because of accessing the wrong osdmap epoch.

To me it seems like I'm hitting a bug outside of ceph-volume - unless it's ceph-volume that somehow determines which osdmap epoch is used by ceph-osd.

In order for this to work, you would need to make sure that the ID has
really been destroyed and avoid passing --osd-id in ceph-volume. The
caveat is that you will get whatever ID is available next in the cluster.

Yes, that's the work-around I then used - purge the old OSD and create a new one.

Thanks & regards,
Jens

[...]
--- cut here ---
# first of multiple attempts, before "ceph auth add ..."
# no actual epoch referenced, as login failed due to missing auth
2018-01-10 00:00:02.173983 7f5cf1c89d00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for clients
2018-01-10 00:00:02.173990 7f5cf1c89d00  0 osd.999 0 crush map has features
288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:00:02.173994 7f5cf1c89d00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for osds
2018-01-10 00:00:02.174046 7f5cf1c89d00  0 osd.999 0 load_pgs
2018-01-10 00:00:02.174051 7f5cf1c89d00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:00:02.174055 7f5cf1c89d00  0 osd.999 0 using weightedpriority
op queue with priority op cut off at 64.
2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors
{default=true}
2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init authentication
failed: (1) Operation not permitted

# after "ceph auth ..."
# note the different epochs below. BTW, 110587 is the current epoch at that
# time and osd.999 is marked destroyed there
# 109892: much too old to offer any details
# 110587: modified 2018-01-09 23:43:13.202381

2018-01-10 00:08:00.945507 7fc55905bd00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for clients
2018-01-10 00:08:00.945514 7fc55905bd00  0 osd.999 0 crush map has features
288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:08:00.945521 7fc55905bd00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for osds
2018-01-10 00:08:00.945588 7fc55905bd00  0 osd.999 0 load_pgs
2018-01-10 00:08:00.945594 7fc55905bd00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:08:00.945599 7fc55905bd00  0 osd.999 0 using weightedpriority
op queue with priority op cut off at 64.
2018-01-10 00:08:00.946544 7fc55905bd00 -1 osd.999 0 log_to_monitors
{default=true}
2018-01-10 00:08:00.951720 7fc55905bd00  0 osd.999 0 done with init,
starting boot process
2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for initial
osdmap
2018-01-10 00:08:00.970644 7fc546614700  0 osd.999 109892 crush map has
features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:08:00.970653 7fc546614700  0 osd.999 109892 crush map has
features 288232610642264064 was 288232575208792577, adjusting msgr requires
for mons
2018-01-10 00:08:00.970660 7fc546614700  0 osd.999 109892 crush map has
features 1008808551021559808, adjusting msgr requires for osds
2018-01-10 00:08:01.349602 7fc546614700 -1 osd.999 110587 osdmap says I am
destroyed, exiting

# another try
# it is now using epoch 110587 for everything. But that one is off by one at
# that time already:
# 110587: modified 2018-01-09 23:43:13.202381
# 110588: modified 2018-01-10 00:12:55.271913

# but both 110587 and 110588 have osd.999 as "destroyed", so never mind.
2018-01-10 00:13:04.332026 7f408d5a4d00  0 osd.999 110587 crush map has
features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:13:04.332037 7f408d5a4d00  0 osd.999 110587 crush map has
features 288232610642264064 was 8705, adjusting msgr requires for mons
2018-01-10 00:13:04.332043 7f408d5a4d00  0 osd.999 110587 crush map has
features 1008808551021559808, adjusting msgr requires for osds
2018-01-10 00:13:04.332092 7f408d5a4d00  0 osd.999 110587 load_pgs
2018-01-10 00:13:04.332096 7f408d5a4d00  0 osd.999 110587 load_pgs opened 0
pgs
2018-01-10 00:13:04.332100 7f408d5a4d00  0 osd.999 110587 using
weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:13:04.332990 7f408d5a4d00 -1 osd.999 110587 log_to_monitors
{default=true}
2018-01-10 00:13:06.026628 7f408d5a4d00  0 osd.999 110587 done with init,
starting boot process
2018-01-10 00:13:06.027627 7f4075352700 -1 osd.999 110587 osdmap says I am
destroyed, exiting

# the attempt after using "ceph osd new", which created epoch 110591 as the
# first with osd.999 as autoout,exists,new
# But ceph-osd still uses 110587.
# 110587: modified 2018-01-09 23:43:13.202381
# 110591: modified 2018-01-10 00:30:44.850078

2018-01-10 00:31:15.453871 7f1c57c58d00  0 osd.999 110587 crush map has
features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:31:15.453882 7f1c57c58d00  0 osd.999 110587 crush map has
features 288232610642264064 was 8705, adjusting msgr requires for mons
2018-01-10 00:31:15.453887 7f1c57c58d00  0 osd.999 110587 crush map has
features 1008808551021559808, adjusting msgr requires for osds
2018-01-10 00:31:15.453940 7f1c57c58d00  0 osd.999 110587 load_pgs
2018-01-10 00:31:15.453945 7f1c57c58d00  0 osd.999 110587 load_pgs opened 0
pgs
2018-01-10 00:31:15.453952 7f1c57c58d00  0 osd.999 110587 using
weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:31:15.454862 7f1c57c58d00 -1 osd.999 110587 log_to_monitors
{default=true}
2018-01-10 00:31:15.520533 7f1c57c58d00  0 osd.999 110587 done with init,
starting boot process
2018-01-10 00:31:15.521278 7f1c40207700 -1 osd.999 110587 osdmap says I am
destroyed, exiting
--- cut here ---
[...]
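
To follow which osdmap epoch ceph-osd reports at each point in a log like the one above, a small parser can help. This is a sketch under assumptions: the field layout (timestamp, thread id, log level, osd.N, epoch, message) is inferred from the lines quoted in this excerpt rather than taken from any Ceph specification, the function names are hypothetical, and the sample lines are three entries from the second attempt with the mail client's line wraps re-joined.

```python
import re

# Layout inferred from the excerpt: timestamp, thread, level, osd.N, epoch, message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+)\s+"
    r"osd\.(?P<osd>\d+)\s+(?P<epoch>\d+)\s+(?P<msg>.*)$"
)


def parse(lines):
    """Yield (epoch, message) for every line that matches the layout."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            yield int(m["epoch"]), m["msg"]


def epochs_seen(lines):
    """Distinct epochs in the order they first appear."""
    seen = []
    for epoch, _ in parse(lines):
        if epoch not in seen:
            seen.append(epoch)
    return seen


def destroyed_epoch(lines):
    """Epoch at which the OSD reports being destroyed, or None."""
    for epoch, msg in parse(lines):
        if "osdmap says I am destroyed" in msg:
            return epoch
    return None


SAMPLE = [
    "2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for initial osdmap",
    "2018-01-10 00:08:00.970644 7fc546614700  0 osd.999 109892 crush map has features 288232610642264064, adjusting msgr requires for clients",
    "2018-01-10 00:08:01.349602 7fc546614700 -1 osd.999 110587 osdmap says I am destroyed, exiting",
]

print(epochs_seen(SAMPLE))      # [0, 109892, 110587]
print(destroyed_epoch(SAMPLE))  # 110587
```

Wrapped continuation lines would need to be re-joined before parsing, since the pattern expects each entry on a single line.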

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
