Re: Proper procedure to replace DB/WAL SSD

David,

Yes, I know; I use 20GB partitions for 2TB disks as journals. I just wanted to inform other people that Ceph's default of 1GB is pretty low.
Now that I reread my own sentence, it does indeed look as if I were using 1GB partitions; sorry for the confusion.

Caspar

2018-02-27 14:11 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
If you're only using a 1GB DB partition, there is a very real possibility it's already 100% full. The safe estimate for DB size seems to be 10GB per 1TB, so for a 4TB OSD a 40GB DB should work for most use cases (except loads and loads of small files). There are a few threads that mention how to check how much of your DB partition is in use. Once it's full, it spills over to the HDD.
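For reference, one way to check it yourself (a rough sketch: osd.0 is just an example ID, this has to run on that OSD's host, and the filter assumes jq is installed) is to look at the bluefs counters in the perf dump:

  # Show BlueFS DB usage for osd.0 via its admin socket (values are in bytes)
  ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
  # slow_used_bytes > 0 means the DB has already spilled over onto the slow (HDD) device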


On Tue, Feb 27, 2018, 6:19 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
2018-02-24 7:10 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
Caspar, it looks like your idea should work. The worst-case scenario seems to be that the OSDs wouldn't start, in which case you'd put the old SSD back in and fall back to the original idea: weight them to 0, let backfilling complete, then recreate the OSDs. Definitely worth a try in my opinion, and I'd love to hear about your experience afterwards.


Hi David,

First of all, thank you for ALL your answers on this ML; you put a lot of effort into answering the many questions asked here, and your answers very often contain invaluable information.


To follow up on this post, I went out and built a very small (Proxmox) cluster (3 OSDs per host) to test my suggestion of cloning the DB/WAL SSD. And it worked!
Note: this was on Luminous v12.2.2 (all BlueStore, ceph-disk based OSDs).

Here's what I did on one node:

1) ceph osd set noout
2) systemctl stop ceph-osd@0; systemctl stop ceph-osd@1; systemctl stop ceph-osd@2
3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
4) removed the old SSD physically from the node
5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs up/in
6) ceph osd unset noout

I assume that once the ddrescue step finishes, a 'partprobe' or something similar is triggered, udev finds the DB partitions on the new SSD, and the OSDs are started again (similar to what happens during hotplug).
So it is probably better to clone the SSD in another (non-Ceph) system so as not to trigger any udev events.
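If you want to double-check things before (re)starting the OSDs, something along these lines should do it (a sketch only; /dev/sdY is just a placeholder for the new SSD):

  # Each OSD's block.db symlink typically points to a /dev/disk/by-partuuid/... path;
  # ddrescue copies the GPT including the partition GUIDs, so the links should still resolve
  ls -l /var/lib/ceph/osd/ceph-*/block.db
  # Re-read the partition table on the new SSD in case udev didn't pick it up automatically
  partprobe /dev/sdY
  # Confirm the partuuids now resolve to partitions on the new SSD
  ls -l /dev/disk/by-partuuid/ | grep sdY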

I also tested a reboot after this and everything still worked.


The old SSD was 120GB and the new one is 256GB (cloning took around 4 minutes).
The data delta was very low because it was a test cluster.

All in all, the OSDs in question were 'down' for only 5 minutes, so I stayed within the default mon_osd_down_out_interval of 10 minutes and didn't actually need to set noout :)
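For anyone wanting to check or temporarily widen that window, a sketch (assuming the mon ID matches the short hostname; injectargs changes are not persistent):

  # Show the current value (default 600 seconds = 10 minutes)
  ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval
  # Temporarily raise it for a longer maintenance window
  ceph tell mon.* injectargs '--mon_osd_down_out_interval 1800'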

I kicked off a brief discussion about this with some of the BlueStore guys and they're aware of the problem with migrating across SSDs, but so far it's just a Trello card: https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
They do confirm you should be okay with dd'ing things across, assuming symlinks get set up correctly as David noted.

 
Great that it is on the radar to be addressed; this method does feel hacky.
 
I've got some other bad news, though: BlueStore has internal metadata about the size of the block device it's using, so if you copy it onto a larger block device, it will not actually make use of the additional space. :(
-Greg

Yes, I was well aware of that, no problem. The reason is that the smaller SSD sizes are simply not being made anymore or have been discontinued by the manufacturer.
It would be nice, though, if the DB could be resized in the future; the default 1GB DB size seems very small to me.

Caspar
 
 

Kind regards,
Caspar

 
Nico, it is not possible to change the WAL or DB size, location, etc. after OSD creation. If you want to change the configuration of the OSD after creation, you have to remove it from the cluster and recreate it. There is no functionality similar to the way you could move or recreate FileStore OSD journals. I think this might be on the radar as a feature, but I don't know for certain. I definitely consider it to be a regression in BlueStore.
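Roughly, the remove-and-recreate path on Luminous with ceph-disk looks like this (a sketch only; osd.5, /dev/sdc and /dev/nvme0n1p1 are example names, and the syntax differs if you use ceph-volume instead):

  # Drain the OSD and wait for backfill to finish
  ceph osd out 5
  # Then stop and remove it
  systemctl stop ceph-osd@5
  ceph osd purge 5 --yes-i-really-mean-it
  # Recreate it with its DB on a partition of the new SSD
  ceph-disk prepare --bluestore /dev/sdc --block.db /dev/nvme0n1p1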




On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <nico.schottelius@xxxxxxxxxxx> wrote:

A very interesting question, and I would add the follow-up question:

Is there an easy way to add an external DB/WAL device to an existing
OSD?

I suspect that it might be something along the lines of:

- stop osd
- create a link in ...ceph/osd/ceph-XX/block.db to the target device
- (maybe run some kind of osd mkfs ?)
- start osd

Has anyone done this so far or recommendations on how to do it?

Which also makes me wonder: what is actually the on-disk format of the WAL and
the block DB in BlueStore? Is there any documentation available about it?
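From a quick look, ceph-bluestore-tool seems to be able to at least dump the on-disk label of such a device, e.g. (untested sketch; /dev/sdb1 is just an example DB partition):

  # Print the BlueStore label stored at the start of the device (osd uuid, size, description, ...)
  ceph-bluestore-tool show-label --dev /dev/sdb1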

Best,

Nico


Caspar Smit <casparsmit@xxxxxxxxxxx> writes:

> Hi All,
>
> What would be the proper way to preventively replace a DB/WAL SSD (when it
> is nearing its DWPD/TBW limit but has not failed yet)?
>
> It hosts DB partitions for 5 OSDs.
>
> Maybe something like:
>
> 1) 'ceph osd reweight' the 5 OSDs to 0
> 2) let backfilling complete
> 3) destroy/remove the 5 OSDs
> 4) replace the SSD
> 5) create 5 new OSDs with separate DB partitions on the new SSD
>
> When these 5 OSDs are big HDDs (8TB), a LOT of data has to be moved, so I
> thought maybe the following would work:
>
> 1) ceph osd set noout
> 2) stop the 5 OSDs (systemctl stop)
> 3) 'dd' the old SSD to a new SSD of the same or bigger size
> 4) remove the old SSD
> 5) start the 5 OSDs (systemctl start)
> 6) let backfilling/recovery complete (only the delta of data between OSD stop
> and now)
> 7) ceph osd unset noout
>
> Would this be a viable method to replace a DB SSD? Is there any udev/serial
> nr/uuid stuff preventing this from working?
>
> Or is there another 'less hacky' way to replace a DB SSD without moving too
> much data?
>
> Kind regards,
> Caspar


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
