Re: Proper procedure to replace DB/WAL SSD

2018-02-26 18:02 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
I'm glad that I was able to help out. I wanted to point out that the reason those steps worked for you as quickly as they did is likely that you configured your block.db to use /dev/disk/by-partuuid/{guid} instead of /dev/sdx#. Had you configured your OSDs with /dev/sdx#, you would have needed to either modify them to point to the partuuid path or change them to the new device's name (which is a bad choice, as it will likely change on reboot). Changing your path for block.db is as simple as `ln -sf /dev/disk/by-partuuid/{uuid} /var/lib/ceph/osd/ceph-#/block.db` (target first, link name second) and then restarting the OSD to make sure that it can read from the new symlink location.
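
For reference, a minimal sketch of that relink for a single OSD (the OSD id and partuuid below are made-up placeholders; substitute your own):

  # find the partuuid of the new DB partition
  ls -l /dev/disk/by-partuuid/
  # repoint the OSD's block.db symlink at it (ln -sf <target> <link name>)
  ln -sf /dev/disk/by-partuuid/2f5e4c7a-0000-0000-0000-000000000000 /var/lib/ceph/osd/ceph-0/block.db
  # restart the OSD so it reopens the DB via the new symlink
  systemctl restart ceph-osd@0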


Yes, I (well, Proxmox) used /dev/disk/by-partuuid/{guid} style links.
 
I'm curious about your OSDs starting automatically after doing those steps as well. I would guess you deployed them with ceph-disk instead of ceph-volume; is that right? ceph-volume no longer uses udev rules and wouldn't have picked up these changes by itself.


Yes, ceph-disk based, so udev kicked in on the partprobe.
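
In case anyone wants to check which tool their OSDs were deployed with, a quick sketch (paths assume a stock Luminous package install):

  # ceph-disk deployments are GPT partitions handled by the ceph udev rules
  ls /lib/udev/rules.d/ | grep -i ceph
  ceph-disk list
  # ceph-volume (LVM) deployments show up here instead; empty output if there are none
  ceph-volume lvm list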

Caspar
 
On Mon, Feb 26, 2018 at 6:23 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
2018-02-24 7:10 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
Caspar, it looks like your idea should work. Worst case, the OSD wouldn't start; you'd put the old SSD back in and fall back to the plan of weighting them to 0, backfilling, then recreating the OSDs. Definitely worth a try in my opinion, and I'd love to hear your experience afterwards.


Hi David,

First of all, thank you for ALL your answers on this ML; you're putting a lot of effort into answering many of the questions asked here, and your answers very often contain invaluable information.


To follow up on this post, I went out and built a very small (Proxmox) cluster (3 OSDs per host) to test my suggestion of cloning the DB/WAL SSD. And it worked!
Note: this was on Luminous v12.2.2 (all BlueStore, ceph-disk based OSDs).

Here's what I did on one node:

1) ceph osd set noout
2) systemctl stop ceph-osd@0; systemctl stop ceph-osd@1; systemctl stop ceph-osd@2
3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
4) removed the old SSD physically from the node
5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs up/in
6) ceph osd unset noout

I assume that once the ddrescue step finished, a 'partprobe' or something similar was triggered, udev found the DB partitions on the new SSD, and the OSDs were started again (kind of like what happens during hotplug).
So it is probably better to clone the SSD in another (non-Ceph) system to avoid triggering any udev events.
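
If the OSDs don't come back up on their own after the clone, a hedged sketch of how to verify and nudge things manually (device names and OSD ids are placeholders):

  # re-read the new SSD's partition table so the kernel/udev see the cloned DB partitions
  partprobe /dev/sdX
  # the by-partuuid symlinks should now resolve to partitions on the new SSD
  ls -l /dev/disk/by-partuuid/
  ls -l /var/lib/ceph/osd/ceph-*/block.db
  # start the OSDs if udev didn't do it for you
  systemctl start ceph-osd@0 ceph-osd@1 ceph-osd@2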

I also tested a reboot after this and everything still worked.


The old SSD was 120GB and the new one is 256GB (cloning took around 4 minutes).
The data delta was very low because it was a test cluster.

All in all, the OSDs in question were 'down' for only 5 minutes, so I stayed within the default mon_osd_down_out_interval of 10 minutes and didn't actually need to set noout :)
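
For anyone wanting to double-check that interval on their own cluster, a quick sketch via a mon admin socket (run on a monitor host; the mon id is assumed to match the short hostname, which may not be true for your setup):

  ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval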

Kind regards,
Caspar

 
Nico, it is not possible to change the WAL or DB size, location, etc. after OSD creation. If you want to change the configuration of the OSD after creation, you have to remove it from the cluster and recreate it. There is no functionality similar to how you could move or recreate filestore OSD journals. I think this might be on the radar as a feature, but I don't know for certain. I definitely consider it to be a regression in bluestore.
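
A hedged sketch of that remove-and-recreate path on Luminous with ceph-disk (OSD id and devices are placeholders; check the docs for your exact release before running anything like this):

  # drain the OSD and wait for backfill to finish
  ceph osd out 7
  # then take it down and remove it from the cluster
  systemctl stop ceph-osd@7
  ceph osd purge 7 --yes-i-really-mean-it
  # recreate it with its DB on the (new) SSD; udev normally activates it afterwards
  ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY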




On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <nico.schottelius@xxxxxxxxxxx> wrote:

A very interesting question, and I would add the follow-up question:

Is there an easy way to add an external DB/WAL device to an existing
OSD?

I suspect that it might be something on the lines of:

- stop osd
- create a link in ...ceph/osd/ceph-XX/block.db to the target device
- (maybe run some kind of osd mkfs ?)
- start osd

Has anyone done this so far, or does anyone have recommendations on how to do it?
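
(Purely as a sketch of the idea above, and untested; as noted elsewhere in the thread, this is not supported on Luminous without recreating the OSD, since the "mkfs"/migration step doesn't exist. Device path and OSD id are placeholders.)

  systemctl stop ceph-osd@12
  ln -s /dev/disk/by-partuuid/<new-db-partuuid> /var/lib/ceph/osd/ceph-12/block.db
  # ...a migration/"mkfs" step would have to happen here, which bluestore does not provide...
  systemctl start ceph-osd@12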

Which also makes me wonder: what is actually the format of WAL and
BlockDB in bluestore? Is there any documentation available about it?

Best,

Nico


Caspar Smit <casparsmit@xxxxxxxxxxx> writes:

> Hi All,
>
> What would be the proper way to preventively replace a DB/WAL SSD (when it
> is nearing its DWPD/TBW limit but has not failed yet)?
>
> It hosts DB partitions for 5 OSDs.
>
> Maybe something like:
>
> 1) ceph osd reweight the 5 OSDs to 0
> 2) let backfilling complete
> 3) destroy/remove the 5 OSDs
> 4) replace the SSD
> 5) create 5 new OSDs with a separate DB partition on the new SSD
>
> When these 5 OSDs are big HDDs (8TB), a LOT of data has to be moved, so I
> thought maybe the following would work:
>
> 1) ceph osd set noout
> 2) stop the 5 OSDs (systemctl stop)
> 3) 'dd' the old SSD to a new SSD of the same size or bigger
> 4) remove the old SSD
> 5) start the 5 OSDs (systemctl start)
> 6) let backfilling/recovery complete (only the delta of data between OSD stop and
> now)
> 7) ceph osd unset noout
>
> Would this be a viable method to replace a DB SSD? Is there any udev/serial nr/uuid
> stuff preventing this from working?
>
> Or is there another 'less hacky' way to replace a DB SSD without moving too
> much data?
>
> Kind regards,
> Caspar


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
