Re: Proper procedure to replace DB/WAL SSD


 



On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
<dietmar.rieder@xxxxxxxxxxx> wrote:
> ... however, it would be nice if ceph-volume would also create the
> partitions for the WAL and/or DB if needed. Is there a special reason
> why this is not implemented?

Yes. The reason is that this was one of the most painful points in
ceph-disk (code- and maintenance-wise): being in the business of
understanding partitions, sizes, requirements, and devices is
non-trivial.

One of the reasons ceph-disk did this is that it required quite a
hefty amount of "special sauce" on partitions so that they would later
be discovered by mechanisms that included udev.

If an admin wanted more flexibility, we decided that it had to be up
to the configuration management system (or whatever deployment
mechanism) to provide it. For users who want a simple approach, in the
case of bluestore we have a 1:1 mapping of device -> logical volume -> OSD.
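
For the simple case that mapping is a single command; a minimal sketch
(the device name is only an example):

    # ceph-volume creates the VG/LV on the whole device and provisions the OSD
    ceph-volume lvm create --bluestore --data /dev/sdb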

On the ceph-volume side, implementing partition creation would also
mean adding similar support for logical volumes, which come in so many
variations that we were not willing to attempt to support them all.

Even a small subset would inevitably bring up the question of "why is
setup X not supported by ceph-volume if setup Y is?"

Configuration management systems are better suited for handling these
situations, and we would prefer to offload that responsibility to
those systems.
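
As a rough sketch of what that offloading looks like in practice, a
deployment tool (or an admin by hand) would pre-create the logical
volumes and only hand the finished ones to ceph-volume; the volume
group/LV names and sizes below are made up for illustration:

    # carve the shared SSD into DB logical volumes ahead of time
    vgcreate ceph-db-ssd /dev/nvme0n1
    lvcreate -L 40G -n db-0 ceph-db-ssd
    # data LV on the spinning disk
    vgcreate ceph-data-sdb /dev/sdb
    lvcreate -l 100%FREE -n data-0 ceph-data-sdb
    # ceph-volume only consumes what was prepared for it
    ceph-volume lvm create --bluestore --data ceph-data-sdb/data-0 \
        --block.db ceph-db-ssd/db-0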

>
> Dietmar
>
>
> On 02/27/2018 04:25 PM, David Turner wrote:
>> Gotcha.  As a side note, that setting is only used by ceph-disk, since
>> ceph-volume does not create partitions for the WAL or DB.  You need to
>> create those partitions manually if using anything other than a whole
>> block device when creating OSDs with ceph-volume.
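>>
>> For example, if the DB is to live on a partition of a shared SSD, that
>> partition has to exist before ceph-volume is called; a rough sketch
>> (device names and the 40G size are only examples):
>>
>>     # pre-create a 40G DB partition on the shared SSD
>>     sgdisk -n 1:0:+40G -c 1:osd-0-db /dev/nvme0n1
>>     partprobe /dev/nvme0n1
>>     # then point ceph-volume at the HDD and the new partition
>>     ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1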
>>
>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
>>
>>     David,
>>
>>     Yes, I know, I use 20GB partitions as journals for 2TB disks. It was
>>     just to inform other people that Ceph's default of 1GB is pretty low.
>>     Now that I read my own sentence it indeed looks as if I was using
>>     1GB partitions, sorry for the confusion.
>>
>>     Caspar
>>
>>     2018-02-27 14:11 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
>>
>>         If you're only using a 1GB DB partition, there is a very real
>>         possibility it's already 100% full. The safe estimate for DB
>>         size seems to be 10GB per 1TB, so for a 4TB OSD a 40GB DB should
>>         work for most use cases (except loads and loads of small files).
>>         There are a few threads that mention how to check how much of
>>         your DB partition is in use. Once it's full, it spills over to
>>         the HDD.
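>>
>>         One way to check is via the OSD's admin socket, which exposes
>>         the bluefs counters for the DB device; a quick sketch (the OSD
>>         id is only an example, run on the node hosting it):
>>
>>             ceph daemon osd.0 perf dump bluefs
>>             # compare db_used_bytes with db_total_bytes; a non-zero
>>             # slow_used_bytes means data has already spilled to the HDD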
>>
>>
>>         On Tue, Feb 27, 2018, 6:19 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
>>
>>             2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>>
>>                 On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit <casparsmit@xxxxxxxxxxx> wrote:
>>
>>                     2018-02-24 7:10 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
>>
>>                         Caspar, it looks like your idea should work.
>>                         Worst case scenario seems to be that the OSDs
>>                         wouldn't start; you'd put the old SSD back in and
>>                         go back to the idea of weighting them to 0,
>>                         backfilling, then recreating the OSDs. Definitely
>>                         worth a try in my opinion, and I'd love to hear
>>                         about your experience afterwards.
>>
>>
>>                     Hi David,
>>
>>                     First of all, thank you for ALL your answers on this
>>                     ML; you're really putting a lot of effort into
>>                     answering many questions asked here, and very often
>>                     they contain invaluable information.
>>
>>
>>                     To follow up on this post, I went out and built a
>>                     very small (proxmox) cluster (3 OSDs per host) to
>>                     test my suggestion of cloning the DB/WAL SSD. And it
>>                     worked!
>>                     Note: this was on Luminous v12.2.2 (all bluestore,
>>                     ceph-disk based OSDs).
>>
>>                     Here's what I did on 1 node (a shell sketch of the
>>                     same steps follows the list):
>>
>>                     1) ceph osd set noout
>>                     2) systemctl stop osd.0; systemctl stop
>>                     osd.1; systemctl stop osd.2
>>                     3) ddrescue -f -n -vv <old SSD dev> <new SSD dev>
>>                     /root/clone-db.log
>>                     4) removed the old SSD physically from the node
>>                     5) checked with "ceph -s" and already saw HEALTH_OK
>>                     and all OSDs up/in
>>                     6) ceph osd unset noout
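>>
>>                     The same sequence as plain shell commands; the unit
>>                     names and device paths below are only examples and
>>                     will differ per setup:
>>
>>                         ceph osd set noout
>>                         systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2
>>                         # clone the old DB/WAL SSD onto the replacement
>>                         ddrescue -f -n -vv /dev/sdc /dev/sdd /root/clone-db.log
>>                         # pull the old SSD; in the test above udev brought
>>                         # the OSDs back by itself, but starting them
>>                         # explicitly is harmless
>>                         systemctl start ceph-osd@0 ceph-osd@1 ceph-osd@2
>>                         ceph -s           # wait for HEALTH_OK, all OSDs up/in
>>                         ceph osd unset noout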
>>
>>                     I assume that once the ddrescue step is finished a
>>                     'partprobe' or something similar is triggered, and
>>                     udev finds the DB partitions on the new SSD and
>>                     starts the OSDs again (kind of what happens during
>>                     hotplug). So it is probably better to clone the SSD
>>                     in another (non-ceph) system so as not to trigger
>>                     any udev events.
>>
>>                     I also tested a reboot after this and everything
>>                     still worked.
>>
>>
>>                     The old SSD was 120GB and the new one is 256GB
>>                     (cloning took around 4 minutes).
>>                     The delta of data was very low because it was a test
>>                     cluster.
>>
>>                     All in all the OSDs in question were 'down' for
>>                     only 5 minutes (so I stayed within the
>>                     mon_osd_down_out_interval default of 10 minutes and
>>                     didn't actually need to set noout :)
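>>
>>                     That interval can be checked (and temporarily raised
>>                     for longer maintenance) roughly like this; the mon id
>>                     is only an example:
>>
>>                         ceph daemon mon.node1 config get mon_osd_down_out_interval
>>                         # e.g. raise it to 30 minutes for a slower swap
>>                         ceph tell mon.* injectargs '--mon_osd_down_out_interval 1800'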
>>
>>
>>                 I kicked off a brief discussion about this with some of
>>                 the BlueStore guys and they're aware of the problem with
>>                 migrating across SSDs, but so far it's just a Trello
>>                 card: https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>                 They do confirm you should be okay with dd'ing things
>>                 across, assuming symlinks get set up correctly as David
>>                 noted.
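>>
>>                 The symlinks in question are the block.db (and, if used,
>>                 block.wal) links in the OSD data directory; a quick way
>>                 to sanity-check them after a clone (paths are examples):
>>
>>                     ls -l /var/lib/ceph/osd/ceph-0/block.db
>>                     # ddrescue copies the GPT, so the partuuid the link
>>                     # points at normally stays valid; if not, re-link it:
>>                     ln -sf /dev/disk/by-partuuid/<new-db-partuuid> \
>>                         /var/lib/ceph/osd/ceph-0/block.db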
>>
>>
>>             Great that it is on the radar to address. This method feels
>>             hacky.
>>
>>
>>                 I've got some other bad news, though: BlueStore has
>>                 internal metadata about the size of the block device
>>                 it's using, so if you copy it onto a larger block
>>                 device, it will not actually make use of the additional
>>                 space. :(
>>                 -Greg
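>>
>>                 That recorded size can be seen in the bluestore label,
>>                 for example with the OSD stopped (the path is an example,
>>                 and this assumes the installed ceph-bluestore-tool has
>>                 show-label):
>>
>>                     # the "size" field is what bluestore believes the
>>                     # device size to be, regardless of the real device
>>                     ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block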
>>
>>
>>             Yes, I was well aware of that, no problem. The reason is
>>             that the smaller SSD sizes are simply not being made anymore
>>             or have been discontinued by the manufacturer.
>>             It would be nice, though, if the DB could be resized in the
>>             future; the default 1GB DB size seems very small to me.
>>
>>             Caspar
>>
>>
>>
>>
>>
>>                     Kind regards,
>>                     Caspar
>>
>>
>>
>>                         Nico, it is not possible to change the WAL or DB
>>                         size, location, etc. after OSD creation. If you
>>                         want to change the configuration of the OSD
>>                         after creation, you have to remove it from the
>>                         cluster and recreate it. There is no equivalent
>>                         of how you could move, recreate, etc. filestore
>>                         OSD journals. I think this might be on the radar
>>                         as a feature, but I don't know for certain. I
>>                         definitely consider it to be a regression in
>>                         bluestore.
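>>
>>                         For reference, the remove-and-recreate path on
>>                         Luminous looks roughly like this (the OSD id,
>>                         device names and DB LV are only examples):
>>
>>                             ceph osd out 5
>>                             # wait for backfill to finish (ceph -s), then:
>>                             systemctl stop ceph-osd@5
>>                             ceph osd purge 5 --yes-i-really-mean-it
>>                             # wipe the old data device, then recreate the
>>                             # OSD with its DB on the new SSD
>>                             ceph-volume lvm create --bluestore --data /dev/sdf \
>>                                 --block.db ceph-db-ssd/db-5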
>>
>>
>>
>>
>>                         On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <nico.schottelius@xxxxxxxxxxx> wrote:
>>
>>
>>                             A very interesting question, and I would add
>>                             the follow-up question:
>>
>>                             Is there an easy way to add external DB/WAL
>>                             devices to an existing OSD?
>>
>>                             I suspect that it might be something along
>>                             the lines of:
>>
>>                             - stop osd
>>                             - create a link in
>>                             ...ceph/osd/ceph-XX/block.db to the target
>>                             device
>>                             - (maybe run some kind of osd mkfs ?)
>>                             - start osd
>>
>>                             Has anyone done this so far, or does anyone
>>                             have recommendations on how to do it?
>>
>>                             Which also makes me wonder: what is actually
>>                             the format of WAL and
>>                             BlockDB in bluestore? Is there any
>>                             documentation available about it?
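>>
>>                             For what it's worth, the DB is a RocksDB
>>                             database and the WAL is RocksDB's write-ahead
>>                             log, both stored via BlueFS. One way to peek
>>                             inside is to export the BlueFS files from a
>>                             stopped OSD; the path and out-dir below are
>>                             only examples:
>>
>>                                 ceph-bluestore-tool bluefs-export \
>>                                     --path /var/lib/ceph/osd/ceph-0 \
>>                                     --out-dir /tmp/bluefs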
>>
>>                             Best,
>>
>>                             Nico
>>
>>
>>                             Caspar Smit <casparsmit@xxxxxxxxxxx> writes:
>>
>>                             > Hi All,
>>                             >
>>                             > What would be the proper way to preventively
>>                             > replace a DB/WAL SSD (when it is nearing its
>>                             > DWPD/TBW limit and has not failed yet)?
>>                             >
>>                             > It hosts DB partitions for 5 OSDs.
>>                             >
>>                             > Maybe something like:
>>                             >
>>                             > 1) ceph osd reweight the 5 OSDs to 0
>>                             > 2) let backfilling complete
>>                             > 3) destroy/remove the 5 OSDs
>>                             > 4) replace the SSD
>>                             > 5) create 5 new OSDs with separate DB
>>                             > partitions on the new SSD
>>                             >
>>                             > When these 5 OSDs are big HDDs (8TB) a LOT
>>                             > of data has to be moved, so I thought maybe
>>                             > the following would work:
>>                             >
>>                             > 1) ceph osd set noout
>>                             > 2) stop the 5 OSDs (systemctl stop)
>>                             > 3) 'dd' the old SSD to a new SSD of the same
>>                             > or bigger size
>>                             > 4) remove the old SSD
>>                             > 5) start the 5 OSDs (systemctl start)
>>                             > 6) let backfilling/recovery complete (only
>>                             > the delta of data between OSD stop and now)
>>                             > 7) ceph osd unset noout
>>                             >
>>                             > Would this be a viable method to replace a
>>                             > DB SSD? Is there any udev/serial nr/uuid
>>                             > stuff preventing this from working?
>>                             >
>>                             > Or is there another 'less hacky' way to
>>                             > replace a DB SSD without moving too
>>                             > much data?
>>                             >
>>                             > Kind regards,
>>                             > Caspar
>>                             >
>>
>>
>>                             --
>>                             Modern, affordable, Swiss Virtual Machines.
>>                             Visit www.datacenterlight.ch
>>
>>
>>
>>
>>
>
>
> --
> _________________________________________
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Phone: +43 512 9003 71402
> Fax: +43 512 9003 73100
> Email: dietmar.rieder@xxxxxxxxxxx
> Web:   http://www.icbi.at
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


