Thanks for making this clear.

Dietmar

On 02/27/2018 05:29 PM, Alfredo Deza wrote:
> On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
> <dietmar.rieder@xxxxxxxxxxx> wrote:
>> ... however, it would be nice if ceph-volume would also create the
>> partitions for the WAL and/or DB if needed. Is there a special reason
>> why this is not implemented?
>
> Yes, the reason is that this was one of the most painful points in
> ceph-disk (code- and maintenance-wise): being in the business of
> understanding partitions, sizes, requirements, and devices is
> non-trivial.
>
> One of the reasons ceph-disk did this was that it required quite a
> hefty amount of "special sauce" on the partitions so that they would
> be discovered later by mechanisms that included udev.
>
> If an admin wants more flexibility, we decided that it has to be up to
> the configuration management system (or whatever deployment mechanism)
> to provide it. For users who want a simple approach (in the case of
> bluestore) we have a 1:1 mapping of device -> logical volume -> OSD.
>
> On the ceph-volume side, implementing partitions would also have meant
> similar support for logical volumes, which have far more variations
> than we were willing to attempt to support.
>
> Even a small subset would inevitably bring up the question of "why is
> setup X not supported by ceph-volume if setup Y is?"
>
> Configuration management systems are better suited for handling these
> situations, and we would prefer to offload that responsibility to
> those systems.
>
>> Dietmar
>>
>> On 02/27/2018 04:25 PM, David Turner wrote:
>>> Gotcha. As a side note, that setting is only used by ceph-disk, as
>>> ceph-volume does not create partitions for the WAL or DB. You need
>>> to create those partitions manually if you are using anything other
>>> than a whole block device when creating OSDs with ceph-volume.
>>>
>>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit
>>> <casparsmit@xxxxxxxxxxx> wrote:
>>>
>>> David,
>>>
>>> Yes, I know; I use 20GB partitions for 2TB disks as journal. It was
>>> just to inform other people that Ceph's default of 1GB is pretty
>>> low. Now that I read my own sentence it indeed looks as if I was
>>> using 1GB partitions, sorry for the confusion.
>>>
>>> Caspar
>>>
>>> 2018-02-27 14:11 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
>>>
>>> If you're only using a 1GB DB partition, there is a very real
>>> possibility it's already 100% full. The safe estimate for DB size
>>> seems to be 10GB per 1TB, so for a 4TB OSD a 40GB DB should work for
>>> most use cases (except loads and loads of small files). There are a
>>> few threads that mention how to check how much of your DB partition
>>> is in use. Once it's full, it spills over to the HDD.
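For reference, a minimal sketch of one way to check how much of a
BlueStore DB device is actually in use, via the OSD's admin socket. The
counter names are as remembered from Luminous-era "bluefs" perf output
and may differ between releases; osd.0 is just a placeholder:

    # On the node hosting the OSD, dump its perf counters and look at
    # the "bluefs" section.
    ceph daemon osd.0 perf dump | python -m json.tool | grep -A 16 '"bluefs"'

    # db_total_bytes / db_used_bytes : size and usage of the block.db device
    # slow_used_bytes                : bytes that have already spilled over
    #                                  onto the slow (HDD) device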
>>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>>> <casparsmit@xxxxxxxxxxx> wrote:
>>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>>> <casparsmit@xxxxxxxxxxx> wrote:
>>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
>>>
>>> [David Turner:]
>>> Caspar, it looks like your idea should work. The worst-case scenario
>>> seems to be that the OSDs wouldn't start; you'd put the old SSD back
>>> in and go back to the idea of weighting them to 0, backfilling, then
>>> recreating the OSDs. Definitely worth a try in my opinion, and I'd
>>> love to hear about your experience afterwards.
>>>
>>> [Caspar Smit, 2018-02-26:]
>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this ML; you're
>>> really putting a lot of effort into answering the many questions
>>> asked here, and very often they contain invaluable information.
>>>
>>> To follow up on this post, I went out and built a very small
>>> (Proxmox) cluster (3 OSDs per host) to test my suggestion of cloning
>>> the DB/WAL SSD. And it worked!
>>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based
>>> OSDs).
>>>
>>> Here's what I did on one node:
>>>
>>> 1) ceph osd set noout
>>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>>> 3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
>>> 4) removed the old SSD physically from the node
>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs
>>>    up/in
>>> 6) ceph osd unset noout
>>>
>>> I assume that once the ddrescue step is finished a 'partprobe' or
>>> something similar is triggered, and udev finds the DB partitions on
>>> the new SSD and starts the OSDs again (kind of what happens during
>>> hotplug). So it is probably better to clone the SSD in another
>>> (non-Ceph) system to avoid triggering any udev events.
>>>
>>> I also tested a reboot after this and everything still worked.
>>>
>>> The old SSD was 120GB and the new one is 256GB (cloning took around
>>> 4 minutes). The delta of data was very low because it was a test
>>> cluster.
>>>
>>> All in all, the OSDs in question were 'down' for only 5 minutes, so
>>> I stayed within the default 10-minute mon_osd_down_out_interval and
>>> didn't actually need to set noout :)
>>>
>>> [Gregory Farnum, 2018-02-26:]
>>> I kicked off a brief discussion about this with some of the
>>> BlueStore guys and they're aware of the problem with migrating
>>> across SSDs, but so far it's just a Trello card:
>>> https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>> They do confirm you should be okay with dd'ing things across,
>>> assuming the symlinks get set up correctly as David noted.
>>>
>>> [Caspar Smit, 2018-02-27:]
>>> Great that it is on the radar to be addressed. This method feels
>>> hacky.
>>>
>>> [Gregory Farnum:]
>>> I've got some other bad news, though: BlueStore has internal
>>> metadata about the size of the block device it's using, so if you
>>> copy it onto a larger block device, it will not actually make use of
>>> the additional space. :(
>>> -Greg
>>>
>>> [Caspar Smit:]
>>> Yes, I was well aware of that, no problem. The reason is that the
>>> smaller SSD sizes are simply not being made anymore or have been
>>> discontinued by the manufacturer.
>>> It would be nice, though, if the DB could be resized in the future;
>>> the default 1GB DB size seems very small to me.
>>>
>>> Kind regards,
>>> Caspar
>>>
>>> [David Turner, continuing his 2018-02-24 reply:]
>>> Nico, it is not possible to change the WAL or DB size, location,
>>> etc. after OSD creation. If you want to change the configuration of
>>> the OSD after creation, you have to remove it from the cluster and
>>> recreate it. There is no functionality similar to the way you could
>>> move, recreate, etc. filestore OSD journals. I think this might be
>>> on the radar as a feature, but I don't know for certain. I
>>> definitely consider it to be a regression of bluestore.
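Recreating an OSD with its DB on a new device therefore means first
creating the DB volume yourself and then handing it to ceph-volume, as
Alfredo describes above. A rough sketch of that workflow on Luminous;
the device names, volume group/LV names, and the 40GB size are
placeholders, not a tested recipe:

    # Carve a DB logical volume out of the SSD by hand --
    # ceph-volume will not create it for you.
    vgcreate ceph-db /dev/nvme0n1
    lvcreate -L 40G -n db-osd0 ceph-db

    # Turn the spinning data disk into a logical volume as well.
    vgcreate ceph-data-0 /dev/sdb
    lvcreate -l 100%FREE -n data ceph-data-0

    # Create the OSD, pointing block.db at the separate DB volume.
    ceph-volume lvm create --bluestore --data ceph-data-0/data \
        --block.db ceph-db/db-osd0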
>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius
>>> <nico.schottelius@xxxxxxxxxxx> wrote:
>>>
>>> A very interesting question, and I would add the follow-up question:
>>>
>>> Is there an easy way to add an external DB/WAL device to an existing
>>> OSD?
>>>
>>> I suspect that it might be something along the lines of:
>>>
>>> - stop the osd
>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>> - (maybe run some kind of osd mkfs?)
>>> - start the osd
>>>
>>> Has anyone done this so far, or does anyone have recommendations on
>>> how to do it?
>>>
>>> Which also makes me wonder: what is actually the format of the WAL
>>> and BlockDB in bluestore? Is there any documentation available about
>>> it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>> Caspar Smit <casparsmit@xxxxxxxxxxx> writes:
>>>
>>> > Hi All,
>>> >
>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>> > (when it is nearing its DWPD/TBW limit and has not failed yet)?
>>> >
>>> > It hosts DB partitions for 5 OSDs.
>>> >
>>> > Maybe something like:
>>> >
>>> > 1) ceph osd reweight 0 the 5 OSDs
>>> > 2) let backfilling complete
>>> > 3) destroy/remove the 5 OSDs
>>> > 4) replace the SSD
>>> > 5) create 5 new OSDs with separate DB partitions on the new SSD
>>> >
>>> > When these 5 OSDs are big HDDs (8TB), a LOT of data has to be
>>> > moved, so I thought maybe the following would work:
>>> >
>>> > 1) ceph osd set noout
>>> > 2) stop the 5 OSDs (systemctl stop)
>>> > 3) 'dd' the old SSD to a new SSD of the same or bigger size
>>> > 4) remove the old SSD
>>> > 5) start the 5 OSDs (systemctl start)
>>> > 6) let backfilling/recovery complete (only the delta of data
>>> >    between OSD stop and now)
>>> > 7) ceph osd unset noout
>>> >
>>> > Would this be a viable method to replace a DB SSD? Is there any
>>> > udev/serial number/uuid stuff preventing this from working?
>>> >
>>> > Or is there another 'less hacky' way to replace a DB SSD without
>>> > moving too much data?
>>> >
>>> > Kind regards,
>>> > Caspar
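A plausible reason the clone "just worked" in the experiment above:
ceph-disk points each OSD's block.db symlink at a
/dev/disk/by-partuuid/ path, and a raw clone such as dd or ddrescue
copies the GPT partition GUIDs along with the data, so those symlinks
resolve on the new SSD as soon as the kernel has re-read its partition
table. A minimal sketch for checking this after cloning; the OSD id and
device name are placeholders:

    # Make sure the kernel has seen the partitions on the new SSD.
    partprobe /dev/sdX

    # The block.db symlink of a ceph-disk based OSD should still
    # resolve to a real device.
    ls -l /var/lib/ceph/osd/ceph-0/block.db
    readlink -f /var/lib/ceph/osd/ceph-0/block.db

    # Optionally inspect the BlueStore label on the cloned DB partition.
    ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db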