Thanks for making this clear.

Dietmar

On 02/27/2018 05:29 PM, Alfredo Deza wrote:
> On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
> <dietmar.rieder@xxxxxxxxxxx> wrote:
>> ... however, it would be nice if ceph-volume would also create the
>> partitions for the WAL and/or DB if needed. Is there a special reason
>> why this is not implemented?
>
> Yes, the reason is that this was one of the most painful points in
> ceph-disk (code- and maintenance-wise): being in the business of
> understanding partitions, sizes, requirements, and devices is
> non-trivial.
>
> One of the reasons ceph-disk did this was that it required quite a
> hefty amount of "special sauce" on the partitions so that they would
> be discovered later by mechanisms that included udev.
>
> If an admin wants more flexibility, we decided that it has to be up to
> the configuration management system (or whatever deployment mechanism)
> to provide it. For users who want a simple approach (in the case of
> bluestore) we have a 1:1 mapping of device -> logical volume -> OSD.
>
> On the ceph-volume side, implementing partitions would also have meant
> similar support for logical volumes, which have far more variations
> than we were willing to attempt to support.
>
> Even a small subset would inevitably bring up the question of "why is
> setup X not supported by ceph-volume if setup Y is?"
>
> Configuration management systems are better suited for handling these
> situations, and we would prefer to offload that responsibility to
> those systems.
>
>> Dietmar
>>
>> On 02/27/2018 04:25 PM, David Turner wrote:
>>> Gotcha. As a side note, that setting is only used by ceph-disk, as
>>> ceph-volume does not create partitions for the WAL or DB. You need
>>> to create those partitions manually if you are using anything other
>>> than a whole block device when creating OSDs with ceph-volume.
>>>
>>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit
>>> <casparsmit@xxxxxxxxxxx> wrote:
>>>
>>> David,
>>>
>>> Yes, I know; I use 20GB partitions for 2TB disks as journal. It was
>>> just to inform other people that Ceph's default of 1GB is pretty
>>> low. Now that I read my own sentence it indeed looks as if I was
>>> using 1GB partitions, sorry for the confusion.
>>>
>>> Caspar
>>>
>>> 2018-02-27 14:11 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
>>>
>>> If you're only using a 1GB DB partition, there is a very real
>>> possibility it's already 100% full. The safe estimate for DB size
>>> seems to be 10GB per 1TB, so for a 4TB OSD a 40GB DB should work for
>>> most use cases (except loads and loads of small files). There are a
>>> few threads that mention how to check how much of your DB partition
>>> is in use. Once it's full, it spills over to the HDD.
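For reference, a minimal sketch of one way to check how much of a
BlueStore DB device is actually in use, via the OSD's admin socket. The
counter names are as remembered from Luminous-era "bluefs" perf output
and may differ between releases; osd.0 is just a placeholder:

    # On the node hosting the OSD, dump its perf counters and look at
    # the "bluefs" section.
    ceph daemon osd.0 perf dump | python -m json.tool | grep -A 16 '"bluefs"'

    # db_total_bytes / db_used_bytes : size and usage of the block.db device
    # slow_used_bytes                : bytes that have already spilled over
    #                                  onto the slow (HDD) device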
>>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>>> <casparsmit@xxxxxxxxxxx> wrote:
>>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>>> <casparsmit@xxxxxxxxxxx> wrote:
>>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner <drakonstein@xxxxxxxxx>:
>>>
>>> [David Turner:]
>>> Caspar, it looks like your idea should work. The worst-case scenario
>>> seems to be that the OSDs wouldn't start; you'd put the old SSD back
>>> in and go back to the idea of weighting them to 0, backfilling, then
>>> recreating the OSDs. Definitely worth a try in my opinion, and I'd
>>> love to hear about your experience afterwards.
>>>
>>> [Caspar Smit, 2018-02-26:]
>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this ML; you're
>>> really putting a lot of effort into answering the many questions
>>> asked here, and very often they contain invaluable information.
>>>
>>> To follow up on this post, I went out and built a very small
>>> (Proxmox) cluster (3 OSDs per host) to test my suggestion of cloning
>>> the DB/WAL SSD. And it worked!
>>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based
>>> OSDs).
>>>
>>> Here's what I did on one node:
>>>
>>> 1) ceph osd set noout
>>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>>> 3) ddrescue -f -n -vv <old SSD dev> <new SSD dev> /root/clone-db.log
>>> 4) removed the old SSD physically from the node
>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSDs
>>>    up/in
>>> 6) ceph osd unset noout
>>>
>>> I assume that once the ddrescue step is finished a 'partprobe' or
>>> something similar is triggered, and udev finds the DB partitions on
>>> the new SSD and starts the OSDs again (kind of what happens during
>>> hotplug). So it is probably better to clone the SSD in another
>>> (non-Ceph) system to avoid triggering any udev events.
>>>
>>> I also tested a reboot after this and everything still worked.
>>>
>>> The old SSD was 120GB and the new one is 256GB (cloning took around
>>> 4 minutes). The delta of data was very low because it was a test
>>> cluster.
>>>
>>> All in all, the OSDs in question were 'down' for only 5 minutes, so
>>> I stayed within the default 10-minute mon_osd_down_out_interval and
>>> didn't actually need to set noout :)
>>>
>>> [Gregory Farnum, 2018-02-26:]
>>> I kicked off a brief discussion about this with some of the
>>> BlueStore guys and they're aware of the problem with migrating
>>> across SSDs, but so far it's just a Trello card:
>>> https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>> They do confirm you should be okay with dd'ing things across,
>>> assuming the symlinks get set up correctly as David noted.
>>>
>>> [Caspar Smit, 2018-02-27:]
>>> Great that it is on the radar to be addressed. This method feels
>>> hacky.
>>>
>>> [Gregory Farnum:]
>>> I've got some other bad news, though: BlueStore has internal
>>> metadata about the size of the block device it's using, so if you
>>> copy it onto a larger block device, it will not actually make use of
>>> the additional space. :(
>>> -Greg
>>>
>>> [Caspar Smit:]
>>> Yes, I was well aware of that, no problem. The reason is that the
>>> smaller SSD sizes are simply not being made anymore or have been
>>> discontinued by the manufacturer.
>>> It would be nice, though, if the DB could be resized in the future;
>>> the default 1GB DB size seems very small to me.
>>>
>>> Kind regards,
>>> Caspar
>>>
>>> [David Turner, continuing his 2018-02-24 reply:]
>>> Nico, it is not possible to change the WAL or DB size, location,
>>> etc. after OSD creation. If you want to change the configuration of
>>> the OSD after creation, you have to remove it from the cluster and
>>> recreate it. There is no functionality similar to the way you could
>>> move, recreate, etc. filestore OSD journals. I think this might be
>>> on the radar as a feature, but I don't know for certain. I
>>> definitely consider it to be a regression of bluestore.
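Recreating an OSD with its DB on a new device therefore means first
creating the DB volume yourself and then handing it to ceph-volume, as
Alfredo describes above. A rough sketch of that workflow on Luminous;
the device names, volume group/LV names, and the 40GB size are
placeholders, not a tested recipe:

    # Carve a DB logical volume out of the SSD by hand --
    # ceph-volume will not create it for you.
    vgcreate ceph-db /dev/nvme0n1
    lvcreate -L 40G -n db-osd0 ceph-db

    # Turn the spinning data disk into a logical volume as well.
    vgcreate ceph-data-0 /dev/sdb
    lvcreate -l 100%FREE -n data ceph-data-0

    # Create the OSD, pointing block.db at the separate DB volume.
    ceph-volume lvm create --bluestore --data ceph-data-0/data \
        --block.db ceph-db/db-osd0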
>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius
>>> <nico.schottelius@xxxxxxxxxxx> wrote:
>>>
>>> A very interesting question, and I would add the follow-up question:
>>>
>>> Is there an easy way to add an external DB/WAL device to an existing
>>> OSD?
>>>
>>> I suspect that it might be something along the lines of:
>>>
>>> - stop the osd
>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>> - (maybe run some kind of osd mkfs?)
>>> - start the osd
>>>
>>> Has anyone done this so far, or does anyone have recommendations on
>>> how to do it?
>>>
>>> Which also makes me wonder: what is actually the format of the WAL
>>> and BlockDB in bluestore? Is there any documentation available about
>>> it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>> Caspar Smit <casparsmit@xxxxxxxxxxx> writes:
>>>
>>> > Hi All,
>>> >
>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>> > (when it is nearing its DWPD/TBW limit and has not failed yet)?
>>> >
>>> > It hosts DB partitions for 5 OSDs.
>>> >
>>> > Maybe something like:
>>> >
>>> > 1) ceph osd reweight 0 the 5 OSDs
>>> > 2) let backfilling complete
>>> > 3) destroy/remove the 5 OSDs
>>> > 4) replace the SSD
>>> > 5) create 5 new OSDs with separate DB partitions on the new SSD
>>> >
>>> > When these 5 OSDs are big HDDs (8TB), a LOT of data has to be
>>> > moved, so I thought maybe the following would work:
>>> >
>>> > 1) ceph osd set noout
>>> > 2) stop the 5 OSDs (systemctl stop)
>>> > 3) 'dd' the old SSD to a new SSD of the same or bigger size
>>> > 4) remove the old SSD
>>> > 5) start the 5 OSDs (systemctl start)
>>> > 6) let backfilling/recovery complete (only the delta of data
>>> >    between OSD stop and now)
>>> > 7) ceph osd unset noout
>>> >
>>> > Would this be a viable method to replace a DB SSD? Is there any
>>> > udev/serial number/uuid stuff preventing this from working?
>>> >
>>> > Or is there another 'less hacky' way to replace a DB SSD without
>>> > moving too much data?
>>> >
>>> > Kind regards,
>>> > Caspar
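A plausible reason the clone "just worked" in the experiment above:
ceph-disk points each OSD's block.db symlink at a
/dev/disk/by-partuuid/ path, and a raw clone such as dd or ddrescue
copies the GPT partition GUIDs along with the data, so those symlinks
resolve on the new SSD as soon as the kernel has re-read its partition
table. A minimal sketch for checking this after cloning; the OSD id and
device name are placeholders:

    # Make sure the kernel has seen the partitions on the new SSD.
    partprobe /dev/sdX

    # The block.db symlink of a ceph-disk based OSD should still
    # resolve to a real device.
    ls -l /var/lib/ceph/osd/ceph-0/block.db
    readlink -f /var/lib/ceph/osd/ceph-0/block.db

    # Optionally inspect the BlueStore label on the cloned DB partition.
    ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db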