Re: automated bluestore conversion

Brett Niver <bniver@xxxxxxxxxx> · Mon, 16 Jul 2018 09:05:57 -0400

Got it.  And yes, explained that way, I wasn't really thinking about
orchestration management, but that makes sense.

On Mon, Jul 16, 2018 at 9:00 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Mon, 16 Jul 2018, Brett Niver wrote:
>> I would ask if either ceph-mgr or ceph-volume is the correct place.
>> To me it seems like a "run to finish, sequential automation" type of
>> process, which might better be implemented in an ansible playbook
>> utilizing ceph-volume?
>
> The problem is that this process takes weeks or months, and users will
> realistically need to pause/resume, perhaps change strategy or abort,
> resume again a few weeks later, etc.  I don't think that having users
> leave a terminal open somewhere running a script is a good choice.
>
> The upside is that I think the orchestrator mgr layer we're building
> provides the right set of tools to build this pretty easily.  Doing it
> there means it can work equally well (with an identical user experience)
> with ansible, rook, deapsea, whatever.
>
> sage
>
>
>>
>>
>> On Mon, Jul 16, 2018 at 8:11 AM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>> > On Mon, Jul 9, 2018 at 11:12 AM, Theofilos Mouratidis
>> > <mtheofilos@xxxxxxxxx> wrote:
>> >> Hello,
>> >>
>> >> Here at CERN we created some scripts to convert
>> >> single hosts from filestore to bluestore with or without
>> >> journals. (I run it as we speak) It might be worth a look.
>> >> The one with journals is here: https://pastebin.com/raw/0mCQHuAR
>> >> For now it requires every osd to be filestore and each
>> >> ssd to have the same amount of osds.
>> >> The osd ids are preserved to avoid data rebalance.
>> >>
>> >> First it checks for the requires packages.
>> >> Then it creates on /tmp a plan file to execute
>> >> From the plan it counts different parameters
>> >> such as ssd numbers hdd numbers, partition
>> >> sizes etc. It follows the official guide you gave
>> >> for coverting a host. In the end after the osds
>> >> are drained, they are converted to bluestore
>> >> with the journal now as the block.db and they
>> >> are marked in to get the backfilled data back.
>> >> The job is done per set of X osds that have
>> >> the same journal device.
>> >>
>> >> Cheers,
>> >> Theo
>> >>
>> >> On 9 July 2018 at 15:24, John Spray <jspray@xxxxxxxxxx> wrote:
>> >>> On Fri, Jul 6, 2018 at 7:05 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >>>>
>> >>>> https://pad.ceph.com/p/bluestore_converter
>> >>>>
>> >>>> I sketched out a mgr module that automates the conversion of OSDs
>> >>>> from filestore to bluestore.  It basically has two modes (by osd and by
>> >>>> host), mapping to the two variations documented in the docs.  The main
>> >>>> difference is that it would do groups of OSDs that share devices, so if
>> >>>> you have a 5:1 HDD:SSD ratio it would do 5 OSDs and 6 devices at a time so
>> >>>> that the devices can be fully wiped (and we can move from GPT to LVM).
>> >>>>
>> >>>> There is a big dependency on the new mgr orchestrator layer.  John, does
>> >>>> this line up with what you're designing?
>> >>>
>> >>> Yes -- particularly the need to tag explicit OSD IDs onto the
>> >>> definition of a drive group is something that came up thinking about
>> >>> how drive replacement will work in general.
>> >>>
>> >>> The set of transformations we can do on these groups for OSD
>> >>> replacement is the next (last?) big question to answer about what
>> >>> ceph-volume's interface should look like.  Right now the cases I have
>> >>> are:
>> >>>  - Normal creation: just a list of devices
>> >>>  - Migration creation: a list of devices and a list of OSD IDs
>> >>>  - In-place (drive name of replacement is same as original)
>> >>> replacement: a list of devices and the name of the device to replace,
>> >>> preserving its OSD ID.
>> >>>  - General replacement (drive name of replacement is different): a
>> >>> list of devices which includes a new device, and the OSD ID that
>> >>> should be applied to the new device.
>> >>>  - (Maybe) HDD addition, where during initial creation a number of
>> >>> "blanks" had been specified to reserve space on SSDs, and we can
>> >>> consume these with new HDD members of the group.
>> >
>> > Seems like most of the steps for converting can be done by
>> > ceph-volume. Is polling the safe-to-destroy the reason for placing
>> > this in the mgr vs
>> > delegating the functionality to ceph-volume?
>> >
>> > From http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/?highlight=bluestore#mark-out-and-replace
>> > these are the ones that
>> > ceph-volume can handle today with internal APIs:
>> >
>> > * identify is an OSD is bluestore/filestore
>> > * identify what devices make the OSD
>> > * stop/start/status on systemctl units
>> > * find the current mount point of an OSD, and see if devices are
>> > currently mounted at target
>> > * mount and unmount
>> >
>> > The guide doesn't explain anything about encryption, which as complex
>> > as that is today, it might be useful not to try to do this on more
>> > than one place
>> >
>> >
>> >>>
>> >>> This is a longer list than I'd like, but I don't see a way to make it
>> >>> shorter (with the exception of dropping the ability to grow groups).
>> >>>
>> >>> I've written a document to try and formalize this stuff a bit:
>> >>> https://docs.google.com/document/d/1iwTnQc8d9W3BpQHgGYTMZSKvN6J7s0z8kQaYNxYvLho
>> >>> (google docs may prompt you to ask for access)
>> >>>
>> >>> Just updating the orchestrator python code to reflect that doc now.
>> >>>
>> >>> John
>> >>>
>> >>>> Also it would need/like the ability to pass a list of OSD IDs to reuse to
>> >>>> the new batch prepare function you're building...
>> >>>
>> >>>
>> >>>
>> >>>> Thoughts?
>> >>>> sage
>> >>>>
>> >>> --
>> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> >>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html