Re: automated bluestore conversion

I would also think that both ansible playbooks and ceph-volume might
be technologies utilized by this orchestration layer, correct?


On Mon, Jul 16, 2018 at 9:05 AM, Brett Niver <bniver@xxxxxxxxxx> wrote:
> Got it.  And yes, explained that way, I wasn't really thinking about
> orchestration management, but that makes sense.
>
>
>
> On Mon, Jul 16, 2018 at 9:00 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> On Mon, 16 Jul 2018, Brett Niver wrote:
>>> I would ask whether either ceph-mgr or ceph-volume is the correct place.
>>> To me it seems like a "run to finish, sequential automation" type of
>>> process, which might better be implemented in an Ansible playbook
>>> utilizing ceph-volume?
>>
>> The problem is that this process takes weeks or months, and users will
>> realistically need to pause/resume, perhaps change strategy or abort,
>> resume again a few weeks later, etc.  I don't think that having users
>> leave a terminal open somewhere running a script is a good choice.
>>
>> The upside is that I think the orchestrator mgr layer we're building
>> provides the right set of tools to build this pretty easily.  Doing it
>> there means it can work equally well (with an identical user experience)
>> with Ansible, Rook, DeepSea, whatever.
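>>
>> To make the pause/resume idea concrete, here is a minimal sketch of the
>> kind of persistent state machine such a module could keep (names and the
>> persistence mechanism are hypothetical, not the final orchestrator API):
>>
>>     # Sketch only: a real mgr module would persist this via the mgr
>>     # key/value store so the plan survives mgr restarts.
>>     import json
>>
>>     class ConversionPlan(object):
>>         def __init__(self, batches):
>>             # batches: lists of OSD ids that share a journal/db device
>>             self.batches = batches
>>             self.current = 0          # index of the batch being converted
>>             self.paused = False
>>
>>         def to_json(self):
>>             return json.dumps(self.__dict__)
>>
>>         @classmethod
>>         def from_json(cls, blob):
>>             plan = cls([])
>>             plan.__dict__.update(json.loads(blob))
>>             return plan
>>
>>         def pause(self):
>>             self.paused = True        # finish the current batch, then stop
>>
>>         def resume(self):
>>             self.paused = False
>>
>>         def abort(self):
>>             # stop scheduling new batches; converted OSDs stay as they are
>>             self.batches = self.batches[:self.current + 1]
>>             self.paused = True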
>>
>> sage
>>
>>
>>>
>>>
>>> On Mon, Jul 16, 2018 at 8:11 AM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>>> > On Mon, Jul 9, 2018 at 11:12 AM, Theofilos Mouratidis
>>> > <mtheofilos@xxxxxxxxx> wrote:
>>> >> Hello,
>>> >>
>>> >> Here at CERN we created some scripts to convert
>>> >> single hosts from filestore to bluestore, with or without
>>> >> journals (I'm running them as we speak); it might be worth a look.
>>> >> The one with journals is here: https://pastebin.com/raw/0mCQHuAR
>>> >> For now it requires every OSD to be filestore and each
>>> >> SSD to have the same number of OSDs.
>>> >> The OSD IDs are preserved to avoid data rebalancing.
>>> >>
>>> >> First it checks for the required packages.
>>> >> Then it creates a plan file on /tmp to execute.
>>> >> From the plan it derives various parameters,
>>> >> such as the number of SSDs and HDDs, partition
>>> >> sizes, etc. It follows the official guide you gave
>>> >> for converting a host. In the end, after the OSDs
>>> >> are drained, they are converted to bluestore
>>> >> with the journal now serving as the block.db, and they
>>> >> are marked in to get the backfilled data back.
>>> >> The job is done per set of X OSDs that share
>>> >> the same journal device.
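>>> >>
>>> >> For reference, the core loop per journal group looks roughly like
>>> >> this (a simplified Python sketch, not the actual script; exact
>>> >> flags can vary between Ceph releases):
>>> >>
>>> >>     # Simplified sketch of the per-journal-group conversion; the real
>>> >>     # script is the bash one linked above and also handles the
>>> >>     # partitioning and error checking.
>>> >>     import subprocess, time
>>> >>
>>> >>     def run(*cmd):
>>> >>         subprocess.check_call(cmd)
>>> >>
>>> >>     def convert_group(osd_ids, data_devs, db_parts):
>>> >>         for osd_id in osd_ids:
>>> >>             run("ceph", "osd", "out", str(osd_id))
>>> >>         # wait until the drained OSDs can be destroyed safely
>>> >>         while subprocess.call(["ceph", "osd", "safe-to-destroy"] +
>>> >>                               [str(i) for i in osd_ids]) != 0:
>>> >>             time.sleep(60)
>>> >>         for osd_id, data_dev, db_part in zip(osd_ids, data_devs, db_parts):
>>> >>             run("systemctl", "stop", "ceph-osd@%d" % osd_id)
>>> >>             run("ceph", "osd", "destroy", str(osd_id),
>>> >>                 "--yes-i-really-mean-it")
>>> >>             run("ceph-volume", "lvm", "zap", data_dev, "--destroy")
>>> >>             # recreate as bluestore, reusing the OSD id; the old journal
>>> >>             # partition becomes the block.db
>>> >>             run("ceph-volume", "lvm", "create", "--bluestore",
>>> >>                 "--data", data_dev, "--block.db", db_part,
>>> >>                 "--osd-id", str(osd_id))
>>> >>         for osd_id in osd_ids:
>>> >>             run("ceph", "osd", "in", str(osd_id))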
>>> >>
>>> >> Cheers,
>>> >> Theo
>>> >>
>>> >> On 9 July 2018 at 15:24, John Spray <jspray@xxxxxxxxxx> wrote:
>>> >>> On Fri, Jul 6, 2018 at 7:05 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> >>>>
>>> >>>> https://pad.ceph.com/p/bluestore_converter
>>> >>>>
>>> >>>> I sketched out a mgr module that automates the conversion of OSDs
>>> >>>> from filestore to bluestore.  It basically has two modes (by osd and by
>>> >>>> host), mapping to the two variations documented in the docs.  The main
>>> >>>> difference is that it would do groups of OSDs that share devices, so if
>>> >>>> you have a 5:1 HDD:SSD ratio it would do 5 OSDs and 6 devices at a time so
>>> >>>> that the devices can be fully wiped (and we can move from GPT to LVM).
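>>> >>>>
>>> >>>> For illustration, the grouping itself could be as simple as the
>>> >>>> sketch below (the osd-to-device mapping and its shape are assumed
>>> >>>> here; in practice it would come from the orchestrator inventory):
>>> >>>>
>>> >>>>     # Sketch: group filestore OSDs by the SSD holding their journal,
>>> >>>>     # so each batch's devices (the HDDs plus the shared SSD) can be
>>> >>>>     # wiped together and moved from GPT to LVM.
>>> >>>>     from collections import defaultdict
>>> >>>>
>>> >>>>     def group_by_shared_ssd(osd_info):
>>> >>>>         # osd_info: {osd_id: {"data": "/dev/sdc",
>>> >>>>         #                     "journal_disk": "/dev/sdb"}, ...}
>>> >>>>         groups = defaultdict(list)
>>> >>>>         for osd_id, info in osd_info.items():
>>> >>>>             groups[info["journal_disk"]].append(osd_id)
>>> >>>>         # with a 5:1 ratio each value holds 5 OSD ids -> 6 devices
>>> >>>>         return dict(groups)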
>>> >>>>
>>> >>>> There is a big dependency on the new mgr orchestrator layer.  John, does
>>> >>>> this line up with what you're designing?
>>> >>>
>>> >>> Yes -- particularly the need to tag explicit OSD IDs onto the
>>> >>> definition of a drive group is something that came up when thinking
>>> >>> about how drive replacement will work in general.
>>> >>>
>>> >>> The set of transformations we can do on these groups for OSD
>>> >>> replacement is the next (last?) big question to answer about what
>>> >>> ceph-volume's interface should look like.  Right now the cases I have
>>> >>> are (a rough sketch follows the list):
>>> >>>  - Normal creation: just a list of devices
>>> >>>  - Migration creation: a list of devices and a list of OSD IDs
>>> >>>  - In-place (drive name of replacement is same as original)
>>> >>> replacement: a list of devices and the name of the device to replace,
>>> >>> preserving its OSD ID.
>>> >>>  - General replacement (drive name of replacement is different): a
>>> >>> list of devices which includes a new device, and the OSD ID that
>>> >>> should be applied to the new device.
>>> >>>  - (Maybe) HDD addition, where during initial creation a number of
>>> >>> "blanks" had been specified to reserve space on SSDs, and we can
>>> >>> consume these with new HDD members of the group.
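>>> >>>
>>> >>> Something like the following could capture those cases (the field
>>> >>> names are placeholders, not necessarily what ends up in the
>>> >>> orchestrator doc):
>>> >>>
>>> >>>     # Placeholder sketch of a drive group spec covering the cases above.
>>> >>>     class DriveGroupSpec(object):
>>> >>>         def __init__(self, devices, osd_ids=None, replace_device=None,
>>> >>>                      replacement_osd_id=None, blanks=0):
>>> >>>             self.devices = devices        # normal creation: just devices
>>> >>>             self.osd_ids = osd_ids        # migration: reuse these OSD ids
>>> >>>             self.replace_device = replace_device    # in-place replacement
>>> >>>             self.replacement_osd_id = replacement_osd_id  # general replacement
>>> >>>             self.blanks = blanks          # SSD space reserved for later HDDs
>>> >>>
>>> >>>     # e.g. migration creation, preserving OSD ids:
>>> >>>     spec = DriveGroupSpec(["/dev/sdb", "/dev/sdc"], osd_ids=[12, 17])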
>>> >
>>> > It seems like most of the steps for converting can be done by
>>> > ceph-volume. Is polling the safe-to-destroy check the reason for
>>> > placing this in the mgr vs delegating the functionality to ceph-volume?
>>> >
>>> > From http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/?highlight=bluestore#mark-out-and-replace
>>> > these are the ones that ceph-volume can handle today with internal
>>> > APIs (a sketch follows the list):
>>> >
>>> > * identify if an OSD is bluestore or filestore
>>> > * identify what devices make up an OSD
>>> > * stop/start/status on systemctl units
>>> > * find the current mount point of an OSD, and see if devices are
>>> > currently mounted at the target
>>> > * mount and unmount
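>>> >
>>> > For example, the first two items can be read from the metadata that
>>> > "ceph-volume lvm list --format json" already reports; a rough sketch
>>> > (shelling out rather than using the internal APIs, and only covering
>>> > lvm-based OSDs; ceph-disk/GPT ones would need "ceph-volume simple"):
>>> >
>>> >     import json, subprocess
>>> >
>>> >     def osd_layout(osd_id):
>>> >         # One JSON entry per LV, keyed by OSD id; each entry carries a
>>> >         # "type" (block/db/wal for bluestore, data/journal for filestore)
>>> >         # and the physical "devices" backing it.
>>> >         out = subprocess.check_output(
>>> >             ["ceph-volume", "lvm", "list", "--format", "json"])
>>> >         lvs = json.loads(out).get(str(osd_id), [])
>>> >         lv_types = set(lv.get("type") for lv in lvs)
>>> >         if "block" in lv_types:
>>> >             objectstore = "bluestore"
>>> >         elif lvs:
>>> >             objectstore = "filestore"
>>> >         else:
>>> >             objectstore = "unknown"
>>> >         devices = [d for lv in lvs for d in lv.get("devices", [])]
>>> >         return objectstore, devices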
>>> >
>>> > The guide doesn't explain anything about encryption; as complex
>>> > as that is today, it might be useful not to try to handle it in
>>> > more than one place.
>>> >
>>> >
>>> >>>
>>> >>> This is a longer list than I'd like, but I don't see a way to make it
>>> >>> shorter (with the exception of dropping the ability to grow groups).
>>> >>>
>>> >>> I've written a document to try and formalize this stuff a bit:
>>> >>> https://docs.google.com/document/d/1iwTnQc8d9W3BpQHgGYTMZSKvN6J7s0z8kQaYNxYvLho
>>> >>> (google docs may prompt you to ask for access)
>>> >>>
>>> >>> Just updating the orchestrator python code to reflect that doc now.
>>> >>>
>>> >>> John
>>> >>>
>>> >>>> Also it would need (or at least want) to be able to pass the new
>>> >>>> batch prepare function you're building a list of OSD IDs to reuse...
>>> >>>
>>> >>>
>>> >>>
>>> >>>> Thoughts?
>>> >>>> sage
>>> >>>>