Re: automated bluestore conversion

I would ask whether either ceph-mgr or ceph-volume is the correct place.
To me this seems like a "run to finish, sequential automation" type of
process, which might be better implemented as an ansible playbook
utilizing ceph-volume?
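
For what it's worth, a rough sketch of what that "run to finish"
sequence looks like for one OSD, just shelling out to the ceph CLI and
ceph-volume (the device paths, the block.db partition and the missing
error/backfill handling are placeholders, not a finished tool):

    import subprocess
    import time

    def run(*cmd):
        # Run a command and fail loudly on a non-zero exit status.
        subprocess.run(cmd, check=True)

    def convert_osd(osd_id, data_dev, db_part):
        # Drain the OSD and wait until the cluster says it is safe to destroy.
        run("ceph", "osd", "out", str(osd_id))
        while subprocess.run(("ceph", "osd", "safe-to-destroy",
                              "osd.%d" % osd_id)).returncode != 0:
            time.sleep(60)
        # Stop the daemon, destroy the OSD (keeping its ID), wipe the device.
        run("systemctl", "stop", "ceph-osd@%d" % osd_id)
        run("ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it")
        run("ceph-volume", "lvm", "zap", data_dev)
        # Recreate as bluestore, reusing the same ID so no data moves in CRUSH.
        run("ceph-volume", "lvm", "create", "--bluestore",
            "--data", data_dev, "--block.db", db_part,
            "--osd-id", str(osd_id))
        run("ceph", "osd", "in", str(osd_id))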


On Mon, Jul 16, 2018 at 8:11 AM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> On Mon, Jul 9, 2018 at 11:12 AM, Theofilos Mouratidis
> <mtheofilos@xxxxxxxxx> wrote:
>> Hello,
>>
>> Here at CERN we created some scripts to convert
>> single hosts from filestore to bluestore, with or without
>> journals (I'm running it as we speak). It might be worth a look.
>> The one with journals is here: https://pastebin.com/raw/0mCQHuAR
>> For now it requires every OSD to be filestore and each
>> SSD to have the same number of OSDs.
>> The OSD IDs are preserved to avoid data rebalancing.
>>
>> First it checks for the required packages.
>> Then it creates a plan file in /tmp to execute.
>> From the plan it derives various parameters,
>> such as the number of SSDs and HDDs, partition
>> sizes, etc. It follows the official guide you gave
>> for converting a host. In the end, after the OSDs
>> are drained, they are converted to bluestore
>> with the journal now serving as the block.db, and
>> they are marked in to get the backfilled data back.
>> The job is done per set of X OSDs that share
>> the same journal device.
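>>
>> Roughly, the planning step boils down to something like this (a
>> simplified sketch rather than the actual script; it assumes you have
>> already mapped each filestore OSD to its data disk and journal SSD):
>>
>>     from collections import defaultdict
>>
>>     def device_size(dev):
>>         # Size in bytes, from the 512-byte sector count sysfs exposes.
>>         name = dev.split("/")[-1]
>>         with open("/sys/class/block/%s/size" % name) as f:
>>             return int(f.read()) * 512
>>
>>     def build_plan(osds):
>>         # osds: [{"id": 12, "data": "/dev/sdc", "journal_ssd": "/dev/sdb"}, ...]
>>         # Group OSDs by the SSD holding their journals, so each group can
>>         # be drained, wiped and recreated together.
>>         groups = defaultdict(list)
>>         for osd in osds:
>>             groups[osd["journal_ssd"]].append(osd)
>>         plan = []
>>         for ssd, members in groups.items():
>>             # Split each SSD evenly into one block.db partition per OSD.
>>             plan.append({"ssd": ssd,
>>                          "db_size": device_size(ssd) // len(members),
>>                          "osds": members})
>>         return plan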
>>
>> Cheers,
>> Theo
>>
>> On 9 July 2018 at 15:24, John Spray <jspray@xxxxxxxxxx> wrote:
>>> On Fri, Jul 6, 2018 at 7:05 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>
>>>> https://pad.ceph.com/p/bluestore_converter
>>>>
>>>> I sketched out a mgr module that automates the conversion of OSDs
>>>> from filestore to bluestore.  It basically has two modes (by osd and by
>>>> host), mapping to the two variations documented in the docs.  The main
>>>> difference is that it would do groups of OSDs that share devices, so if
>>>> you have a 5:1 HDD:SSD ratio it would do 5 OSDs and 6 devices at a time so
>>>> that the devices can be fully wiped (and we can move from GPT to LVM).
>>>>
>>>> There is a big dependency on the new mgr orchestrator layer.  John, does
>>>> this line up with what you're designing?
>>>
>>> Yes -- particularly the need to tag explicit OSD IDs onto the
>>> definition of a drive group is something that came up when thinking
>>> about how drive replacement will work in general.
>>>
>>> The set of transformations we can do on these groups for OSD
>>> replacement is the next (last?) big question to answer about what
>>> ceph-volume's interface should look like.  Right now the cases I have
>>> are:
>>>  - Normal creation: just a list of devices
>>>  - Migration creation: a list of devices and a list of OSD IDs
>>>  - In-place (drive name of replacement is same as original)
>>> replacement: a list of devices and the name of the device to replace,
>>> preserving its OSD ID.
>>>  - General replacement (drive name of replacement is different): a
>>> list of devices which includes a new device, and the OSD ID that
>>> should be applied to the new device.
>>>  - (Maybe) HDD addition, where during initial creation a number of
>>> "blanks" had been specified to reserve space on SSDs, and we can
>>> consume these with new HDD members of the group.
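>>>
>>> To make that concrete, the cases above roughly reduce to this shape --
>>> only a sketch of the data each one needs, not a proposal for the
>>> actual ceph-volume interface (all names made up):
>>>
>>>     class DriveGroupSpec(object):
>>>         # One group of devices that is created and consumed as a unit.
>>>         def __init__(self, devices, blanks=0):
>>>             self.devices = devices  # e.g. HDDs plus their shared SSD
>>>             self.blanks = blanks    # SSD space reserved for later HDDs
>>>
>>>     def create(spec):
>>>         """Normal creation: just the device list."""
>>>
>>>     def migrate(spec, osd_ids):
>>>         """Migration creation: same devices, reuse the given OSD IDs."""
>>>
>>>     def replace_in_place(spec, device):
>>>         """Replacement drive has the same name; keep its OSD ID."""
>>>
>>>     def replace(spec, new_device, osd_id):
>>>         """Replacement drive has a new name; apply osd_id to it."""
>>>
>>>     def grow(spec, new_hdd):
>>>         """Consume one reserved blank to add a new HDD to the group."""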
>
> Seems like most of the steps for converting can be done by
> ceph-volume. Is polling the safe-to-destroy check the reason for
> placing this in the mgr rather than delegating the functionality to
> ceph-volume?
>
> From http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/?highlight=bluestore#mark-out-and-replace
> these are the ones that
> ceph-volume can handle today with internal APIs:
>
> * identify if an OSD is bluestore/filestore
> * identify what devices make the OSD
> * stop/start/status on systemctl units
> * find the current mount point of an OSD, and see if devices are
> currently mounted at target
> * mount and unmount
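>
> In CLI terms that boils down to something like the following (a
> sketch; the mount path and the shape of the JSON output are the usual
> defaults, not guaranteed):
>
>     import json
>     import subprocess
>
>     def probe(osd_id):
>         # Objectstore type (filestore vs bluestore) from the cluster.
>         meta = json.loads(subprocess.check_output(
>             ["ceph", "osd", "metadata", str(osd_id)]))
>         # Devices backing the OSDs on this host, as ceph-volume sees them.
>         lvm = json.loads(subprocess.check_output(
>             ["ceph-volume", "lvm", "list", "--format", "json"]))
>         # systemd unit state and whether the data dir is mounted.
>         active = subprocess.call(["systemctl", "is-active", "--quiet",
>                                   "ceph-osd@%d" % osd_id]) == 0
>         mounted = subprocess.call(
>             ["findmnt", "/var/lib/ceph/osd/ceph-%d" % osd_id],
>             stdout=subprocess.DEVNULL) == 0
>         return (meta.get("osd_objectstore"), lvm.get(str(osd_id)),
>                 active, mounted)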
>
> The guide doesn't explain anything about encryption, which, as complex
> as it is today, is probably something we should avoid trying to handle
> in more than one place.
>
>
>>>
>>> This is a longer list than I'd like, but I don't see a way to make it
>>> shorter (with the exception of dropping the ability to grow groups).
>>>
>>> I've written a document to try and formalize this stuff a bit:
>>> https://docs.google.com/document/d/1iwTnQc8d9W3BpQHgGYTMZSKvN6J7s0z8kQaYNxYvLho
>>> (google docs may prompt you to ask for access)
>>>
>>> Just updating the orchestrator python code to reflect that doc now.
>>>
>>> John
>>>
>>>> Also it would need/like the ability to pass a list of OSD IDs to reuse to
>>>> the new batch prepare function you're building...
>>>
>>>
>>>
>>>> Thoughts?
>>>> sage
>>>>