On Mon, Jul 16, 2018 at 8:32 AM, Brett Niver <bniver@xxxxxxxxxx> wrote:
> I would ask if either ceph-mgr or ceph-volume is the correct place.
> To me it seems like a "run to finish, sequential automation" type of
> process, which might better be implemented in an ansible playbook
> utilizing ceph-volume?

That is an interesting question. Sure, I think the end-to-end process
might be a great fit for an Ansible playbook. The intermediate
portions, though, are so complicated that I was wondering whether
ceph-volume couldn't do more here, since it already knows how.

>
> On Mon, Jul 16, 2018 at 8:11 AM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>> On Mon, Jul 9, 2018 at 11:12 AM, Theofilos Mouratidis
>> <mtheofilos@xxxxxxxxx> wrote:
>>> Hello,
>>>
>>> Here at CERN we created some scripts to convert single hosts from
>>> filestore to bluestore, with or without journals. (I am running it
>>> as we speak.) It might be worth a look.
>>> The one with journals is here: https://pastebin.com/raw/0mCQHuAR
>>> For now it requires every OSD to be filestore and each SSD to have
>>> the same number of OSDs. The OSD IDs are preserved to avoid data
>>> rebalancing.
>>>
>>> First it checks for the required packages. Then it creates a plan
>>> file on /tmp to execute. From the plan it computes various
>>> parameters, such as the number of SSDs and HDDs, partition sizes,
>>> etc. It follows the official guide you gave for converting a host.
>>> In the end, after the OSDs are drained, they are converted to
>>> bluestore with the journal now serving as the block.db, and they
>>> are marked in to get the backfilled data back. The job is done per
>>> set of X OSDs that share the same journal device.
>>>
>>> Cheers,
>>> Theo
>>>
>>> On 9 July 2018 at 15:24, John Spray <jspray@xxxxxxxxxx> wrote:
>>>> On Fri, Jul 6, 2018 at 7:05 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>>
>>>>> https://pad.ceph.com/p/bluestore_converter
>>>>>
>>>>> I sketched out a mgr module that automates the conversion of OSDs
>>>>> from filestore to bluestore. It basically has two modes (by OSD
>>>>> and by host), mapping to the two variations documented in the
>>>>> docs. The main difference is that it would do groups of OSDs that
>>>>> share devices, so if you have a 5:1 HDD:SSD ratio it would do 5
>>>>> OSDs and 6 devices at a time so that the devices can be fully
>>>>> wiped (and we can move from GPT to LVM).
>>>>>
>>>>> There is a big dependency on the new mgr orchestrator layer.
>>>>> John, does this line up with what you're designing?
>>>>
>>>> Yes -- particularly the need to tag explicit OSD IDs onto the
>>>> definition of a drive group is something that came up when
>>>> thinking about how drive replacement will work in general.
>>>>
>>>> The set of transformations we can do on these groups for OSD
>>>> replacement is the next (last?) big question to answer about what
>>>> ceph-volume's interface should look like. Right now the cases I
>>>> have are:
>>>> - Normal creation: just a list of devices.
>>>> - Migration creation: a list of devices and a list of OSD IDs.
>>>> - In-place replacement (drive name of the replacement is the same
>>>>   as the original): a list of devices and the name of the device
>>>>   to replace, preserving its OSD ID.
>>>> - General replacement (drive name of the replacement is
>>>>   different): a list of devices which includes a new device, and
>>>>   the OSD ID that should be applied to the new device.
>>>> - (Maybe) HDD addition, where during initial creation a number of
>>>>   "blanks" had been specified to reserve space on SSDs, and we can
>>>>   consume these with new HDD members of the group.
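
To make sure I'm reading those cases right, here is one purely
hypothetical way they could collapse into a single request shape.
None of these names exist in ceph-volume or the orchestrator work;
they are only meant as an illustration:

# Hypothetical sketch: how the creation/replacement cases above might
# map onto one request object. Names and fields are invented here.
class DriveGroupRequest(object):
    def __init__(self, devices, osd_ids=None, replace_device=None,
                 replace_osd_id=None, blanks=0):
        self.devices = devices                # all member devices of the group
        self.osd_ids = osd_ids                # migration creation: reuse these OSD IDs
        self.replace_device = replace_device  # in-place replacement target
        self.replace_osd_id = replace_osd_id  # general replacement: ID for the new device
        self.blanks = blanks                  # reserved SSD slots for later HDD additions


# Normal creation: just the devices.
DriveGroupRequest(devices=['/dev/sdb', '/dev/sdc', '/dev/nvme0n1'])

# Migration creation: devices plus the OSD IDs to reuse.
DriveGroupRequest(devices=['/dev/sdb', '/dev/sdc'], osd_ids=[12, 13])

# In-place replacement: same device name, preserving its OSD ID.
DriveGroupRequest(devices=['/dev/sdb', '/dev/sdc'], replace_device='/dev/sdc')

# General replacement: a new device name, and the OSD ID it should take over.
DriveGroupRequest(devices=['/dev/sdb', '/dev/sdd'], replace_osd_id=13)
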
>>
>> Seems like most of the steps for converting can be done by
>> ceph-volume. Is polling the safe-to-destroy check the reason for
>> placing this in the mgr vs delegating the functionality to
>> ceph-volume?
>>
>> From http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/?highlight=bluestore#mark-out-and-replace
>> these are the steps that ceph-volume can handle today with internal
>> APIs:
>>
>> * identify if an OSD is bluestore/filestore
>> * identify what devices make up the OSD
>> * stop/start/status on systemctl units
>> * find the current mount point of an OSD, and see if devices are
>>   currently mounted at the target
>> * mount and unmount
>>
>> The guide doesn't explain anything about encryption; as complex as
>> that is today, it might be useful not to try to handle it in more
>> than one place.
>>
>>>>
>>>> This is a longer list than I'd like, but I don't see a way to make
>>>> it shorter (with the exception of dropping the ability to grow
>>>> groups).
>>>>
>>>> I've written a document to try and formalize this stuff a bit:
>>>> https://docs.google.com/document/d/1iwTnQc8d9W3BpQHgGYTMZSKvN6J7s0z8kQaYNxYvLho
>>>> (google docs may prompt you to ask for access)
>>>>
>>>> Just updating the orchestrator python code to reflect that doc now.
>>>>
>>>> John
>>>>
>>>>> Also it would need/like the ability to pass a list of OSD IDs to
>>>>> reuse to the new batch prepare function you're building...
>>>>
>>>>> Thoughts?
>>>>> sage
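
For what it's worth, a rough sketch of the per-OSD sequence we are
talking about: this is just the mark-out-and-replace steps from the
guide above wrapped in Python, not mgr or ceph-volume code. The
helper, device paths and DB partition are placeholders, and the
group-at-a-time handling of a shared journal SSD is exactly the part
it glosses over.

import subprocess
import time


def run(*cmd):
    # Run a command and raise if it fails.
    subprocess.run(list(cmd), check=True)


def convert_osd(osd_id, data_device, db_device=None):
    # Drain the OSD, then wait until the cluster says it is safe to remove.
    run('ceph', 'osd', 'out', str(osd_id))
    while subprocess.run(['ceph', 'osd', 'safe-to-destroy',
                          str(osd_id)]).returncode != 0:
        time.sleep(60)

    # Stop the daemon and release the filestore mount.
    run('systemctl', 'stop', 'ceph-osd@{}'.format(osd_id))
    run('umount', '/var/lib/ceph/osd/ceph-{}'.format(osd_id))

    # Wipe the old data device and mark the OSD destroyed so its ID can
    # be reused.
    run('ceph-volume', 'lvm', 'zap', data_device)
    run('ceph', 'osd', 'destroy', str(osd_id), '--yes-i-really-mean-it')

    # Recreate the OSD as bluestore with the same ID; the old journal
    # partition becomes block.db when one is given.
    create = ['ceph-volume', 'lvm', 'create', '--bluestore',
              '--data', data_device, '--osd-id', str(osd_id)]
    if db_device:
        create += ['--block.db', db_device]
    run(*create)


# Example values only:
# convert_osd(12, '/dev/sdb', '/dev/sdk1')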