Re: Expanding a ceph cluster with ansible

Bryan,

Answers inline.

> On 24 Jun 2015, at 00:52, Stillwell, Bryan <bryan.stillwell@xxxxxxxxxxx> wrote:
> 
> Sébastien,
> 
> Nothing has gone wrong with using it in this way, it just has to do with
> my lack of experience with ansible/ceph-ansible.  I'm learning both now,
> but would love if there were more documentation around using them.  For
> example this documentation around using ceph-deploy is pretty good, and
> I was hoping for something equivalent for ceph-ansible:
> 
> http://ceph.com/docs/master/rados/deployment/
> 

Well, if this is not enough: https://github.com/ceph/ceph-ansible/wiki
Please open an issue with what’s missing and I’ll make sure to clarify everything ASAP.

> 
> With that said, I'm wondering what tweaks you think would be needed to
> get ceph-ansible working on an existing cluster?

There are critical variables to edit, so the first thing to do is to make sure these variables exactly match your current configuration.

Btw I just tried the following:

* deployed a cluster with ceph-deploy: 1 mon (on ceph1) and 3 OSDs (on ceph4, ceph5, ceph6)
* 1 SSD for the journal per OSD

Then I configured ceph-ansible normally:

* ran ‘ceph fsid’ to pick up the uuid in use and set it in group_vars/{all,mons,osds} (the fsid variable)
* collected the monitor keyring from /var/lib/ceph/mon/ceph-ceph1/keyring and put it in group_vars/mons as monitor_secret
* configured the monitor_interface variable in group_vars/all; this one might be tricky, so make sure ceph-deploy used the right interface beforehand
* changed the journal_size variable in group_vars/all to 5120 (the ceph-deploy default)
* changed the public_network and cluster_network variables in group_vars/all
* removed everything in ~/ceph-ansible/fetch
* configured ceph-ansible to use dedicated journals (journal_collocation: false, raw_multi_journal: true, and edited the raw_journal_devices variable); see the sketch below
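For reference, a minimal sketch of what those edits could look like; every concrete value below (fsid, secret, interface, networks, devices) is a placeholder that you must replace with what your existing cluster actually reports:

# group_vars/all -- values must match the existing cluster (placeholders shown)
fsid: d5a3e8f0-1111-2222-3333-444455556666   # output of 'ceph fsid'
monitor_interface: eth0                      # interface ceph-deploy bound the mon to
journal_size: 5120                           # ceph-deploy default
public_network: 192.168.0.0/24
cluster_network: 192.168.0.0/24

# group_vars/mons
monitor_secret: "AQ..."                      # the key= value from the mon keyring

# group_vars/osds -- dedicated journal scenario
journal_collocation: false
raw_multi_journal: true
devices:
  - /dev/sdb
raw_journal_devices:
  - /dev/sdf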

Finally, I ran “ansible-playbook site.yml” and everything went well.
I now have 3 monitors and 4 new OSDs per host, all using the same SSDs, so 15 OSDs in total.
Given that ceph-ansible follows ceph-deploy best practices, it worked without too much difficulty.
I’d say that it depends on how the cluster was bootstrapped in the first place.

> 
> Also to answer your other questions, I haven't tried expanding the cluster
> with ceph-ansible yet.  I'm playing around with it in vagrant/virtualbox,
> and it looks pretty awesome so far!  If everything goes well, I'm not
> against revisiting the choice of puppet-ceph and replacing it with
> ceph-ansible.

Awesome, don’t hesitate and let me know if I can help with this task.

> 
> One other question, how well does ceph-ansible handle replacing a failed
> HDD (/dev/sdo) that has the journal at the beginning or middle of an SSD
> (/dev/sdd2)?

At the moment, it doesn’t.
Ceph-ansible just expects some basic mapping between OSDs and journals.
ceph-disk will do the partitioning, so ceph-ansible doesn’t have any knowledge of the layout.
I’d say that this intelligence should probably go into ceph-disk itself; the idea would be to tell ceph-disk to re-use a partition that previously hosted a journal.
Then we can build another ansible playbook to re-populate a list of OSDs that died.
I’ll have a look at that and will let you know.
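In the meantime it has to be done by hand. A rough sketch of the manual flow with the devices from your question (the OSD id 12 is made up for illustration); this relies on ceph-disk using the journal argument as-is when it is already a partition:

# remove the dead OSD from the cluster (osd.12 is a placeholder id)
ceph osd out 12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
# zap only the replacement data disk, never the SSD holding the journals
ceph-disk zap /dev/sdo
# prepare the new disk against the existing journal partition
ceph-disk prepare /dev/sdo /dev/sdd2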

A bit more about device management in ceph-ansible, which depends on the scenario you choose.
Let’s assume you go with dedicated SSDs for your journals; we then have 2 variables:

* devices (https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/defaults/main.yml#L51): contains the list of devices on which to store the OSD data
* raw_journal_devices (https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/defaults/main.yml#L89): contains the list of SSDs that will host the journals

So you can imagine having:


devices:
  - /dev/sdb
  - /dev/sdc
  - /dev/sdd
  - /dev/sde


raw_journal_devices:
  - /dev/sdu
  - /dev/sdu
  - /dev/sdv
  - /dev/sdv

Where sdb and sdc will have sdu as their journal device, and sdd and sde will have sdv as theirs.

I should probably rework this part a little with an easier declaration, though...
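Purely hypothetical (this syntax does not exist in ceph-ansible today), but something like an explicit mapping would be clearer:

# hypothetical declaration, not implemented
devices_to_journals:
  /dev/sdb: /dev/sdu
  /dev/sdc: /dev/sdu
  /dev/sdd: /dev/sdv
  /dev/sde: /dev/sdv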

> Thanks,
> Bryan
> 
> On 6/22/15, 7:09 AM, "Sebastien Han" <seb@xxxxxxxxxx> wrote:
> 
>> Hi Bryan,
>> 
>> It shouldn't be a problem for ceph-ansible to expand a cluster even if it
>> wasn't deployed with it.
>> I believe this requires a bit of tweaking on the ceph-ansible side, but
>> it's not much.
>> Can you elaborate on what went wrong and perhaps how you configured
>> ceph-ansible?
>> 
>> As far as I understood, you haven't been able to grow the size of your
>> cluster by adding new disks/nodes?
>> Is this statement correct?
>> 
>> One more thing, why don't you use ceph-ansible entirely to do the
>> provisioning and life cycle management of your cluster? :)
>> 
>>> On 18 Jun 2015, at 00:14, Stillwell, Bryan
>>> <bryan.stillwell@xxxxxxxxxxx> wrote:
>>> 
>>> I've been working on automating a lot of our ceph admin tasks lately and
>>> am pretty pleased with how the puppet-ceph module has worked for installing
>>> packages, managing ceph.conf, and creating the mon nodes.  However, I don't
>>> like the idea of puppet managing the OSDs.  Since we also use ansible in my
>>> group, I took a look at ceph-ansible to see how it might be used to complete
>>> this task.  I see examples for doing a rolling update and for doing an os
>>> migration, but nothing for adding a node or multiple nodes at once.  I don't
>>> have a problem doing this work, but wanted to check with the community if
>>> anyone has experience using ceph-ansible for this?
>>> 
>>> After a lot of trial and error I found the following process works well
>>> when using ceph-deploy, but it's a lot of steps and can be error prone
>>> (especially if you have old cephx keys that haven't been removed yet):
>>> 
>>> # Disable backfilling and scrubbing to prevent too many performance
>>> # impacting tasks from happening at the same time.  Maybe adding norecover
>>> # to this list might be a good idea so only peering happens at first.
>>> ceph osd set nobackfill
>>> ceph osd set noscrub
>>> ceph osd set nodeep-scrub
>>> 
>>> # Zap the disks to start from a clean slate
>>> ceph-deploy disk zap dnvrco01-cephosd-025:sd{b..y}
>>> 
>>> # Prepare the disks.  I found sleeping between adding each disk can help
>>> # prevent performance problems.
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdh:/dev/sdb; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdi:/dev/sdb; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdj:/dev/sdb; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdk:/dev/sdc; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdl:/dev/sdc; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdm:/dev/sdc; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdn:/dev/sdd; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdo:/dev/sdd; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdp:/dev/sdd; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdq:/dev/sde; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdr:/dev/sde; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sds:/dev/sde; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdt:/dev/sdf; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdu:/dev/sdf; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdv:/dev/sdf; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdw:/dev/sdg; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdx:/dev/sdg; sleep 15
>>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdy:/dev/sdg; sleep 15
>>> 
>>> # Weight in the new OSDs.  We set 'osd_crush_initial_weight = 0' to prevent
>>> # them from being added in during the prepare step.  Maybe a longer wait
>>> # in the last step would make this step unnecessary.
>>> ceph osd crush reweight osd.450 1.09; sleep 60
>>> ceph osd crush reweight osd.451 1.09; sleep 60
>>> ceph osd crush reweight osd.452 1.09; sleep 60
>>> ceph osd crush reweight osd.453 1.09; sleep 60
>>> ceph osd crush reweight osd.454 1.09; sleep 60
>>> ceph osd crush reweight osd.455 1.09; sleep 60
>>> ceph osd crush reweight osd.456 1.09; sleep 60
>>> ceph osd crush reweight osd.457 1.09; sleep 60
>>> ceph osd crush reweight osd.458 1.09; sleep 60
>>> ceph osd crush reweight osd.459 1.09; sleep 60
>>> ceph osd crush reweight osd.460 1.09; sleep 60
>>> ceph osd crush reweight osd.461 1.09; sleep 60
>>> ceph osd crush reweight osd.462 1.09; sleep 60
>>> ceph osd crush reweight osd.463 1.09; sleep 60
>>> ceph osd crush reweight osd.464 1.09; sleep 60
>>> ceph osd crush reweight osd.465 1.09; sleep 60
>>> ceph osd crush reweight osd.466 1.09; sleep 60
>>> ceph osd crush reweight osd.467 1.09; sleep 60
>>> 
>>> # Once all the OSDs are added to the cluster, allow the backfill process
>>> # to begin.
>>> ceph osd unset nobackfill
>>> 
>>> # Then once cluster is healthy again, re-enable scrubbing
>>> ceph osd unset noscrub
>>> ceph osd unset nodeep-scrub
>>> 
>>> 
>> 
> 
> 


Cheers.
–––– 
Sébastien Han 
Senior Cloud Architect 

"Always give 100%. Unless you're giving blood."

Mail: seb@xxxxxxxxxx 
Address: 11 bis, rue Roquépine - 75008 Paris

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



