Hi Andrija,

I've got at least two more stories of a similar nature: one from a friend running a Ceph cluster and one from me. Both of our clusters are pretty small. Mine has only two OSD servers with 8 OSDs each and 3 mons, with one SSD journal per 4 OSDs. My friend's cluster has 3 mons and 3 OSD servers with 4 OSDs each, also with one SSD per 4 OSDs. Both clusters are connected with 40Gbit/s IP-over-InfiniBand links.

We hit the same issue while upgrading to Firefly. However, we did not add any new disks; we just ran "ceph osd crush tunables optimal" after the upgrade. Both of our clusters were "down" as far as the virtual machines are concerned: all VMs crashed because of the lack of IO. That was rather problematic, considering that Ceph is typically so good at staying alive during failures and upgrades. So there seems to be a problem with the upgrade. I wish the devs had added a big note in red letters saying that running this command will likely affect your cluster performance, most likely all your VMs will die, and you should shut them down first if you do not want data loss.

I had already changed the default values to reduce the load during recovery and also to tune a few things performance-wise. My settings were as follows (a ceph.conf/injectargs sketch of the same values is in the P.S. at the bottom of this message):

osd recovery max chunk = 8388608
osd recovery op priority = 2
osd max backfills = 1
osd recovery max active = 1
osd recovery threads = 1
osd disk threads = 2
filestore max sync interval = 10
filestore op threads = 20
filestore_flusher = false

However, this didn't help much. Shortly after running the tunables command, iowait on my guest VMs quickly jumped to 50%, and to 99% a minute later. This happened on all VMs at once. During the recovery phase I ran "rbd -p <poolname> ls -l" several times and it took between 20 and 40 minutes to complete; it typically takes less than 2 seconds when the cluster is not in recovery mode. My mate's cluster had the same tunables apart from the last three, and he saw exactly the same behaviour.

One other thing I've noticed: somewhere in the docs I read that running the tunables optimal command should move no more than 10% of your data. However, in both of our cases the status showed just over 30% degraded, and it took the best part of 9 hours to complete the data reshuffling.

Any comments from the Ceph team or other Ceph gurus on:

1. What have we done wrong in our upgrade process?
2. What options should we have used to keep our VMs alive?

Cheers

Andrei

----- Original Message -----

From: "Andrija Panic" <andrija.panic@xxxxxxxxx>
To: ceph-users at lists.ceph.com
Sent: Sunday, 13 July, 2014 9:54:17 PM
Subject: ceph osd crush tunables optimal AND add new OSD at the same time

Hi,

after the CEPH upgrade (0.72.2 to 0.80.3) I issued "ceph osd crush tunables optimal" and after only a few minutes I added 2 more OSDs to the CEPH cluster...

So these 2 changes were more or less done at the same time - rebalancing because of tunables optimal, and rebalancing because of adding new OSDs...

Result - all VMs living on CEPH storage went mad, with effectively no disk access; blocked, so to speak.

Since this rebalancing took 5-6 hours, I had a bunch of VMs down for that long...

Did I do wrong by causing "2 rebalancings" to happen at the same time?

Is this behaviour normal - to cause such a great load on all VMs because they are unable to access CEPH storage effectively?

Thanks for any input...
--
Andrija Panić
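P.S. For reference, here is roughly how the settings listed above sit in my ceph.conf - I keep them in the [osd] section on both OSD nodes. Take this as a sketch of my setup rather than a recommended tuning, since as described it did not keep the VMs alive:

    [osd]
    # throttle backfill/recovery so client IO still gets a share
    osd recovery max chunk = 8388608
    osd recovery op priority = 2
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery threads = 1
    # remaining disk/filestore tuning
    osd disk threads = 2
    filestore max sync interval = 10
    filestore op threads = 20
    filestore_flusher = false

The recovery limits can also be pushed to running OSDs without a restart, e.g.:

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 2'

(I have not verified whether injecting them mid-rebalance would have made any difference in our case.)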