Hi Daniel,

> After upgrading the cluster to Jewel, I changed our crushmap to use the
> newer straw2 algorithm, which resulted in a little data movement, but no
> problems at that stage.

I've not done that; instead I switched the profile to optimal right away
(I've sketched both sets of commands at the bottom of this mail).

> Once the cluster had settled down again, I set tunables to optimal
> (hammer profile -> jewel profile), which has triggered between 50% and
> 70% misplaced PGs on our clusters. This is when the trouble started each
> time, and when we had cascading failures of VMs.
>
> However, after performing hard shutdowns on the VMs and restarting them,
> they seemed to be OK.
>
> At this stage, I have a strong suspicion that it is the introduction of
> "require_feature_tunables5 = 1" in the tunables. This seems to require
> all RADOS connections to be re-established.

In my experience, shutting down a VM and restarting it didn't help. I
waited 30+ minutes for the VM to come back, but it was still unable to
start. I also noticed that it took a while for the VMs to start failing:
initially the iowait on the VMs went up just a bit, then it slowly
increased over the course of about an hour until all VMs were sitting at
100% iowait. If the forced re-establishment of RADOS connections were the
cause, wouldn't I see iowait jumping to 100% pretty quickly? Also, I
wasn't able to start any of my VMs until I rebooted one of my OSD/mon
servers after the PGs had finished rebuilding.

> On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> Hi Oliver,
>>
>> I am also seeing this as strange behaviour indeed! I was going through the
>> logs and I was not able to find any errors or issues. There were also no
>> slow/blocked requests that I could see during the recovery process.
>>
>> Does anyone have an idea what could be the issue here? I don't want to shut
>> down all VMs every time there is a new release with updated tunable values.
>>
>> Andrei
>>
>> ----- Original Message -----
>>> From: "Oliver Dzombic" <info@xxxxxxxxxxxxxxxxx>
>>> To: "andrei" <andrei@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
>>> Sent: Sunday, 19 June, 2016 10:14:35
>>> Subject: Re: cluster down during backfilling, Jewel tunables and
>>> client IO optimisations
>>>
>>> Hi,
>>>
>>> so far the key values for that are:
>>>
>>> osd_client_op_priority = 63 ( anyway default, but I set it to remember it )
>>> osd_recovery_op_priority = 1
>>>
>>> In addition I set:
>>>
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
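
PS: for anyone else following the thread, this is roughly how the two
approaches mentioned above are applied (file names are only placeholders,
and it is the tunables switch itself that triggers the big rebalance):

    # show the currently active tunables / profile
    ceph osd crush show-tunables

    # the route Daniel described: dump the crushmap, change "alg straw" to
    # "alg straw2" by hand, then recompile and inject it back
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # ... edit crushmap.txt ...
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # the route I took: switch straight to the optimal (jewel) profile
    ceph osd crush tunables optimal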
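
And for completeness, the way I understand Oliver's throttling values can
be applied, either permanently via ceph.conf or injected into the running
OSDs (I have not verified the runtime variant on my own cluster):

    # ceph.conf, [osd] section
    osd_client_op_priority = 63
    osd_recovery_op_priority = 1
    osd_max_backfills = 1
    osd_recovery_max_active = 1

    # or at runtime, without restarting the OSDs
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'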