Hi,

so far the key values for that are:

osd_client_op_priority = 63   (the default anyway, but I set it explicitly as a reminder)
osd_recovery_op_priority = 1

In addition I set:

osd_max_backfills = 1
osd_recovery_max_active = 1

-------------------

So according to your settings everything is fine. From what you described, the
problem was not the backfilling but something else inside the cluster. Maybe
something was blocked somewhere and only a reset could help. The logs might
have given an answer to that.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Amtsgericht Hanau
Management: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 18.06.2016 at 18:04, Andrei Mikhailovsky wrote:
> Hello ceph users,
>
> I've recently upgraded my ceph cluster from Hammer to Jewel (10.2.1 and
> then 10.2.2). The cluster was running okay after the upgrade. I decided
> to use the optimal tunables for Jewel, as ceph status was complaining
> about the straw version and about my cluster settings not being optimal
> for Jewel. I had not touched the tunables since the Firefly release, I
> think. After reading the release notes and the tunables section I
> decided to set the crush tunables value to optimal. For context, a few
> weeks ago I had done a reweight-by-utilization, which moved around about
> 8% of the objects in my cluster. That process caused no downtime and IO
> to the virtual machines remained available. I have also altered several
> settings to prioritise client IO during repair and backfilling (see the
> config show output below).
>
> Right, so, after I set the tunables to optimal, the cluster indicated
> that it needed to move around 61% of the data in the cluster. The
> process started and I was seeing recovery speeds of between 800MB/s and
> 1.5GB/s. My cluster is pretty small (3 osd servers with 30 osds in
> total). The load on the osd servers was pretty low: a typical load of 4,
> spiking to around 10. The IO wait values on the osd servers were also
> reasonable, around 5-15%. There were around 10-15 backfilling processes.
>
> About 10 minutes after the optimal tunables were set I noticed that the
> IO wait on the vms started to increase. Initially it was 15%; after
> another 10 minutes or so it increased to around 50%, and 30-40 minutes
> later the IO wait reached 95-100% on all vms. Shortly after that the vms
> showed a bunch of hung tasks in the dmesg output and soon stopped
> responding altogether. This kind of behaviour did not happen after the
> reweight-by-utilization I had done a few weeks earlier: the vms' IO wait
> during the reweighting was around 15-20%, there were no hung tasks and
> all vms were running pretty well.
>
> I wasn't sure how to resolve the problem. On one hand I know that
> recovery and backfilling cause extra load on the cluster, but they
> should never break client IO. After all, that would negate one of the
> key points behind ceph: a resilient storage cluster. Looking at the
> ceph -w output, client IO had dropped to 0-20 IOPS, whereas the typical
> load I see at that time of day is around 700-1000 IOPS.
>
> The strange thing is that after the cluster finished moving the data (it
> took around 11 hours), client IO was still not available!
> I was not able to start any new vms despite the cluster having OK health
> status and all PGs in the active+clean state. This was pretty strange:
> all osd servers had almost zero load, all PGs were active+clean, all
> osds were up and all mons were up, yet there was no client IO. The
> cluster became operational once again after a reboot of one of the osd
> servers, which seems to have brought it back to life.
>
> My question to the community is: what ceph options should be set to make
> sure that client IO is _always_ available and has the highest priority
> during any recovery/migration/backfilling operations?
>
> My current settings, which I've gathered over the years from the advice
> of mailing list and irc members, are:
>
> osd_recovery_max_chunk = 8388608
> osd_recovery_op_priority = 1
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_threads = 1
> osd_disk_thread_ioprio_priority = 7
> osd_disk_thread_ioprio_class = idle
> osd_scrub_chunk_min = 1
> osd_scrub_chunk_max = 5
> osd_deep_scrub_stride = 1048576
> mon_osd_min_down_reporters = 6
> mon_osd_report_timeout = 1800
> mon_osd_min_down_reports = 7
> osd_heartbeat_grace = 60
> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
> osd_mkfs_options_xfs = -f -i size=2048
> filestore_max_sync_interval = 15
> filestore_op_threads = 8
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> osd_disk_threads = 8
> osd_op_threads = 8
> osd_pool_default_pg_num = 1024
> osd_pool_default_pgp_num = 1024
> osd_crush_update_on_start = false
>
> Many thanks
>
> Andrei

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
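
For reference, the recovery/backfill throttling discussed in this thread is
usually expressed either persistently in ceph.conf or injected at runtime.
The sketch below only restates the values already quoted above; the [osd]
section placement and the injectargs command are a general illustration of
how one might apply them, not something either poster reported running.

    # ceph.conf, persistent form (picked up by OSDs on restart)
    [osd]
        osd client op priority   = 63   # default; higher value = higher priority for client ops
        osd recovery op priority = 1    # keep recovery ops at the lowest priority
        osd max backfills        = 1    # at most one concurrent backfill per OSD
        osd recovery max active  = 1    # at most one active recovery op per OSD

    # Runtime form, applied to all OSDs without a restart:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

If recovery still starves clients, backfill and recovery can also be paused
entirely with "ceph osd set nobackfill" / "ceph osd set norecover" and resumed
with the matching "unset" commands; that is a general technique rather than
something suggested in this thread.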
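
On the point about something being blocked inside the cluster: slow or blocked
requests normally show up in the health output, and individual OSDs can be
inspected over their admin socket. Again only a general sketch; <id> is a
placeholder and the daemon commands have to be run on that OSD's host.

    # Slow/blocked requests, stuck PGs, etc. are listed here
    ceph health detail

    # On the OSD's host: requests currently in flight and recent slow ops
    ceph daemon osd.<id> dump_ops_in_flight
    ceph daemon osd.<id> dump_historic_ops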