cluster down during backfilling, Jewel tunables and client IO optimisations

Hello ceph users,

I've recently upgraded my ceph cluster from Hammer to Jewel (10.2.1 and then 10.2.2), and the cluster was running okay after the upgrade. Since ceph status was complaining about the straw version and warning that my settings were not optimal for Jewel (I don't think I've touched the tunables since Firefly), I read the release notes and the tunables section of the docs and decided to set the crush tunables to optimal. For context, a few weeks earlier I had done a reweight-by-utilization, which moved around roughly 8% of the objects in the cluster; that process caused no downtime and IO to the virtual machines remained available. I have also set several options over time to prioritise client IO during repair and backfilling (see the config show output below).
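
For reference, the tunables change itself was just the standard CLI command (roughly what I ran, from memory):

    # show the current tunables profile before changing anything
    ceph osd crush show-tunables

    # switch to the Jewel-optimal profile - this is what kicked off the data movement
    ceph osd crush tunables optimal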

Right, so, after I set the tunables to optimal, the cluster indicated that it needed to move around 61% of its data. The process started and I was seeing recovery speeds of between 800MB/s and 1.5GB/s. My cluster is pretty small (3 osd servers with 30 osds in total). The load on the osd servers was pretty low: a typical load of 4, spiking to around 10, with IO wait at a reasonable 5-15%. There were around 10-15 backfill operations running at any one time.
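
I was keeping an eye on the recovery with the usual tools, roughly:

    ceph -s          # overall health and degraded/misplaced percentages
    ceph -w          # live recovery throughput and client IO
    iostat -x 5      # per-disk utilisation and await on the osd servers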

About 10 minutes after the optimal tunables were set, I noticed that IO wait on the vms started to increase. Initially it was 15%; after another 10 minutes or so it rose to around 50%, and 30-40 minutes later it reached 95-100% on all vms. Shortly after that the vms showed a bunch of hung task warnings in the dmesg output and then stopped responding altogether. Nothing like this happened after the reweight-by-utilization a few weeks earlier: IO wait on the vms during the reweighting was around 15-20%, there were no hung tasks, and all vms kept running well.

I wasn't sure how to resolve the problem. I know that recovery and backfilling put extra load on the cluster, but they should never break client IO; after all, that would negate one of the key points of ceph as a resilient storage cluster. Looking at the ceph -w output, client IO had dropped to 0-20 IOPS, whereas the typical load I see at that time of day is around 700-1000 IOPS.

The strange thing is that after the cluster finished moving the data (it took around 11 hours), client IO was still not available! I was not able to start any new vms despite an OK health status and all PGs being active+clean. The osd servers had almost zero load, all PGs were active+clean, all osds were up and all mons were up, yet there was no client IO. The cluster only became operational again after I rebooted one of the osd servers, which seems to have brought it back to life.
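
In hindsight, before rebooting I should probably have looked for stuck requests on the osds, something along the lines of (osd.0 here is just an example id):

    ceph health detail                      # lists any blocked/slow requests
    ceph daemon osd.0 dump_ops_in_flight    # ops currently in flight on a given osd
    ceph daemon osd.0 dump_historic_ops     # recent slow ops on that osd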

My question to the community is: what ceph options should be set to make sure that client IO is _always_ available and has the highest priority during any recovery/migration/backfilling operation?

My current settings, which I've gathered over the years from the advice of mailing list and IRC members, are:

osd_recovery_max_chunk = 8388608
osd_recovery_op_priority = 1
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_threads = 1
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = idle
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
osd_deep_scrub_stride = 1048576
mon_osd_min_down_reporters = 6
mon_osd_report_timeout = 1800
mon_osd_min_down_reports = 7
osd_heartbeat_grace = 60
osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
osd_mkfs_options_xfs = -f -i size=2048
filestore_max_sync_interval = 15
filestore_op_threads = 8
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_disk_threads = 8
osd_op_threads = 8
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_crush_update_on_start = false
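
The values above came from the running config; they can be read back from an osd's admin socket, and the recovery throttles can be changed at runtime with injectargs if anyone wants me to try different values, e.g. (osd.0 is just an example):

    # read the effective values from a running osd
    ceph daemon osd.0 config show | grep -E 'backfill|recovery'

    # change the recovery throttles on all osds without a restart
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'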

Many thanks

Andrei
