Re: cluster down during backfilling, Jewel tunables and client IO optimisations


 



Hi Oliver,

I also find this to be strange behaviour indeed! I went through the logs and was not able to find any errors or issues. There were also no slow/blocked requests that I could see during the recovery process.

Does anyone have an idea what the issue could be here? I don't want to shut down all VMs every time there is a new release with updated tunable values.


Andrei



----- Original Message -----
> From: "Oliver Dzombic" <info@xxxxxxxxxxxxxxxxx>
> To: "andrei" <andrei@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Sunday, 19 June, 2016 10:14:35
> Subject: Re:  cluster down during backfilling, Jewel tunables and client IO optimisations

> Hi,
> 
> so far the key values for that are:
> 
> osd_client_op_priority = 63 (the default anyway, but I set it explicitly as a reminder)
> osd_recovery_op_priority = 1
> 
> 
> In addition I set:
> 
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> 
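> These can also be applied at runtime. A minimal sketch, assuming the admin
> keyring is available and you want the change to take effect without
> restarting the OSDs (injectargs applies immediately):
> 
>   # throttle recovery/backfill and keep client ops at the higher priority
>   ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
>   ceph tell osd.* injectargs '--osd_recovery_op_priority 1 --osd_client_op_priority 63'
> 
> and persisted in ceph.conf so they survive a restart:
> 
>   [osd]
>   osd_client_op_priority = 63
>   osd_recovery_op_priority = 1
>   osd_max_backfills = 1
>   osd_recovery_max_active = 1
> 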
> 
> -------------------
> 
> 
> But according to your settings it's all OK.
> 
> According to what you described, the problem was not the backfilling but
> something else inside the cluster. Maybe something was blocked somewhere
> and only a reset could help. The logs might have given an answer
> about that.
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:info@xxxxxxxxxxxxxxxxx
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 18.06.2016 um 18:04 schrieb Andrei Mikhailovsky:
>> Hello ceph users,
>> 
>> I've recently upgraded my ceph cluster from Hammer to Jewel (10.2.1 and
>> then 10.2.2). The cluster was running okay after the upgrade. I decided to
>> use the optimal tunables for Jewel, as ceph status was complaining about
>> the straw version and my crush settings not being optimal for Jewel. I
>> hadn't touched the tunables since the Firefly release, I think. After
>> reading the release notes and the tunables section I decided to set the
>> crush tunables value to optimal. For comparison, a few weeks ago I had
>> done a reweight-by-utilization, which moved around about 8% of the
>> cluster's objects. That process caused no downtime and IO to the virtual
>> machines remained available. I have also altered several settings to
>> prioritise client IO during repair and backfilling (see the config show
>> output below).
>> 
>> Right, so, after I set the tunables to optimal, my cluster indicated that
>> it needed to move around 61% of the data in the cluster. The process
>> started and I was seeing recovery speeds of between 800MB/s and 1.5GB/s.
>> My cluster is pretty small (3 osd servers with 30 osds in total). The load
>> on the osd servers was pretty low: a typical load of 4, spiking to around
>> 10. The IO wait values on the osd servers were also pretty reasonable,
>> around 5-15%. There were around 10-15 backfilling processes.
>> 
>> About 10 minutes after the optimal tunables were set I noticed that IO
>> wait on the vms started to increase. Initially it was 15%; after another
>> 10 mins or so it increased to around 50%, and about 30-40 minutes later
>> the iowait reached 95-100% on all vms. Shortly after that the vms showed a
>> bunch of hung tasks in the dmesg output and soon stopped responding
>> altogether. This kind of behaviour didn't happen after the
>> reweight-by-utilization I'd done a few weeks prior. The vms' IO wait
>> during the reweighting was around 15-20%, there were no hung tasks and all
>> vms were running pretty well.
>> 
>> I wasn't sure how to resolve the problem. On the one hand I know that
>> recovery and backfilling cause extra load on the cluster, but they should
>> never break client IO. After all, that would negate one of the key points
>> behind ceph - a resilient storage cluster. Looking at the ceph -w output,
>> the client IO had decreased to 0-20 IOPS, whereas the typical load I see
>> at that time of day is around 700-1000 IOPS.
>> 
>> The strange thing is that after the cluster had finished moving the data
>> (it took around 11 hours) client IO was still not available! I was not
>> able to start any new vms despite an OK health status and all PGs being
>> active+clean. This was pretty strange: all osd servers had almost zero
>> load, all PGs were active+clean, all osds were up and all mons were up,
>> yet there was no client IO. The cluster became operational once again
>> after a reboot of one of the osd servers, which seems to have brought the
>> cluster back to life.
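>> 
>> In hindsight, some state should have been captured before rebooting
>> anything. A rough checklist, assuming the admin sockets on the osd hosts
>> are reachable:
>> 
>>   ceph health detail
>>   ceph osd blocked-by                      # OSDs holding up peering, if any
>>   ceph daemon osd.0 dump_ops_in_flight     # run on the osd host, per OSD
>>   ceph daemon osd.0 dump_historic_ops      # recently completed slow ops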
>> 
>> My question to the community is: what ceph options should be set to make
>> sure that client IO is _always_ available and has the highest priority
>> during any recovery/migration/backfilling operations?
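>> 
>> One option that has come up, sketched here on the assumption that the
>> standard cluster flags behave as documented, is to pause recovery outright
>> whenever client IO starts to suffer, rather than relying on priorities
>> alone:
>> 
>>   # stop scheduling new backfill/recovery work; PGs stay active and
>>   # keep serving client IO
>>   ceph osd set nobackfill
>>   ceph osd set norecover
>> 
>>   # resume when the load allows
>>   ceph osd unset nobackfill
>>   ceph osd unset norecover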
>> 
>> My current settings, which I've gathered over the years from the advice
>> of mailing list and IRC members, are:
>> 
>> osd_recovery_max_chunk = 8388608
>> osd_recovery_op_priority = 1
>> osd_max_backfills = 1
>> osd_recovery_max_active = 1
>> osd_recovery_threads = 1
>> osd_disk_thread_ioprio_priority = 7
>> osd_disk_thread_ioprio_class = idle
>> osd_scrub_chunk_min = 1
>> osd_scrub_chunk_max = 5
>> osd_deep_scrub_stride = 1048576
>> mon_osd_min_down_reporters = 6
>> mon_osd_report_timeout = 1800
>> mon_osd_min_down_reports = 7
>> osd_heartbeat_grace = 60
>> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,allocsize=4M"
>> osd_mkfs_options_xfs = -f -i size=2048
>> filestore_max_sync_interval = 15
>> filestore_op_threads = 8
>> filestore_merge_threshold = 40
>> filestore_split_multiple = 8
>> osd_disk_threads = 8
>> osd_op_threads = 8
>> osd_pool_default_pg_num = 1024
>> osd_pool_default_pgp_num = 1024
>> osd_crush_update_on_start = false
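>> 
>> To confirm what the OSDs are actually running with (rather than what is in
>> ceph.conf), the admin socket can be queried on each osd host, e.g.:
>> 
>>   ceph daemon osd.0 config get osd_max_backfills
>>   ceph daemon osd.0 config get osd_recovery_max_active
>>   ceph daemon osd.0 config show | grep -E 'backfill|recovery'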
>> 
>> Many thanks
>> 
>> Andrei
>> 
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



