Hello ceph-users.
Short description: during snapshot removal, OSD utilisation goes
up to 100%, which leads to slow requests and VM failures due to
IOPS stalls.
We're using OpenStack Cinder with a Ceph cluster as the volume
backend. Ceph version is 10.2.6.
We also use cinder-backup to create backups of those volumes
in Ceph, which, as far as I understand, relies on the snapshot
and layering features.
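Roughly, the setup follows the usual RBD driver configuration
in cinder.conf; the snippet below is only illustrative (section,
pool and user names are placeholders, not our exact values):

    [ceph]
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_flatten_volume_from_snapshot = false

    [DEFAULT]
    backup_driver = cinder.backup.drivers.ceph
    backup_ceph_pool = backups
    backup_ceph_user = cinder-backup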
The cluster consists of 5 OSD nodes with mixed SSD/HDD storage,
separate SSDs for the HDD journals, separate 10Gb/s public and
private networks, and 3 MON nodes. We also have a single
"backup" node which is responsible for the "backups" pool,
handled by CRUSH map rules.
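The CRUSH rule pinning the "backups" pool to that node looks
roughly like this (simplified, decompiled form; the bucket name
is a placeholder):

    rule backups {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take backup-node
            step chooseleaf firstn 0 type osd
            step emit
    }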
While creating a backup everything looks good. The backup node
is overwhelmed with load, but that's to be expected. The problem
begins when we start deleting old backups.
While an old backup is being deleted, utilisation of the main
nodes' OSDs skyrockets to 100%. This leads to slow requests in
the main storage pools, which, given enough time, can cause
process hangs, or at least SCSI reset attempts inside the
guests, and in the worst cases complete VM hangs.
I'm looking for a solution to avoid this issue.
So far I've realised that I don't understand how Ceph snapshot
mechanics work at all, because I can't figure out why deleting a
backup generates requests not to the backup OSDs, where the
backup data is actually stored, but to the main OSDs, where the
original objects reside. Is there any good documentation on this?
Googling shows that I'm not the first one to encounter this
issue, but I couldn't find an exact solution anywhere. Here's a
short list of ideas (with a rough command sketch after the list):
- set osd snap trim priority = 1. This is reported as not very
helpful, as it is already lower than the client IO priority
of 63;
- disabling the fast-diff and object-map features seems to
help, but I'm not sure what the trade-offs are for this
scenario.
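Concretely, the commands I have in mind are something like the
following (not yet verified on our cluster; the image name is
just an example):

    # idea 1: lower the snap trim priority on all OSDs at runtime
    ceph tell osd.* injectargs '--osd_snap_trim_priority 1'

    # idea 2: disable fast-diff and object-map on a volume image
    # (fast-diff depends on object-map, so both go together)
    rbd feature disable volumes/volume-<uuid> fast-diff object-map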