Re: Removing Snapshots Killing Cluster Performance

> On 01 Dec 2014, at 13:37, Daniel Schneller <daniel.schneller@xxxxxxxxxxxxxxxx> wrote:
> 
> On 2014-12-01 10:03:35 +0000, Dan Van Der Ster said:
> 
>> Which version of Ceph are you using? This could be related: http://tracker.ceph.com/issues/9487
> 
> Firefly. I had seen this ticket earlier (when deleting a whole pool) and hoped
> the backport of the fix would be available some time soon. I must admit, I did
> not look this up before posting, because I had forgotten about it.
> 
>> See "ReplicatedPG: don't move on to the next snap immediately"; basically, the OSD is getting into a tight loop "trimming" the snapshot objects. The fix above breaks out of that loop more frequently, and then you can use the osd snap trim sleep option to throttle it further. I’m not sure if the fix above will be sufficient if you have many objects to remove per snapshot.
> 
> Just so I get this right: With the fix alone you are not sure it would be "nice"
> enough, so adjusting the snap trim sleep option in addition might be needed?
> I assume the loop that will be broken up with 9487 does not take the sleep
> time into account?

You will probably need the osd snap trim sleep regardless. IIRC, the previous (i.e. current) behaviour of the snap trimmer is to “sleep” only once per PG. So if it takes many seconds to trim all the objects in a single PG, then the sleep is basically useless. The fix breaks out of the trimming loop within a single PG, though I don’t recall how often the sleep happens in the new behaviour (every object, every N objects, etc.).
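For reference, that throttle is just an OSD option, so it could go into ceph.conf along these lines (the 0.05 here is only a placeholder value, see below for picking one):

  [osd]
  osd snap trim sleep = 0.05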

If you want to watch the snap trim logs, you need debug_osd=10 to see the snap trim operations, and debug_osd=20 to see when the snap trim sleep occurs.
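For example, something like this should let you bump the log level on one OSD at runtime and watch what the trimmer is doing (osd.0 and the grep pattern are only illustrative; the exact log text varies between versions):

  ceph tell osd.0 injectargs '--debug_osd 20'
  tail -f /var/log/ceph/ceph-osd.0.log | grep -i trim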

Once you have the logs and confirm that the snap trimming is causing your problems, I would experiment with different snap trim sleep values. With the fix for #9487 in production, something between 0.01 and 0.05 s probably makes sense; without that fix, you may want to try a much larger sleep.
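To experiment without restarting the OSDs, injectargs should work here too (again, the value is only a starting point to tune from):

  ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.05'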


>> That commit is only in giant at the moment. The backport to dumpling is in the dumpling branch but not yet in a release, and firefly is still pending.
> 
> Holding my breath :)
> 
> Any thoughts on the other items I had in the original post?
> 
>>> 2) Is there a way to get a decent approximation of how much work
>>> deleting a specific snapshot will entail (in terms of objects, time,
>>> whatever)?

I don’t know.

>>> 3) Would SSD journals help here? Or any other hardware configuration
>>> change for that matter?

I’m not sure an SSD journal would help very much here.

But maybe the IO priority options will help in your case:

  osd disk thread ioprio class = idle
  osd disk thread ioprio priority = 0

(Note that for these to take effect you also need to use the cfq I/O scheduler on your OSD disks.)
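In case it is useful, the scheduler can be checked and switched per disk via sysfs (sdb is just a placeholder for one of your OSD data disks; make the change persistent via udev rules or kernel boot parameters):

  cat /sys/block/sdb/queue/scheduler
  echo cfq > /sys/block/sdb/queue/scheduler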

These ioprio options should help if client IOs and snap trim IOs are competing for disk bandwidth (though a PG lock may still end up blocking client IOs...).

Cheers, Dan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




