Hi,

We are currently running the latest firefly (0.80.9) and have difficulties maintaining good throughput when Ceph is backfilling/recovering and/or deep-scrubbing after an outage. It got to the point where, when the VMs using rbd start misbehaving (load rising, simple SQL update queries taking several seconds), I use a script looping through tunable periods of max_backfills/max_recoveries = 1/0. We recently had power outages and couldn't restart all the OSDs (one server needed special care), and since we only have 4 servers with 6 OSDs each there was a fair amount of rebalancing.

What seems to work with our current load is the following (simplified sketches of the two loops are included further down, after the two questions):
1/ disable deep-scrub and scrub (deactivating scrub might not be needed: it doesn't seem to have much impact on performance),
2/ activate the max_backfills/recoveries = 1/0 loop with 30 seconds for each,
3/ wait for the rebalancing to finish, reactivate scrub,
4/ activate the (un)set nodeep-scrub loop with 30 seconds unset, 120 seconds set,
5/ wait for deep-scrubs to catch up (ie: none active during several consecutive 30-second "unset" periods),
6/ revert to the normal configuration.
This can take about a day for us (we have about 800GB per OSD when in the degraded 3-server configuration).

I have two ideas/questions:

1/ Deep scrub scheduling

Deep scrubs can happen in bursts with the current algorithm, which really harms performance even with CFQ and lower priorities. We have a total of 1216 pgs (1024 for our rbd pool) and an osd deep scrub interval of 2 weeks, so on average a deep scrub would only need to happen about every 16 minutes globally. When recovering from an outage the current algorithm wants to catch up, and even though only one scrub per OSD can happen at a time, VM disk accesses involve many OSDs, so having multiple deep-scrubs running across the cluster seems to hurt more than when only one happens at a time. A smoother distribution when catching up could help a lot (at least our experience seems to point in this direction). I'm even considering scheduling deep-scrubs ahead of time: keeping the interval at 2 weeks in ceph.conf, but distributing them myself at a rate that targets completion in a week (a rough pacing sketch is included further down). Does this make any sense? Is there any development in this direction already (feature request #7288 didn't seem to go this far and #7540 had no activity)?

2/ Bigger journals

There's not much said about how to tune the journal size and filestore max sync interval, and I'm not sure what the drawbacks of much larger journals and a much larger max sync interval are. Obviously a sync would be more costly, but if it takes less time to complete than the journal takes to fill up, even while there are deep-scrubs or backfills, I'm not sure how this would hurt performance. In our case we have a bit less than 5GB of data per pg (for the rbd pool) and use 5GB journals (on the same disk as the OSD, in a separate partition at the beginning of the disk). I'm wondering if we could lower the impact of deep-scrubs by buffering more activity in the journal. If we could lower the rate at which each OSD does deep-scrubs (in the way I'm thinking of scheduling them in the previous point), it might give an OSD time to catch up on filestore syncs between them and avoid contention between deep-scrubs, journal writes and filestore syncs all happening at the same time. I assume deep scrubs and journal writes are mostly sequential, so in our configuration we can assume roughly half of the available disk throughput is available for each of them.
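For reference, here are simplified sketches of the two loops mentioned in the procedure above (the real script is a bit more involved; I'm assuming osd_recovery_max_active is the right knob for what I called "max_recoveries", and the timings are obviously tunable):

    # step 2/ -- alternate between one backfill/recovery at a time and none,
    # 30 seconds each, cluster-wide (run while the rebalancing is going on)
    while true; do
        ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
        sleep 30
        ceph tell osd.\* injectargs '--osd_max_backfills 0 --osd_recovery_max_active 0'
        sleep 30
    done

    # step 4/ -- let deep-scrubs start during 30 seconds, then prevent new ones
    # during 120 seconds (run once the rebalancing has finished)
    while true; do
        ceph osd unset nodeep-scrub
        sleep 30
        ceph osd set nodeep-scrub
        sleep 120
    done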
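To illustrate the kind of pre-scheduling I have in mind for 1/, something along these lines (untested, and the pg dump parsing is approximate; 1216 pgs spread over one week is roughly one deep-scrub every 8 minutes):

    # walk all pgs and force a deep-scrub on each of them, paced so that the
    # whole cluster is covered in about a week (1216 pgs / 7 days ~= 1 pg / 8 min)
    ceph pg dump pgs 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}' |
    while read pgid; do
        ceph pg deep-scrub "$pgid"
        sleep 500
    done

Ideally the pgs would be ordered by last deep-scrub timestamp, and I'm assuming a manually triggered deep-scrub updates that timestamp so the OSD's own scheduler doesn't redo it before the 2-week interval, but I haven't verified this.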
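For context, the journal-related bits of our current configuration look roughly like this:

    [osd]
        osd journal size = 5120             # MB: 5GB partition on the same disk as the OSD
        filestore max sync interval = 30    # seconds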
So if we want to avoid filestore syncs during deep-scrubs, it seems to me we should have a journal at least twice the size of our largest pgs and tune the filestore max sync interval to at least the expected duration of a deep scrub. What worries me is that in our current configuration this would mean at least doubling our journal size (10GB instead of 5GB) and, given half of a ~120MB/s throughput, a max sync interval of ~90 seconds (we use 30 seconds currently). This is far from the default values (and as we use about 50% of the storage capacity and have a good pg/OSD ratio, we might even target twice these values to support pgs twice as large as our current ones). Does that make any sense?

I'm not sure how backfills and recoveries work internally: I couldn't find a way to make the OSD wait a bit between each batch to give the filestore sync a chance to go through. If this idea makes sense for deep-scrubs, I assume it might work for backfills/recoveries to smooth I/Os too (if they can be configured to pause a bit between batches).

Best regards,

Lionel