Re: RADOS + deep scrubbing performance issues in production environment

Guang <guangyy@xxxxxxxxx> · Mon, 3 Feb 2014 21:40:07 +0800

+ceph-users.

Does anybody have similar experience with scrubbing / deep-scrubbing?

Thanks,
Guang

On Jan 29, 2014, at 10:35 AM, Guang <yguang11@xxxxxxxxx> wrote:

Glad to see there is some discussion around scrubbing / deep-scrubbing.

We are seeing the same thing: scrubbing can affect latency quite a bit. So far I have found two slow patterns (via dump_historic_ops): 1) ops waiting to be dispatched, and 2) ops waiting in the op work queue to be fetched by an available op thread. For the first pattern, it looks like there is a lock involved (the dispatcher stops working for 2 seconds and then resumes, and the same happens for the scrubber thread); that needs further investigation. For the second pattern, scrubbing brings in more ops (for the scrub checks), which increases the op threads' workload (client ops have a lower priority). I think that could be improved by increasing the number of op threads. I will confirm this analysis by adding more op threads and turning on scrubbing on a per-OSD basis.
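
As a rough illustration of how the two slow patterns could be told apart, the sketch below takes one op's event timeline and reports the largest gap between consecutive events. The event names ("received", "dispatched", "dequeued", "done") and the (seconds, name) tuple layout are hypothetical simplifications, not the actual dump_historic_ops output format, which would need to be parsed into this shape first.

```python
from typing import List, Tuple

# Hypothetical, simplified shape: (seconds since the op was received, event name).
# Real dump_historic_ops output is JSON and differs between Ceph versions.
Event = Tuple[float, str]


def slowest_stage(events: List[Event]) -> Tuple[str, float]:
    """Return (stage, seconds) for the largest gap between consecutive events."""
    ordered = sorted(events)
    worst = ("", 0.0)
    for (t0, e0), (t1, e1) in zip(ordered, ordered[1:]):
        gap = t1 - t0
        if gap > worst[1]:
            worst = (f"{e0} -> {e1}", gap)
    return worst


if __name__ == "__main__":
    # Made-up timeline resembling pattern 1 (a long wait before dispatch).
    op = [(0.0, "received"), (2.1, "dispatched"), (2.2, "dequeued"), (2.25, "done")]
    print(slowest_stage(op))
```

An op whose largest gap falls before dispatch would match the first pattern; one whose largest gap falls between being queued and being picked up by an op thread would match the second.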

Do the above observations and analysis make sense?

Thanks,
Guang

On Jan 29, 2014, at 2:13 AM, Filippos Giannakos <philipgian@xxxxxxxx> wrote:

On Mon, Jan 27, 2014 at 10:45:48AM -0800, Sage Weil wrote:
There is also 

 ceph osd set noscrub

and then later

 ceph osd unset noscrub

I forget whether this pauses an in-progress PG scrub or just makes it stop 
when it gets to the next PG boundary.

sage

I bumped into those settings but I couldn't find any documentation about them.
When I first tried them, they didn't do anything immediately, so I thought they
weren't the answer. After your mention, I tried them again, and after a while
the deep-scrubbing stopped. So I'm guessing they stop scrubbing on the next PG
boundary.
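
A minimal sketch of wrapping those flags so that scrubbing is paused around a latency-sensitive window and always re-enabled afterwards. It only uses the `ceph osd set` / `ceph osd unset` commands quoted above plus the companion `nodeep-scrub` flag; the helper names and the injectable `run` parameter are assumptions made for illustration. As observed above, scrubs that are already running may continue for a while after the flag is set.

```python
import subprocess
from contextlib import contextmanager


def scrub_flag_cmd(action, deep=False):
    """Build the argv for setting or unsetting a cluster-wide scrub flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    flag = "nodeep-scrub" if deep else "noscrub"
    return ["ceph", "osd", action, flag]


@contextmanager
def scrubbing_paused(run=subprocess.check_call, include_deep=True):
    """Set the no-scrub flag(s) on entry and unset them on exit.

    `run` is injectable (hypothetical convenience) so the helper can be
    exercised without a cluster.
    """
    deep_values = [False, True] if include_deep else [False]
    for deep in deep_values:
        run(scrub_flag_cmd("set", deep))
    try:
        yield
    finally:
        # Always re-enable scrubbing, even if the body raised.
        for deep in deep_values:
            run(scrub_flag_cmd("unset", deep))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    with scrubbing_paused(run=print):
        print("latency-sensitive work goes here")
```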

I see from this thread and earlier ones that some people think it is a spindle
issue. I'm not sure that it is just that. Reproducing it on an idle cluster that
can do more than 250 MiB/s, and seeing a single request pause for 4-5 seconds,
sounds like an issue in itself. Maybe there is too much locking, or not enough
priority given to the actual I/O? Also, the idea of throttling deep scrubbing
based on IOPS sounds appealing.
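
To make the IOPS-based throttling idea concrete, here is a purely illustrative sketch: it keeps a short window of recent client IOPS samples and allows scrubbing only while their average stays below a threshold. This is not how Ceph schedules scrubs; the class name, the threshold, and the window size are all hypothetical.

```python
from collections import deque


class IopsScrubThrottle:
    """Allow scrubbing only while average recent client IOPS is below a limit."""

    def __init__(self, max_client_iops, window=5):
        self.max_client_iops = max_client_iops
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, client_iops):
        """Record one client IOPS sample (e.g. taken once per second)."""
        self.samples.append(client_iops)

    def scrub_allowed(self):
        if not self.samples:
            return False  # no data yet: stay conservative
        average = sum(self.samples) / len(self.samples)
        return average < self.max_client_iops


if __name__ == "__main__":
    throttle = IopsScrubThrottle(max_client_iops=100, window=3)
    for sample in (50, 60, 70, 400):
        throttle.record(sample)
        print(sample, throttle.scrub_allowed())
```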

Kind Regards,
-- 
Filippos
<philipgian@xxxxxxxx>