On Thu, 22 Nov 2012, Andrey Korolyov wrote:
> Hi,
>
> Recent versions of Ceph show some unexpected behavior for permanent
> connections (VM or kernel clients) - after crash recovery, I/O will
> hang on the next planned scrub in the following scenario:
>
> - launch a bunch of clients doing non-intensive writes,
> - lose one or more OSDs, mark them down, and wait for recovery to
>   complete,
> - do a slow scrub, e.g. scrubbing one OSD every 5 minutes from a bash
>   script, or wait for Ceph to do the same on its own,
> - observe a rising number of PGs stuck in the active+clean+scrubbing
>   state (they took the primary role over from the killed OSDs and
>   were almost surely being written to at the time of the crash),
> - some time later, clients hang hard and the Ceph log reports stuck
>   (old) I/O requests.
>
> The only way to bring the clients back without losing their I/O state
> is a per-OSD restart, which also gets rid of the
> active+clean+scrubbing PGs.
>
> First of all, I'll be happy to help solve this problem by providing
> logs.

If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
1' logging on the OSD, that would be wonderful!  (A sketch of how to
enable that is appended at the end of this mail.)

> A second question, not directly related to this problem, but one I
> have been thinking about for a long time - are there any planned
> features to control the scrub process more precisely, e.g. a PG scrub
> rate or scheduled scrubs, instead of the current set of timeouts,
> which are of course not very predictable as to when they run?

Not yet.  I would be interested in hearing what kind of control/config
options/whatever you (and others) would like to see!

Thanks-
sage
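
For reference, a minimal sketch of enabling that logging - assuming the
settings go in the [osd] section of ceph.conf followed by a daemon
restart, or are injected into a running OSD (osd.0 is just an example
id, and the injectargs syntax may vary slightly between releases):

    [osd]
        debug osd = 20
        debug ms = 1

    # or, on a running daemon:
    ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1'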
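
A rough sketch of the "one OSD every 5 minutes" slow scrub from the
scenario above, assuming 'ceph osd ls' and 'ceph osd scrub' are
available in this release:

    #!/bin/bash
    # Ask each OSD to scrub in turn, waiting 5 minutes between requests.
    for osd in $(ceph osd ls); do
        echo "requesting scrub on osd.$osd"
        ceph osd scrub "$osd"
        sleep 300
    done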
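
For completeness, the "current set of timeouts" referred to in the
question are the per-OSD scrub interval options, roughly as below
(option names from memory, values purely illustrative - check the docs
for your release):

    [osd]
        osd scrub min interval = 300       # don't rescrub a PG within this many seconds
        osd scrub max interval = 604800    # force a scrub after this many seconds
        osd scrub load threshold = 0.5     # skip scheduled scrubs when load is above this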