On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
>> Hi,
>>
>> Recent versions of Ceph introduce some unexpected behavior for
>> permanent connections (VM or kernel clients): after crash recovery,
>> I/O will hang on the next planned scrub in the following scenario:
>>
>> - launch a bunch of clients doing non-intensive writes,
>> - lose one or more OSDs, mark them down, wait for recovery to complete,
>> - do a slow scrub, e.g. scrubbing one OSD per 5 minutes from a bash
>> script, or wait for Ceph to do the same on its own,
>> - observe a rising number of pgs stuck in the active+clean+scrubbing
>> state (they took over the primary role from replicas on the killed
>> OSDs and were almost certainly being written to at the time of the
>> crash),
>> - some time later, clients hang hard and the Ceph log reports
>> stuck (old) I/O requests.
>>
>> The only way to bring clients back without losing their I/O state is a
>> per-OSD restart, which also gets rid of the active+clean+scrubbing pgs.
>>
>> First of all, I'll be happy to help solve this problem by providing
>> logs.
>
> If you can reproduce this behavior with 'debug osd = 20' and
> 'debug ms = 1' logging on the OSD, that would be wonderful!
>

I have tested a slightly different recovery flow, please see below. There
is no real harm in it, such as frozen I/O, but placement groups were
still stuck forever in the active+clean+scrubbing state until I restarted
all OSDs (see the end of the log):

http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

- start the healthy cluster
- start persistent clients
- add another host with a pair of OSDs and let them join the data
placement
- wait for the data to rearrange
- [22:06 timestamp] mark the OSDs out or simply kill them and wait (since
I have a half-hour delay on readjustment in that case, I ran ``ceph osd
out'' manually)
- watch the data rearrange again
- [22:51 timestamp] when it ends, start a manual rescrub; at the end of
the process a non-zero number of placement groups are left in the
active+clean+scrubbing state, where they stay forever until something
happens

After that I can restart the OSDs one by one, if I want to get rid of the
scrubbing states immediately, and then do a deep-scrub (if I don't, those
states will return at the next Ceph self-scrub), or do a per-OSD
deep-scrub if I have a lot of time.

The case I described in the previous message happened when I removed an
OSD from the data placement that already existed at the moment the
client(s) started, and it is indeed more harmful than the current one
(frozen I/O hangs the entire guest, for example). Since testing that flow
takes a lot of time, I'll send the logs for that case tomorrow.

>> The second question is not directly related to this problem, but I
>> have thought about it for a long time - are there planned features to
>> control the scrub process more precisely, e.g. a pg scrub rate or
>> scheduled scrubs, instead of the current set of timeouts, which are of
>> course not very predictable as to when they run?
>
> Not yet. I would be interested in hearing what kind of control/config
> options/whatever you (and others) would like to see!

Of course it would be awesome to have a deterministic scheduler, or at
least an option to disable automated scrubbing, since it is not very
deterministic in time and a deep-scrub eats a lot of I/O when the command
is issued against an entire OSD. Rate limiting is not the first priority,
at least it can be recreated in an external script, but for those who
prefer to leave control to Ceph it could be very useful.

Thanks!
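
P.S. For completeness, a rough sketch of the commands behind the flow
above. The OSD ids, the number of OSDs and the five-minute interval are
illustrative only, not the exact values from my cluster:

    # logging Sage asked for, set in ceph.conf on the OSD hosts:
    #   [osd]
    #       debug osd = 20
    #       debug ms = 1

    # take the newly added OSDs out of the data placement (example ids)
    ceph osd out 4
    ceph osd out 5

    # slow manual rescrub, one OSD per five minutes
    for osd in 0 1 2 3; do
        ceph osd scrub $osd
        sleep 300
    done

    # watch for pgs stuck in the scrubbing state
    ceph pg dump | grep active+clean+scrubbing

    # the only way out so far: restart each OSD, then deep-scrub it
    # (the init invocation depends on the distro)
    service ceph restart osd.0
    ceph osd deep-scrub 0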