PGs switching to the peering state after a failure is normal and expected.
The important thing is how long they stay in that state; it shouldn't be
longer than a few seconds. It looks like less than 5 seconds from your log.
What might help here is the ceph -w log (or the mon cluster log file)
during an outage.

Also, get rid of that min_size = 1 setting; it will bite you in the long run.

Paul

On Fri, 12 Oct 2018 at 23:27, Stefan Priebe - Profihost AG
<s.priebe@xxxxxxxxxxxx> wrote:
>
> Hi David,
>
> On 12.10.2018 at 15:59, David Turner wrote:
> > The PGs per OSD does not change unless the OSDs are marked out. You
> > have noout set, so that doesn't change at all during this test. All of
> > your PGs peered quickly at the beginning and then were active+undersized
> > the rest of the time, you never had any blocked requests, and you always
> > had 100MB/s+ client IO. I didn't see anything wrong with your cluster
> > to indicate that your clients had any problems whatsoever accessing data.
> >
> > Can you confirm that you saw the same problems while you were running
> > those commands? The next thing would seem that possibly a client isn't
> > getting an updated OSD map to indicate that the host and its OSDs are
> > down and it's stuck trying to communicate with host7. That would
> > indicate a potential problem with the client being unable to communicate
> > with the Mons, maybe?
>
> Maybe, but what about this status:
> 'PG_AVAILABILITY Reduced data availability: pgs peering'
>
> See the log here: https://pastebin.com/wxUKzhgB
>
> PG_AVAILABILITY is noted at timestamps [2018-10-12 12:16:15.403394] and
> [2018-10-12 12:17:40.072655].
>
> And why do the Ceph docs say:
>
> Data availability is reduced, meaning that the cluster is unable to
> service potential read or write requests for some data in the cluster.
> Specifically, one or more PGs is in a state that does not allow IO
> requests to be serviced. Problematic PG states include peering, stale,
> incomplete, and the lack of active (if those conditions do not clear
> quickly).
>
> Greets,
> Stefan
>
> > On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG
> > <n.fahldieck@xxxxxxxxxxxx> wrote:
> >
> > Hi, in our `ceph.conf` we have:
> >
> > mon_max_pg_per_osd = 300
> >
> > While the host is offline (9 OSDs down):
> >
> > 4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
> >
> > If all OSDs are online:
> >
> > 4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD
> >
> > ... so this doesn't seem to be the issue.
> >
> > If I understood you right, that's what you meant. If I got you wrong,
> > would you mind pointing to one of those threads you mentioned?
> >
> > Thanks :)
> >
> > On 12.10.2018 at 14:03, Burkhard Linke wrote:
> > > Hi,
> > >
> > > On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
> > >> I rebooted a Ceph host and logged `ceph status` & `ceph health detail`
> > >> every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced
> > >> data availability: pgs peering'. At the same time some VMs hung as
> > >> described before.
> > >
> > > Just a wild guess... you have 71 OSDs and about 4500 PGs with size=3,
> > > i.e. 13500 PG instances overall, resulting in ~190 PGs per OSD under
> > > normal circumstances.
> > >
> > > If one host is down and the PGs have to re-peer, you might reach the
> > > limit of 200 PGs per OSD on some of the OSDs, resulting in stuck peering.
> > >
> > > You can try to raise this limit. There are several threads on the
> > > mailing list about this.
> > >
> > > Regards,
> > > Burkhard
> > >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
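
A rough sketch of the two suggestions above, i.e. raising the PG-per-OSD
limit that Burkhard mentions and dropping the min_size = 1 setting that Paul
warns about. The pool name "rbd" and the value 400 are placeholders only, and
depending on the Ceph release mon_max_pg_per_osd may have to be set in
ceph.conf and the mons/mgrs restarted rather than injected at runtime:

    # in ceph.conf under [global]: raise the soft PG-per-OSD limit
    mon_max_pg_per_osd = 400

    # or try injecting it at runtime (may not stick until the mons restart)
    ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'

    # check and raise min_size on a size=3 replicated pool
    ceph osd pool get rbd min_size
    ceph osd pool set rbd min_size 2

    # watch PG states during the next reboot test
    ceph -w
    ceph pg dump_stuck inactive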