Hi,

we were able to solve these issues. We switched the bcache OSDs from the
ssd to the hdd device class in the ceph osd tree and lowered the max
recovery setting from 3 to 1.

Thanks for your help!
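For reference, a rough sketch of what that boils down to on the command
line (the OSD ids are placeholders, and the recovery option shown assumes
"max recovery" means osd_recovery_max_active, which defaults to 3):

    # move the bcache-backed OSDs from the ssd to the hdd device class
    ceph osd crush rm-device-class osd.0 osd.1
    ceph osd crush set-device-class hdd osd.0 osd.1

    # limit each OSD to a single active recovery op (default is 3)
    ceph tell osd.* injectargs '--osd_recovery_max_active=1'

    # persist the setting in ceph.conf under [osd]:
    # osd_recovery_max_active = 1

The idea is that the hdd device class plus the throttled recovery keeps
backfill/recovery reads from saturating the bcache OSDs, which is what
the iostat output further down showed.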
Greets,
Stefan

On 18.10.2018 at 15:42, David Turner wrote:
> What are your OSD node stats? CPU, RAM, quantity and size of OSD disks.
> You might need to modify some bluestore settings to speed up the time it
> takes to peer, or perhaps you might just be underpowering the amount of
> OSD disks you're trying to run and your servers and OSD daemons are going
> as fast as they can.
> On Sat, Oct 13, 2018 at 4:08 PM Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>
> and a 3rd one:
>
>   health: HEALTH_WARN
>           1 MDSs report slow metadata IOs
>           1 MDSs report slow requests
>
> 2018-10-13 21:44:08.150722 mds.cloud1-1473 [WRN] 7 slow requests, 1
> included below; oldest blocked for > 199.922552 secs
> 2018-10-13 21:44:08.150725 mds.cloud1-1473 [WRN] slow request 34.829662
> seconds old, received at 2018-10-13 21:43:33.321031:
> client_request(client.216121228:929114 lookup #0x1/.active.lock
> 2018-10-13 21:43:33.321594 caller_uid=0, caller_gid=0{}) currently
> failed to rdlock, waiting
>
> The relevant OSDs are bluestore again, running at 100% I/O:
>
> iostat shows:
> sdi  77,00  0,00  580,00  97,00  511032,00  972,00  1512,57  14,88  22,05  24,57  6,97  1,48  100,00
>
> so it reads with 500MB/s, which completely saturates the osd. And it does
> for > 10 minutes.
>
> Greets,
> Stefan
>
> On 13.10.2018 at 21:29, Stefan Priebe - Profihost AG wrote:
> >
> > osd.19 is a bluestore osd on a healthy 2TB SSD.
> >
> > Log of osd.19 is here:
> > https://pastebin.com/raw/6DWwhS0A
> >
> > On 13.10.2018 at 21:20, Stefan Priebe - Profihost AG wrote:
> >> Hi David,
> >>
> >> I think this should be the problem - from a new log from today:
> >>
> >> 2018-10-13 20:57:20.367326 mon.a [WRN] Health check update: 4 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:57:41.268674 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs peering (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:08.684451 mon.a [WRN] Health check failed: 1 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:58:22.841210 mon.a [WRN] Health check failed: Reduced data
> >> availability: 8 pgs inactive (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:47.570017 mon.a [WRN] Health check update: Reduced data
> >> availability: 5 pgs inactive (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:49.142108 osd.19 [WRN] Monitor daemon marked osd.19
> >> down, but it is still running
> >> 2018-10-13 20:58:53.750164 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs inactive (PG_AVAILABILITY)
> >> ...
> >>
> >> so there is a timeframe of > 90s where PGs are inactive and unavailable -
> >> this would at least explain stalled I/O to me?
> >>
> >> Greets,
> >> Stefan
> >>
> >>
> >> On 12.10.2018 at 15:59, David Turner wrote:
> >>> The PGs per OSD does not change unless the OSDs are marked out. You
> >>> have noout set, so that doesn't change at all during this test. All of
> >>> your PGs peered quickly at the beginning and then were active+undersized
> >>> the rest of the time, you never had any blocked requests, and you always
> >>> had 100MB/s+ client IO. I didn't see anything wrong with your cluster
> >>> to indicate that your clients had any problems whatsoever accessing data.
> >>>
> >>> Can you confirm that you saw the same problems while you were running
> >>> those commands? The next likely thing would be that a client isn't
> >>> getting an updated OSD map to indicate that the host and its OSDs are
> >>> down, and it's stuck trying to communicate with host7. That would
> >>> indicate a potential problem with the client being unable to communicate
> >>> with the Mons maybe? Have you completely ruled out any network problems
> >>> between all nodes and all of the IPs in the cluster? What does your
> >>> client log show during these times?
> >>>
> >>> On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG
> >>> <n.fahldieck@xxxxxxxxxxxx> wrote:
> >>>
> >>> Hi, in our `ceph.conf` we have:
> >>>
> >>> mon_max_pg_per_osd = 300
> >>>
> >>> While the host is offline (9 OSDs down):
> >>>
> >>> 4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
> >>>
> >>> If all OSDs are online:
> >>>
> >>> 4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD
> >>>
> >>> ... so this doesn't seem to be the issue.
> >>>
> >>> If I understood you right, that's what you meant. If I got you wrong,
> >>> would you mind pointing to one of those threads you mentioned?
> >>>
> >>> Thanks :)
> >>>
> >>> On 12.10.2018 at 14:03, Burkhard Linke wrote:
> >>> > Hi,
> >>> >
> >>> >
> >>> > On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
> >>> >> I rebooted a Ceph host and logged `ceph status` & `ceph health detail`
> >>> >> every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data
> >>> >> availability: pgs peering'. At the same time some VMs hung as described
> >>> >> before.
> >>> >
> >>> > Just a wild guess... you have 71 OSDs and about 4500 PGs with size=3,
> >>> > i.e. 13500 PG instances overall, resulting in ~190 PGs per OSD under
> >>> > normal circumstances.
> >>> >
> >>> > If one host is down and the PGs have to re-peer, you might reach the
> >>> > limit of 200 PGs/OSD on some of the OSDs, resulting in stuck peering.
> >>> >
> >>> > You can try to raise this limit. There are several threads on the
> >>> > mailing list about this.
> >>> >
> >>> > Regards,
> >>> > Burkhard
> >>> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com