Hi,

we were able to solve these issues. We switched the bcache OSDs from the
ssd to the hdd device class in the ceph osd tree and lowered the max
recovery setting from 3 to 1.

Thanks for your help!
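For reference, a rough sketch of what that boils down to on the command
line (the OSD ids are placeholders, and the recovery option shown assumes
"max recovery" means osd_recovery_max_active, which defaults to 3):

    # move the bcache-backed OSDs from the ssd to the hdd device class
    ceph osd crush rm-device-class osd.0 osd.1
    ceph osd crush set-device-class hdd osd.0 osd.1

    # limit each OSD to a single active recovery op (default is 3)
    ceph tell osd.* injectargs '--osd_recovery_max_active=1'

    # persist the setting in ceph.conf under [osd]:
    # osd_recovery_max_active = 1

The idea is that the hdd device class plus the throttled recovery keeps
backfill/recovery reads from saturating the bcache OSDs, which is what
the iostat output further down showed.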
Greets,
Stefan

On 18.10.2018 at 15:42, David Turner wrote:
> What are your OSD node stats? CPU, RAM, quantity and size of OSD disks.
> You might need to modify some bluestore settings to speed up the time it
> takes to peer, or perhaps you might just be underpowering the amount of
> OSD disks you're trying to run and your servers and OSD daemons are going
> as fast as they can.
> On Sat, Oct 13, 2018 at 4:08 PM Stefan Priebe - Profihost AG
> <s.priebe@xxxxxxxxxxxx> wrote:
>
> and a 3rd one:
>
>   health: HEALTH_WARN
>           1 MDSs report slow metadata IOs
>           1 MDSs report slow requests
>
> 2018-10-13 21:44:08.150722 mds.cloud1-1473 [WRN] 7 slow requests, 1
> included below; oldest blocked for > 199.922552 secs
> 2018-10-13 21:44:08.150725 mds.cloud1-1473 [WRN] slow request 34.829662
> seconds old, received at 2018-10-13 21:43:33.321031:
> client_request(client.216121228:929114 lookup #0x1/.active.lock
> 2018-10-13 21:43:33.321594 caller_uid=0, caller_gid=0{}) currently
> failed to rdlock, waiting
>
> The relevant OSDs are bluestore again, running at 100% I/O:
>
> iostat shows:
> sdi  77,00  0,00  580,00  97,00  511032,00  972,00  1512,57  14,88  22,05  24,57  6,97  1,48  100,00
>
> so it reads with 500MB/s, which completely saturates the osd. And it does
> for > 10 minutes.
>
> Greets,
> Stefan
>
> On 13.10.2018 at 21:29, Stefan Priebe - Profihost AG wrote:
> >
> > osd.19 is a bluestore osd on a healthy 2TB SSD.
> >
> > Log of osd.19 is here:
> > https://pastebin.com/raw/6DWwhS0A
> >
> > On 13.10.2018 at 21:20, Stefan Priebe - Profihost AG wrote:
> >> Hi David,
> >>
> >> I think this should be the problem - from a new log from today:
> >>
> >> 2018-10-13 20:57:20.367326 mon.a [WRN] Health check update: 4 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:57:41.268674 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs peering (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:08.684451 mon.a [WRN] Health check failed: 1 osds down
> >> (OSD_DOWN)
> >> ...
> >> 2018-10-13 20:58:22.841210 mon.a [WRN] Health check failed: Reduced data
> >> availability: 8 pgs inactive (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:47.570017 mon.a [WRN] Health check update: Reduced data
> >> availability: 5 pgs inactive (PG_AVAILABILITY)
> >> ...
> >> 2018-10-13 20:58:49.142108 osd.19 [WRN] Monitor daemon marked osd.19
> >> down, but it is still running
> >> 2018-10-13 20:58:53.750164 mon.a [WRN] Health check update: Reduced data
> >> availability: 3 pgs inactive (PG_AVAILABILITY)
> >> ...
> >>
> >> so there is a timeframe of > 90s where PGs are inactive and unavailable -
> >> this would at least explain stalled I/O to me?
> >>
> >> Greets,
> >> Stefan
> >>
> >>
> >> On 12.10.2018 at 15:59, David Turner wrote:
> >>> The PGs per OSD does not change unless the OSDs are marked out. You
> >>> have noout set, so that doesn't change at all during this test. All of
> >>> your PGs peered quickly at the beginning and then were active+undersized
> >>> the rest of the time, you never had any blocked requests, and you always
> >>> had 100MB/s+ client IO. I didn't see anything wrong with your cluster
> >>> to indicate that your clients had any problems whatsoever accessing data.
> >>>
> >>> Can you confirm that you saw the same problems while you were running
> >>> those commands? The next likely thing would be that a client isn't
> >>> getting an updated OSD map to indicate that the host and its OSDs are
> >>> down, and it's stuck trying to communicate with host7. That would
> >>> indicate a potential problem with the client being unable to communicate
> >>> with the Mons maybe? Have you completely ruled out any network problems
> >>> between all nodes and all of the IPs in the cluster? What does your
> >>> client log show during these times?
> >>>
> >>> On Fri, Oct 12, 2018 at 8:35 AM Nils Fahldieck - Profihost AG
> >>> <n.fahldieck@xxxxxxxxxxxx> wrote:
> >>>
> >>> Hi, in our `ceph.conf` we have:
> >>>
> >>> mon_max_pg_per_osd = 300
> >>>
> >>> While the host is offline (9 OSDs down):
> >>>
> >>> 4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
> >>>
> >>> If all OSDs are online:
> >>>
> >>> 4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD
> >>>
> >>> ... so this doesn't seem to be the issue.
> >>>
> >>> If I understood you right, that's what you meant. If I got you wrong,
> >>> would you mind pointing to one of those threads you mentioned?
> >>>
> >>> Thanks :)
> >>>
> >>> On 12.10.2018 at 14:03, Burkhard Linke wrote:
> >>> > Hi,
> >>> >
> >>> >
> >>> > On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
> >>> >> I rebooted a Ceph host and logged `ceph status` & `ceph health detail`
> >>> >> every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data
> >>> >> availability: pgs peering'. At the same time some VMs hung as described
> >>> >> before.
> >>> >
> >>> > Just a wild guess... you have 71 OSDs and about 4500 PGs with size=3,
> >>> > i.e. 13500 PG instances overall, resulting in ~190 PGs per OSD under
> >>> > normal circumstances.
> >>> >
> >>> > If one host is down and the PGs have to re-peer, you might reach the
> >>> > limit of 200 PGs/OSD on some of the OSDs, resulting in stuck peering.
> >>> >
> >>> > You can try to raise this limit. There are several threads on the
> >>> > mailing list about this.
> >>> >
> >>> > Regards,
> >>> > Burkhard
> >>> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com