Re: stuck recovery for many days, help needed

10 GB of RAM per OSD process is huge! (It looks like a very old bug
in hammer.)
You should give more information: ceph.conf, OS version, hardware
config, debug level.
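
For reference, something like this collects most of that (run on an OSD
host; osd.0 is just an example id, adjust to your setup):

  ceph -v                                          # ceph version
  uname -r                                         # kernel version
  cat /etc/ceph/ceph.conf                          # cluster configuration
  ceph daemon osd.0 config show | grep debug_osd   # current osd debug level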


2017-09-21 20:07 GMT+02:00 Wyllys Ingersoll <wyllys.ingersoll@xxxxxxxxxxxxxx>:
> I have investigated the peering issues (down to 3 now). Mostly it's
> because the OSDs they are waiting on refuse to come up and stay up
> long enough to complete the requested operation, due to issue #1 below:
> ceph-osd assertion errors causing crashes.
>
> During heavy recovery, and after running for long periods of time, the
> OSDs consume far more than 1GB of RAM.  Here is an example (clipped
> from 'top'); the server has 10 ceph-osd processes, not all shown here,
> but you get the idea.  They all consume 10-20+GB of memory.
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  699905 ceph      20   0 25.526g 0.021t 125472 S  70.4 17.3  37:31.26 ceph-osd
>  662712 ceph      20   0 10.958g 6.229g 238392 S  39.9  5.0  98:34.80 ceph-osd
>  692981 ceph      20   0 14.940g 5.845g  84408 S  39.9  4.6  89:36.22 ceph-osd
>  553786 ceph      20   0 29.059g 0.011t 231992 S  35.5  9.1 612:15.30 ceph-osd
>  656799 ceph      20   0 27.610g 0.014t 197704 S  25.9 11.5 399:02.59 ceph-osd
>  662727 ceph      20   0 18.703g 0.013t 105012 S   4.7 10.9  90:20.22 ceph-osd
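>
> (Assuming the OSDs are built with tcmalloc, the default, the heap
> commands will show how much of that is memory the allocator is holding
> on to versus memory the daemon is really using; osd.12 below is just an
> example id:
>
>     ceph tell osd.12 heap stats
>     ceph tell osd.12 heap release
>
> "heap release" only asks tcmalloc to return already-freed pages to the
> OS, so it won't help if the daemons genuinely need that much.)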
>
> On Thu, Sep 21, 2017 at 1:47 PM, Vincent Godin <vince.mlist@xxxxxxxxx> wrote:
>> Hello,
>>
>> You should first investigate the 13 pgs which refuse to peer. They
>> probably refuse to peer because they're waiting for some OSDs with
>> more up-to-date data. Try to focus on one pg and restart the OSD the
>> pg is waiting for.
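>>
>> For example (2.1f5 is just a placeholder pgid and osd.12 a placeholder
>> osd id, substitute your own):
>>
>>   ceph health detail | grep -E 'down|incomplete|peering'
>>   ceph pg 2.1f5 query > /tmp/pg.2.1f5.json
>>
>> The "recovery_state" section of the query output should show which
>> osd(s) the pg is blocked on; restart that osd (systemctl restart
>> ceph-osd@12 on systemd hosts) and see if the pg peers.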
>>
>> I don't understand your memory problem very well: my hosts have 64GB
>> of RAM and (20 x 6TB SATA + 5 x 400GB SSD) OSDs each, and I have
>> encountered no memory problems (I'm on 10.2.7). An OSD consumes about
>> 1GB of RAM. How many OSD processes are running on one of your hosts,
>> and how much RAM is used by each OSD process? That may be your main
>> problem.
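>>
>> A plain Linux one-liner is enough to answer that, for example:
>>
>>   ps -o pid,rss,etime,args -C ceph-osd --sort=-rss
>>
>> (resident memory per ceph-osd process in kB, largest first).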
>>
>> 2017-09-21 16:08 GMT+02:00 Wyllys Ingersoll <wyllys.ingersoll@xxxxxxxxxxxxxx>:
>>> I have a damaged cluster that has been recovering for over a week and
>>> is still not getting healthy.  It will get to a point where the
>>> "degraded" recovery object count stops going down, and eventually the
>>> "misplaced" object count also stops going down and recovery basically
>>> stops.
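>>>
>>> (The per-pg detail behind the summary further down can be pulled with
>>> the usual commands, for example:
>>>
>>>   ceph health detail
>>>   ceph pg dump_stuck unclean
>>>   ceph pg dump_stuck inactive
>>>
>>> if anyone wants more than the counts shown below.)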
>>>
>>> Problems noted:
>>>
>>> - Memory exhaustion on storage servers. We have 192GB RAM and 64TB of
>>> disks (though only 40TB of disks are currently marked "up/in" in the
>>> cluster, to avoid crashing issues and some suspected bad disks).
>>>
>>> - OSD crashes.  We have a number of OSDs that repeatedly crash on or
>>> shortly after starting up and joining back into the cluster (crash
>>> logs were already sent to this list earlier this week).  Possibly due
>>> to hard drive issues, but none of them are marked as failing by SMART
>>> utilities.
>>>
>>> - Too many cephfs snapshots.  We have a cephfs with over 4800
>>> snapshots.  cephfs is currently unavailable during the recovery, but
>>> when it *was* available, deleting a single snapshot threw the system
>>> into a bad state - thousands of requests would become blocked, cephfs
>>> would become blocked and the entire cluster basically went to hell.  I
>>> believe a bug has been filed for this, but I think the impact is more
>>> severe and critical than originally suspected.
>>>
>>>
>>> Fixes attempted:
>>> - Upgraded everything to ceph 10.2.9 (was originally 10.2.7)
>>> - Upgraded kernels on storage servers to 4.13.1 to get around XFS problems.
>>> - disabled scrub and deep scrub
>>> - attempting to bring more OSDs online, but it's tricky because we
>>> end up either running into memory exhaustion problems or the OSDs
>>> crash shortly after starting, making them essentially useless (one
>>> idea is to throttle recovery while they rejoin; see the sketch below).
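>>>
>>> Would something like this be a reasonable way to reduce the load while
>>> the flapping OSDs rejoin? (jewel-era settings; the values are just an
>>> example)
>>>
>>>   ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'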
>>>
>>>
>>> Currently our status looks like this (MDSs are disabled intentionally
>>> for now; having them online makes no difference for recovery or cephfs
>>> availability):
>>>
>>>      health HEALTH_ERR
>>>             25 pgs are stuck inactive for more than 300 seconds
>>>             1398 pgs backfill_wait
>>>             72 pgs backfilling
>>>             38 pgs degraded
>>>             13 pgs down
>>>             1 pgs incomplete
>>>             2 pgs inconsistent
>>>             13 pgs peering
>>>             35 pgs recovering
>>>             37 pgs stuck degraded
>>>             25 pgs stuck inactive
>>>             1519 pgs stuck unclean
>>>             33 pgs stuck undersized
>>>             34 pgs undersized
>>>             81 requests are blocked > 32 sec
>>>             recovery 351883/51815427 objects degraded (0.679%)
>>>             recovery 4920116/51815427 objects misplaced (9.495%)
>>>             recovery 152/17271809 unfound (0.001%)
>>>             15 scrub errors
>>>             mds rank 0 has failed
>>>             mds cluster is degraded
>>>             noscrub,nodeep-scrub flag(s) set
>>>      monmap e1: 3 mons at
>>> {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>>>             election epoch 192, quorum 0,1,2 mon01,mon02,mon03
>>>       fsmap e18157: 0/1/1 up, 1 failed
>>>      osdmap e254054: 93 osds: 77 up, 76 in; 1511 remapped pgs
>>>             flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>>>       pgmap v36166916: 16200 pgs, 13 pools, 25494 GB data, 16867 kobjects
>>>             86259 GB used, 139 TB / 223 TB avail
>>>
>>>
>>> Any suggestions as to what to look for, or how to try to get this
>>> cluster healthy soon, would be much appreciated. It's literally been
>>> more than 2 weeks of battling various issues and we are no closer
>>> to a healthy, usable cluster.


