Re: High memory usage kills OSD while peering

On Sun, 27 Aug 2017, Linux Chips wrote:
> Hi again,
> now everything is almost sorted out. We had a few inconsistent shards that
> were killing the OSDs during recovery; we fixed some of them by removing the
> bad shards, and some by starting other OSDs with good shards.
> What is stopping us now is that one OSD has a corrupted leveldb and refuses
> to start.
> I am not sure how that happened, but I assume it is due to the many times the
> node/OSD died from lack of memory.
> I am also not sure whether we should continue the discussion here or start a
> new thread.
> 
> The OSD (262) shows these log messages on startup:
> 
> 2017-08-26 17:07:17.915861 7fbd8e4cbd00  0 set uid:gid to 0:0 (:)
> 2017-08-26 17:07:17.915875 7fbd8e4cbd00  0 ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process (unknown), pid 26713
> 2017-08-26 17:07:17.927085 7fbd8e4cbd00  0 pidfile_write: ignore empty --pid-file
> 2017-08-26 17:07:17.951358 7fbd8e4cbd00  0 load: jerasure load: lrc load: isa
> 2017-08-26 17:07:17.951602 7fbd8e4cbd00  0 filestore(/var/lib/ceph/osd/ceph-262) backend xfs (magic 0x58465342)
> 2017-08-26 17:07:17.952164 7fbd8e4cbd00  0 filestore(/var/lib/ceph/osd/ceph-262) backend xfs (magic 0x58465342)
> 2017-08-26 17:07:17.952977 7fbd8e4cbd00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
> 2017-08-26 17:07:17.952983 7fbd8e4cbd00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
> 2017-08-26 17:07:17.952985 7fbd8e4cbd00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: splice() is disabled via 'filestore splice' config option
> 2017-08-26 17:07:17.953309 7fbd8e4cbd00  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
> 2017-08-26 17:07:17.953797 7fbd8e4cbd00  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-262) detect_feature: extsize is disabled by conf
> 2017-08-26 17:07:17.954628 7fbd8e4cbd00  0 filestore(/var/lib/ceph/osd/ceph-262) start omap initiation
> 2017-08-26 17:07:17.957166 7fbd8e4cbd00 -1 filestore(/var/lib/ceph/osd/ceph-262) mount(1724): Error initializing leveldb : Corruption: error in middle of record
> 
> 2017-08-26 17:07:17.957179 7fbd8e4cbd00 -1 osd.262 0 OSD:init: unable to mount object store
> 2017-08-26 17:07:17.957183 7fbd8e4cbd00 -1  ** ERROR: osd init failed: (1) Operation not permitted
> 
> ceph-objectstore-tool shows similar errors.
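> For reference, the sort of export we tried (pgid and output path are just
> placeholders) dies with a similar leveldb corruption error:
> 
>     # <pgid> and the output path are placeholders
>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-262 \
>         --journal-path /var/lib/ceph/osd/ceph-262/journal \
>         --op export --pgid <pgid> --file /tmp/<pgid>.export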
> 
> So we figured it is only one OSD and we can go without it, and we marked the
> OSD lost.
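> For the record, the mark-lost step was just the usual command, roughly:
> 
>     ceph osd lost 262 --yes-i-really-mean-it    # 262 is the dead OSD
> 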
> PGs started to peer and went active, but 5 remain in the incomplete state;
> the pg query for one of them shows:
> 
> ...
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Incomplete",
>             "enter_time": "2017-08-26 22:59:03.044623",
>             "comment": "not enough complete instances of this PG"
>         },
>         {
>             "name": "Started/Primary/Peering",
>             "enter_time": "2017-08-26 22:59:02.540748",
>             "past_intervals": [
>                 {
>                     "first": "959669",
>                     "last": "1090812",
>                     "all_participants": [
>                         {
>                             "osd": 258
>                         },
>                         {
>                             "osd": 262
>                         },
>                         {
>                             "osd": 338
>                         },
>                         {
>                             "osd": 545
>                         },
>                         {
>                             "osd": 549
>                         }
>                     ],
>                     "intervals": [
>                         {
>                             "first": "964880",
>                             "last": "964924",
>                             "acting": "262"
>                         },
>                         {
>                             "first": "978855",
>                             "last": "978956",
>                             "acting": "545"
>                         },
>                         {
>                             "first": "989628",
>                             "last": "989808",
>                             "acting": "258"
>                         },
>                         {
>                             "first": "992614",
>                             "last": "992975",
>                             "acting": "549"
>                         },
>                         {
>                             "first": "1085148",
>                             "last": "1090812",
>                             "acting": "338"
>                         }
>                     ]
>                 }
>             ],
>             "probing_osds": [
>                 "258",
>                 "338",
>                 "545",
>                 "549"
>             ],
>             "down_osds_we_would_probe": [
>                 262
>             ],
>             "peering_blocked_by": [],
>             "peering_blocked_by_detail": [
>                 {
>                     "detail": "peering_blocked_by_history_les_bound"
>                 }
>             ]
>         },
> ...
> 
> I am not sure what that "peering_blocked_by_history_les_bound" detail means,
> or how to proceed; googling it turned up nothing useful.
> All the incomplete PGs show the same detail and a similar recovery state.

It means that the pg metadata suggests that the PG may have gone active 
elsewhere (the "les" in the name is last_epoch_started), but we don't 
actually have any evidence that there were newer updates.  Since that OSD 
won't start and you can't extract the needed PGs from it with 
ceph-objectstore-tool export (or maybe you can get them from elsewhere?), 
there isn't much to lose by bypassing the check.  The config option 
(osd_find_best_info_ignore_history_les) has to be set to true on the primary 
OSD for the PG and peering retriggered (e.g., by marking the primary down 
with 'ceph osd down NN').
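
Something along these lines, assuming osd.NN stands in for the acting primary 
of a stuck PG (treat this as a sketch, not gospel):

    # ceph.conf on the primary's host; NN is a placeholder
    [osd.NN]
        osd find best info ignore history les = true

    # restart the primary so it picks up the option and re-peers
    systemctl restart ceph-osd@NN
    # (or inject the option at runtime and retrigger peering with: ceph osd down NN)

    # once the PG goes active, drop the option again and restart so the
    # normal check applies to future peering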

I'd test it on the 0 object PGs first :)

sage


> > ceph pg ls | grep incomplete
> 18.54b 0 0 0 0 0 0 2739 2739 incomplete 2017-08-26 23:15:46.705071 46889'4277 1091150:314001 [332,253] 332 [332,253] 332 46889'4277 2017-08-04 03:15:58.381025 46889'4277 2017-07-29 06:47:30.337673
> 19.54a 5950 0 0 0 0 26108435266 3019 3019 incomplete 2017-08-26 23:15:46.705156 961411'873129 1091150:58116482 [332,253] 332 [332,253] 332 960118'872495 2017-08-04 03:12:33.647414 952850'868978 2017-07-02 15:53:08.565948
> 19.608 0 0 0 0 0 0 0 0 incomplete 2017-08-26 22:59:03.044649 0'0 1091150:428 [258,338] 258 [258,338] 258 960118'862299 2017-08-04 03:01:57.011411 958900'861456 2017-07-28 02:33:29.476119
> 19.8bb 0 0 0 0 0 0 0 0 incomplete 2017-08-26 22:59:02.946453 0'0 1091150:339 [260,331] 260 [260,331] 260 960114'866811 2017-08-03 04:51:42.117840 952850'864443 2017-07-08 02:48:37.958357
> 19.dd3 5864 0 0 0 0 25600089555 3094 3094 incomplete 2017-08-26 17:20:07.948285 961411'865657 1091150:72381143 [263,142] 263 [263,142] 263 960118'865078 2017-08-25 17:32:06.181006 960118'865078 2017-08-25 17:32:06.181006
> 
> 
> I also noticed that some of those PGs report 0 objects even though the PG
> directory on one of the OSDs does contain objects.
> These pools are replica 2.
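> E.g., one way to see this on a filestore OSD, using an illustrative PG/OSD
> pair from the list above (19.608 maps to [258,338]):
> 
>     # filestore keeps PG data under current/<pgid>_head in the OSD's data dir
>     find /var/lib/ceph/osd/ceph-258/current/19.608_head/ -type f | head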
> 
> 
> thanks
> ali


