Hi,

it's been a while, but we are still fighting with this issue. As suggested, we deleted all snapshots, but the errors still occur.

We were able to gather some more information. The reason why the OSDs are crashing is this assert:

https://github.com/ceph/ceph/blob/luminous/src/osd/PrimaryLogPG.cc#L353

With debug 20 we see this right before the OSD crashes:

2018-04-24 13:59:38.047697 7f929ba0d700 20 osd.4 pg_epoch: 144994 pg[0.103( v 140091'469328 (125640'467824,140091'469328] lb 0:c0e04acc:::rbd_data.221bf2eb141f2.0000000000016379:head (bitwise) local-lis/les=137681/137682 n=9535 ec=115/115 lis/c 144979/49591 les/c/f 144980/49596/0 144978/144979/144979) [4,17,2]/[2,17] r=-1 lpr=144979 pi=[49591,144979)/3 luod=0'0 crt=140091'469328 lcod 0'0 active+remapped] snapset 0=[]:[] legacy_snaps []

2018-04-24 16:34:54.558159 7f1c40e32700 20 osd.11 pg_epoch: 145549 pg[0.103( v 140091'469328 (125640'467824,140091'469328] lb 0:c0e04acc:::rbd_data.221bf2eb141f2.0000000000016379:head (bitwise) local-lis/les=138310/138311 n=9535 ec=115/115 lis/c 145548/49591 les/c/f 145549/49596/0 145547/145548/145548) [11,17,2]/[2,17] r=-1 lpr=145548 pi=[49591,145548)/3 luod=0'0 crt=140091'469328 lcod 0'0 active+remapped] snapset 0=[]:[] legacy_snaps []

This output comes from this code:

https://github.com/ceph/ceph/blob/luminous/src/osd/PrimaryLogPG.cc#L349-L350

I've appended two small helper sketches below the quoted thread (one to decode the pool's removed_snaps intervals, one to collect the snap ids from the _scan_snaps messages).

Any help would really be appreciated.

Best Regards

Jan

On 12.04.18 at 10:53, Paul Emmerich wrote:
> Hi,
>
> thanks, but unfortunately it's not the thing I suspected :(
> Anyway, there's something wrong with your snapshots; the log also
> contains a lot of entries like this:
>
> 2018-04-09 06:58:53.703353 7fb8931a0700 -1 osd.28 pg_epoch: 88438
> pg[0.5d( v 88438'223279 (86421'221681,88438'223279]
> local-lis/les=87450/87451 n=5634 ec=115/115 lis/c 87450/87450 les/c/f
> 87451/87451/0 87352/87450/87450) [37,6,28] r=2 lpr=87450 luod=0'0
> crt=88438'223279 lcod 88438'223278 active] _scan_snaps no head for
> 0:ba087b0f:::rbd_data.221bf2eb141f2.0000000000001436:46aa (have MIN)
>
> The cluster I debugged with the same crash also had a lot of snapshot
> problems, including this one.
> In the end, only manually marking all snap_ids as deleted in the pool
> helped.
>
>
> Paul
>
> 2018-04-10 21:48 GMT+02:00 Jan Marquardt <jm@xxxxxxxxxxx
> <mailto:jm@xxxxxxxxxxx>>:
>
> On 10.04.18 at 20:22, Paul Emmerich wrote:
> > Hi,
> >
> > I encountered the same crash a few months ago, see
> > https://tracker.ceph.com/issues/23030
> <https://tracker.ceph.com/issues/23030>
> >
> > Can you post the output of
> >
> > ceph osd pool ls detail -f json-pretty
> >
> >
> > Paul
>
> Yes, of course.
>
> # ceph osd pool ls detail -f json-pretty
>
> [
>     {
>         "pool_name": "rbd",
>         "flags": 1,
>         "flags_names": "hashpspool",
>         "type": 1,
>         "size": 3,
>         "min_size": 2,
>         "crush_rule": 0,
>         "object_hash": 2,
>         "pg_num": 768,
>         "pg_placement_num": 768,
>         "crash_replay_interval": 0,
>         "last_change": "91256",
>         "last_force_op_resend": "0",
>         "last_force_op_resend_preluminous": "0",
>         "auid": 0,
>         "snap_mode": "selfmanaged",
>         "snap_seq": 35020,
>         "snap_epoch": 91219,
>         "pool_snaps": [],
>         "removed_snaps": "[1~4562,47f1~58,484a~9,4854~70,48c5~36,48fc~48,4945~d,4953~1,4957~1,495a~3,4960~1,496e~3,497a~1,4980~2,4983~3,498b~1,4997~1,49a8~1,49ae~1,49b1~2,49b4~1,49b7~1,49b9~3,49bd~5,49c3~6,49ca~5,49d1~4,49d6~1,49d8~2,49df~2,49e2~1,49e4~2,49e7~5,49ef~2,49f2~2,49f5~6,49fc~1,49fe~3,4a05~9,4a0f~4,4a14~4,4a1a~6,4a21~6,4a29~2,4a2c~3,4a30~1,4a33~5,4a39~3,4a3e~b,4a4a~1,4a4c~2,4a50~1,4a52~7,4a5a~1,4a5c~2,4a5f~4,4a64~1,4a66~2,4a69~2,4a6c~4,4a72~1,4a74~2,4a78~3,4a7c~6,4a84~2,4a87~b,4a93~4,4a99~1,4a9c~4,4aa1~7,4aa9~1,4aab~6,4ab2~2,4ab5~5,4abb~2,4abe~9,4ac8~a,4ad3~4,4ad8~13,4aec~16,4b03~6,4b0a~c,4b17~2,4b1a~3,4b1f~4,4b24~c,4b31~d,4b3f~13,4b53~1,4bfc~13ed,61e1~4a,622c~8,6235~a0,62d6~ac,63a6~2,63b2~2,63d0~2,63f7~2,6427~2,6434~10f]",
>         "quota_max_bytes": 0,
>         "quota_max_objects": 0,
>         "tiers": [],
>         "tier_of": -1,
>         "read_tier": -1,
>         "write_tier": -1,
>         "cache_mode": "none",
>         "target_max_bytes": 0,
>         "target_max_objects": 0,
>         "cache_target_dirty_ratio_micro": 0,
>         "cache_target_dirty_high_ratio_micro": 0,
>         "cache_target_full_ratio_micro": 0,
>         "cache_min_flush_age": 0,
>         "cache_min_evict_age": 0,
>         "erasure_code_profile": "",
>         "hit_set_params": {
>             "type": "none"
>         },
>         "hit_set_period": 0,
>         "hit_set_count": 0,
>         "use_gmt_hitset": true,
>         "min_read_recency_for_promote": 0,
>         "min_write_recency_for_promote": 0,
>         "hit_set_grade_decay_rate": 0,
>         "hit_set_search_last_n": 0,
>         "grade_table": [],
>         "stripe_width": 0,
>         "expected_num_objects": 0,
>         "fast_read": false,
>         "options": {},
>         "application_metadata": {
>             "rbd": {}
>         }
>     }
> ]
>
> "Unfortunately" I started the crashed OSDs again in the meantime,
> because the first pgs have been down before. So currently all OSDs are
> running.
>
> Regards,
>
> Jan
>
>
> --
> --
> Paul Emmerich
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io <http://www.croit.io>
> Tel: +49 89 1896585 90
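
As mentioned above, here is the first helper. It is only a rough sketch of my own, nothing from the Ceph tree: it decodes the removed_snaps string from the ceph osd pool ls detail output quoted above (assuming the intervals are hexadecimal start~length pairs, which is how they appear there) and checks whether a given snap id, for example the 0x46aa from Paul's _scan_snaps line, is already marked as removed. Please verify the format against your own output before relying on it.

#!/usr/bin/env python
# Rough helper (assumption: removed_snaps is printed as hex "start~length"
# intervals, as in the pool output quoted above). Decode the string and
# check whether a given snap id is already marked as removed.

def parse_removed_snaps(s):
    """Turn "[1~4562,47f1~58,...]" into (start, end) pairs, end exclusive."""
    intervals = []
    for part in s.strip("[]").split(","):
        if not part:
            continue
        start, length = (int(x, 16) for x in part.split("~"))
        intervals.append((start, start + length))
    return intervals

def is_removed(snap_id, intervals):
    return any(lo <= snap_id < hi for lo, hi in intervals)

if __name__ == "__main__":
    # Paste the full removed_snaps string from your pool here; this is only
    # the truncated beginning of the one quoted above.
    removed = "[1~4562,47f1~58,484a~9]"
    intervals = parse_removed_snaps(removed)
    # 0x46aa is the clone snap id from the "_scan_snaps no head for ...:46aa" line
    for snap in (0x46aa,):
        print("snap 0x%x removed: %s" % (snap, is_removed(snap, intervals)))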
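
And the second sketch, along the same lines: it collects the affected objects and clone snap ids from the "_scan_snaps no head for" messages Paul pointed out, so one can see which snap_ids are involved before marking them as deleted. The regex is based on the single log entry quoted above (which is one line in the actual log, only wrapped in the mail); it is just an illustration and may need adjusting for your own logs.

#!/usr/bin/env python
# Rough helper: collect objects and clone snap ids from "_scan_snaps no
# head for" messages in an OSD log (pattern based on the log entry quoted
# above). Example usage:
#   python scan_snaps_no_head.py /var/log/ceph/ceph-osd.28.log

import re
import sys
from collections import defaultdict

# e.g. "... _scan_snaps no head for <object>:<hex snap id> (have MIN)"
PATTERN = re.compile(r"_scan_snaps no head for (\S+):([0-9a-f]+) \(have MIN\)")

def scan(path):
    snaps_per_object = defaultdict(set)
    with open(path) as log:
        for line in log:
            m = PATTERN.search(line)
            if m:
                obj, snap = m.group(1), int(m.group(2), 16)
                snaps_per_object[obj].add(snap)
    return snaps_per_object

if __name__ == "__main__":
    for obj, snaps in sorted(scan(sys.argv[1]).items()):
        print("%s: %s" % (obj, ", ".join("0x%x" % s for s in sorted(snaps))))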