And now, to close the circle completely.
Sadly, my solution was to run:

$ ceph pg repair 33.0

which returned:

2019-10-02 15:38:54.499318 osd.12 (osd.12) 181 : cluster [DBG] 33.0 repair starts
2019-10-02 15:38:55.502606 osd.12 (osd.12) 182 : cluster [ERR] 33.0 repair : stat mismatch, got 264/265 objects, 0/0 clones, 264/265 dirty, 264/265 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-02 15:38:55.503066 osd.12 (osd.12) 183 : cluster [ERR] 33.0 repair 1 errors, 1 fixed
And now my cluster is happy once more.
So, in case anyone else runs into this issue and doesn't think to run pg repair on the PG in question: go for it.
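For anyone skimming this thread later, the whole diagnose-and-repair sequence condenses down to a few commands (the pool name and PG id here are from my cluster; substitute your own):

```shell
# Surface the scrub error and identify the affected PG
ceph health detail

# Find which PG of the pool is flagged inconsistent
rados list-inconsistent-pg device_health_metrics

# Optionally list inconsistent objects; in my case this was empty,
# since the mismatch was in the PG stats rather than any object
rados list-inconsistent-obj 33.0 | jq

# Ask the primary OSD to repair the PG, then watch the cluster log
# for a line like "33.0 repair 1 errors, 1 fixed"
ceph pg repair 33.0
```

These all need a live cluster with the right admin keyring, so they are shown as an operational fragment rather than something you can run standalone.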
Reed
And to come full circle,
After this whole saga, I now have a scrub error on the new device health metrics pool/PG, in what looks to be exactly the same way as before. So I am at a loss as to what I am doing incorrectly, and a standing scrub error obviously does not make the monitoring suite very happy.
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 33.0 is active+clean+inconsistent, acting [12,138,15]

$ rados list-inconsistent-pg device_health_metrics
["33.0"]

$ rados list-inconsistent-obj 33.0 | jq
{
  "epoch": 176348,
  "inconsistents": []
}
I assume that this is the root cause:

ceph.log.5.gz:2019-09-18 11:12:16.466118 osd.138 (osd.138) 154 : cluster [WRN] bad locator @33 on object @33 op osd_op(client.1769585636.0:466 33.0 33:b08b92bd::::head [omap-set-vals] snapc 0=[] ondisk+write+known_if_redirected e176327) v8
ceph.log.1.gz:2019-09-22 20:41:44.937841 osd.12 (osd.12) 53 : cluster [DBG] 33.0 scrub starts
ceph.log.1.gz:2019-09-22 20:41:45.000638 osd.12 (osd.12) 54 : cluster [ERR] 33.0 scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
ceph.log.1.gz:2019-09-22 20:41:45.000643 osd.12 (osd.12) 55 : cluster [ERR] 33.0 scrub 1 errors
Nothing fancy set for the plugin:

$ ceph config dump | grep device
global  basic     device_failure_prediction_mode      local
mgr     advanced  mgr/devicehealth/enable_monitoring  true
Reed
And to provide a further update,
I was able to get the OSDs to boot by updating them from 14.2.2 to 14.2.4. It's unclear why this would improve things, but it at least got me running again.
$ ceph versions
{
    "mon": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 199,
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
    },
    "mds": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 206,
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 5
    }
}
Reed
To answer the question of whether it is safe to disable the module and delete the pool: the answer is no.
After disabling the diskprediction_local module, I then proceeded to remove the pool created by the module, device_health_metrics.
This is where things went south quickly. Ceph health showed:

Module 'devicehealth' has failed: [errno 2] Failed to operate write op for oid SAMSUNG_$MODEL_$SERIAL
That module apparently can't be disabled:

$ ceph mgr module disable devicehealth
Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
Then 5 OSDs went down, crashing with:

-12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 lpr=176304 pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 peering m=17 mbc={}] enter Started/Primary/Peering/WaitUpThru
-11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 ms_handle_reset con 0x564078474d00 session 0x56407878ea00
-10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac1b00
 -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac3180
 -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: handle_auth_request added challenge on 0x564077ac3600
 -7> 2019-09-18 10:53:00.307 7f95950ae700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
 -6> 2019-09-18 10:53:00.307 7f95950ae700  0 _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "30.0_head",
            "oid": "#30:00000000::::head#"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "30.0_head"
        }
    ]
}
 -5> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
 -4> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] enter Started/Primary/Peering/GetMissing
 -3> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] exit Started/Primary/Peering/GetMissing 0.000019 0 0.000000
 -2> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 mbc={}] enter Started/Primary/Peering/WaitUpThru
 -1> 2019-09-18 10:53:00.315 7f95950ae700 -1 /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f95950ae700 time 2019-09-18 10:53:00.312755
/build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: ceph_abort_msg("unexpected error")
Of the 5 OSDs now down, 3 of them are the serving OSDs for pg 30.0 (which has now been erased):

OSD_DOWN 5 osds down
    osd.5 is down
    osd.12 is down
    osd.128 is down
    osd.183 is down
    osd.190 is down
But osd.190 and osd.5 were never acting members for that PG, so I have no clue why they are implicated.
I re-enabled the module, which cleared the health error about devicehealth (not that it matters much to me), but that didn't solve the issue of the down OSDs. I am hoping there is a way to mark this PG as lost, or something along those lines, so I don't have to rebuild the affected OSDs entirely.
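One avenue I've been eyeing, but have not tried (so treat this purely as a sketch, and a last resort; the PG id 30.0, OSD id, and data path are from my setup) is removing the leftover on-disk shard of the deleted PG with ceph-objectstore-tool while the OSD daemon is stopped, since the crash is happening when the OSD tries to remove the 30.0_head collection itself:

```shell
# The OSD must be stopped before touching its store, e.g. for osd.12:
systemctl stop ceph-osd@12

# Remove the remnant of the already-deleted PG 30.0 from this OSD's store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 30.0 --op remove --force

# Then try starting the OSD again
systemctl start ceph-osd@12
```

Again, this is untested on my cluster and operates directly on the OSD's BlueStore, so I'd want confirmation from someone who knows the tool before running it.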
Any help is appreciated.
Reed
Trying to narrow down a strange issue with the single PG of the device_health_metrics pool, which was created when I enabled the 'diskprediction_local' module in the ceph-mgr. The PG keeps being flagged inconsistent, but I never see any inconsistent objects in it.
$ ceph health detail
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 30.0 is active+clean+inconsistent, acting [128,12,183]

$ rados list-inconsistent-pg device_health_metrics
["30.0"]

$ rados list-inconsistent-obj 30.0 | jq
{
  "epoch": 172979,
  "inconsistents": []
}
This is the most recent log message from osd.128, during the last deep scrub:

2019-09-12 18:07:19.436 7f977744a700 -1 log_channel(cluster) log [ERR] : 30.0 deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.

Here is a pg query on the one PG:
The data I have collected hasn't been useful at all, and I don't particularly care if I lose it. Would it be feasible (i.e. no bad effects) to just disable the disk prediction module, delete the pool, and start over, letting the module create a new pool for itself?
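Concretely, the sequence I have in mind is something like the following (untested, and it assumes pool deletion is permitted on the cluster, i.e. mon_allow_pool_delete is set to true):

```shell
# Stop the plugin that feeds the pool
ceph mgr module disable diskprediction_local

# Delete the pool the module created; pool name is given twice
# as a safety confirmation
ceph osd pool rm device_health_metrics device_health_metrics \
    --yes-i-really-really-mean-it

# Later, re-enable the module and let it recreate the pool from scratch
ceph mgr module enable diskprediction_local
```

These need a live cluster with admin privileges, so they are an operational fragment rather than something runnable standalone.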
Thanks,
Reed