OSD SCRUB Error recovery


 



Hi, I am seeing an issue on one of our older Ceph clusters (Mimic 13.2.1), in an erasure-coded pool on BlueStore OSDs, where we have 1 inconsistent pg and 1 scrub error. It should be noted that there is an ongoing rebalance of misplaced data that predates this issue; it resulted from flapping OSDs caused by OSD_NEARFULL / OSD_TOOFULL warnings/errors, which we corrected by removing some user data through Ceph's RGW/S3 API (the users' S3 objects were deleted via the S3 API).

 

If anyone has any suggestions or guidance for dealing with this, it would be very much appreciated. I've included all the relevant/helpful information I can think of below; if there is any additional information that would help you (or me) in making suggestions, please let me know.

 

  $ sudo ceph -s
    cluster:
      id:     6fa7ec72-79fb-4f45-8b9f-ea5cdc7ab18d
      health: HEALTH_ERR
              248317/437145405 objects misplaced (0.057%)
              1 scrub errors
              Possible data damage: 1 pg inconsistent

    services:
      mon: 3 daemons, quorum HW-CEPHM-AT01,HW-CEPHM-AT02,HW-CEPHM-AT03
      mgr: HW-CEPHM-AT02(active)
      osd: 109 osds: 107 up, 106 in; 2 remapped pgs
      rgw: 3 daemons active

    data:
      pools:   10 pools, 1380 pgs
      objects: 54.70 M objects, 68 TiB
      usage:   116 TiB used, 169 TiB / 285 TiB avail
      pgs:     248317/437145405 objects misplaced (0.057%)
               1374 active+clean
               3    active+clean+scrubbing+deep
               2    active+remapped+backfilling
               1    active+clean+inconsistent

    io:
      client:   28 KiB/s rd, 306 KiB/s wr, 26 op/s rd, 30 op/s wr
      recovery: 6.2 MiB/s, 4 objects/s

 

    $ sudo ceph health detail
    HEALTH_ERR 247241/437143405 objects misplaced (0.057%); 1 scrub errors; Possible data damage: 1 pg inconsistent
    OBJECT_MISPLACED 247241/437143405 objects misplaced (0.057%)
    OSD_SCRUB_ERRORS 1 scrub errors
    PG_DAMAGED Possible data damage: 1 pg inconsistent
        pg 7.1 is active+clean+inconsistent, acting [2,57,51,15,20,28,9,39]

 

Examination of the OSD logs shows the error is on osd.2:

 

    zgrep -Hn 'ERR' ceph-osd.2.log-20200614.gz
    ceph-osd.2.log-20200614.gz:1292:2020-06-14 03:31:06.572 7f94591a9700 -1 log_channel(cluster) log [ERR] : 7.1s0 deep-scrub stat mismatch, got 213029/213030 objects, 0/0 clones, 213029/213030 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 292308615921/292308670959 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
    ceph-osd.2.log-20200614.gz:1293:2020-06-14 03:31:06.572 7f94591a9700 -1 log_channel(cluster) log [ERR] : 7.1 deep-scrub 1 errors
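
(For reference, the mismatch amounts to a single object and 55038 bytes: 213029 vs 213030 objects and 292308615921 vs 292308670959 bytes. I am reading each "got X/Y" pair as what the deep-scrub counted versus what the PG stats record, which I believe is how these lines are printed.)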

 

All other OSDs appear to be clean of errors.
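
For completeness, this is roughly the check I ran on each OSD host (it assumes the default /var/log/ceph log location; adjust the path if yours differs):

    # Look for ERR lines in every OSD log on this host
    # (zgrep reads both plain and gzipped rotated logs)
    for f in /var/log/ceph/ceph-osd.*.log*; do
        sudo zgrep -l 'ERR' "$f"
    done

Only osd.2's logs matched.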

 

The pg in question (7.1) has been instructed to repair/scrub/deep-scrub, but I do not see any indication in its logs that it has actually run a scrub or repair (it does log a deep-scrub, which comes back OK), and listing inconsistent objects seems to indicate no issues:

 

  $ sudo rados list-inconsistent-pg default.rgw.buckets.data
  ["7.1"]

  $ sudo ceph pg repair 7.1
  instructing pg 7.1s0 on osd.2 to repair

  $ sudo ceph pg scrub 7.1
  instructing pg 7.1s0 on osd.2 to scrub

  $ sudo ceph pg deep-scrub 7.1
  instructing pg 7.1s0 on osd.2 to deep-scrub

  grep -HnEi 'scrub|repair|deep-scrub' ceph-osd.2.log
  ceph-osd.2.log:118:2020-06-14 07:28:10.139 7f94599aa700  0 log_channel(cluster) log [DBG] : 7.91 deep-scrub starts
  ceph-osd.2.log:177:2020-06-14 08:39:11.404 7f94599aa700  0 log_channel(cluster) log [DBG] : 7.91 deep-scrub ok
  ceph-osd.2.log:322:2020-06-14 12:17:31.405 7f94579a6700  0 log_channel(cluster) log [DBG] : 13.135 deep-scrub starts
  ceph-osd.2.log:323:2020-06-14 12:17:32.744 7f94579a6700  0 log_channel(cluster) log [DBG] : 13.135 deep-scrub ok
  ceph-osd.2.log:387:2020-06-14 13:40:35.941 7f94591a9700  0 log_channel(cluster) log [DBG] : 7.d8 deep-scrub starts
  ceph-osd.2.log:441:2020-06-14 14:49:06.111 7f94591a9700  0 log_channel(cluster) log [DBG] : 7.d8 deep-scrub ok

 

Only the last deep-scrub shown above was manually triggered.
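
For completeness, this is what I intend to check next to confirm nothing is blocking the manual scrub/repair (standard CLI; the grep patterns match the field names I see in pg query output):

    # Make sure noscrub / nodeep-scrub flags are not set cluster-wide
    sudo ceph osd dump | grep flags

    # Check the scrub stamps recorded for pg 7.1
    sudo ceph pg 7.1 query | grep -E 'last_scrub_stamp|last_deep_scrub_stamp'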

 

  $ sudo rados list-inconsistent-obj 7.1 --format=json-pretty
  {
      "epoch": 30869,
      "inconsistents": []
  }

  $ sudo rados list-inconsistent-obj 7.1s0 --format=json-pretty
  {
      "epoch": 30869,
      "inconsistents": []
  }

I'm not sure why no inconsistent objects (an empty set) are reported in the above.
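
What I plan to do next is watch osd.2's log while re-issuing the repair, to see whether it ever actually starts on 7.1; a simple watch along these lines, assuming the default log path on the osd.2 host:

    # Follow osd.2's log for scrub/repair activity on pg 7.1 (shard s0 or the whole pg)
    sudo tail -f /var/log/ceph/ceph-osd.2.log | grep -E ' 7\.1(s0)? (deep-)?scrub| 7\.1(s0)? repair'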

Chris Shultz
Global Systems Architect





