Hi Frank,
I am encountering exactly the same issue, with
the same disks as yours. Every day, after a batch of deep
scrubbing operations, there are generally between 1 and 3
inconsistent pgs, and on different OSDs each time.
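If you want to compare scrub scheduling, the settings in effect can be
read from an OSD's admin socket on its host; osd.78 below is just one of
mine, taken as an example:

# Scrub-related settings in effect on this OSD (run on the OSD's host):
$ sudo ceph daemon osd.78 config show | grep scrub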
This could point to a problem with these
disks, but:
- it concerns only the pgs of the rbd
pool, not those of the cephfs pools (the same disk model is used for both)
- I encountered this while running
12.2.5, not after upgrading to 12.2.8, but the problem appeared
again after the upgrade to 12.2.10
- on my side, smartctl and dmesg do not
show any media errors (the checks are sketched below), so I'm pretty
sure the physical media is not at fault...
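For reference, the checks I mean are along these lines; the device name
and the megaraid id below are examples, not my exact values:

# SMART data straight from a disk (the JBOD / pass-through case):
$ sudo smartctl -a /dev/sdb

# Behind a PERC/MegaRAID single-drive RAID0 the physical disk is hidden,
# so the pass-through device type is needed:
$ sudo smartctl -a -d megaraid,0 /dev/sdb

# Kernel-side I/O or medium errors:
$ dmesg -T | grep -iE 'i/o error|medium error|blk_update'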
One detail: each disk is
configured as a single-drive RAID0 on a PERC H740P. Is this also the
case for you, or are your disks in JBOD mode?
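On my side I check the controller with perccli; a sketch, assuming the
usual Dell install path (the syntax follows storcli, adapt if yours
differs):

# Controller summary, including the RAID level of each virtual disk:
$ sudo /opt/MegaRAID/perccli/perccli64 /c0 show

# Per-physical-drive state and media error counters, which the OS cannot
# see when each disk is wrapped in a single-drive RAID0:
$ sudo /opt/MegaRAID/perccli/perccli64 /c0 /eall /sall show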
Another question: in your case, is the
OSD involved in the inconsistent pgs always the same one, or is it a
different one every time?
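To answer that kind of question over several days, I simply append the
acting sets to a file after each occurrence (the log path is only an
example):

# Record today's inconsistent pgs together with their acting sets:
$ ( date; sudo ceph health detail | grep 'inconsistent, acting' ) >> ~/inconsistent-history.log

# The inconsistent pgs of a pool can also be listed directly
# (pool name 'rbd' assumed):
$ sudo rados list-inconsistent-pg rbd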
For information, so far a manual
'ceph pg repair' has worked well each time...
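If you want to see which shard actually carries the read_error before
repairing, a jq filter along these lines works on the json output (field
names as I see them on 12.2.10, adjust if yours differ):

$ sudo rados list-inconsistent-obj 9.27 --format=json-pretty \
    | jq -r '.inconsistents[] | .object.name as $o
             | .shards[] | select(.errors | length > 0)
             | "\($o) osd.\(.osd) \(.errors)"'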
Context: Luminous 12.2.10, BlueStore
OSDs with the data block on SATA disks and WAL/DB on NVMe, rbd
pool configured with replica 3/2.
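The 3/2 figure is just what the pool itself reports (pool name 'rbd'
assumed):

# Replica count, and minimum replicas required to serve I/O:
$ sudo ceph osd pool get rbd size
$ sudo ceph osd pool get rbd min_size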
Cheers,
rv

A few outputs:
$ sudo ceph -s
  cluster:
    id:     838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs:     4221 active+clean
             3    active+clean+inconsistent

  io:
    client: 2.62KiB/s rd, 2.25MiB/s wr, 0op/s rd, 118op/s wr

$ sudo ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
    pg 9.27 is active+clean+inconsistent, acting [78,107,96]
    pg 9.260 is active+clean+inconsistent, acting [84,113,62]
    pg 9.6b9 is active+clean+inconsistent, acting [79,107,80]

$ sudo rados list-inconsistent-obj 9.27 --format=json-pretty | grep error
    "errors": [],
    "union_shard_errors": [
        "read_error"
            "errors": [
                "read_error"
            "errors": [],
            "errors": [],

$ sudo rados list-inconsistent-obj 9.260 --format=json-pretty | grep error
    "errors": [],
    "union_shard_errors": [
        "read_error"
            "errors": [],
            "errors": [],
            "errors": [
                "read_error"

$ sudo rados list-inconsistent-obj 9.6b9 --format=json-pretty | grep error
    "errors": [],
    "union_shard_errors": [
        "read_error"
            "errors": [
                "read_error"
            "errors": [],
            "errors": [],

$ sudo ceph pg repair 9.27
instructing pg 9.27 on osd.78 to repair
$ sudo ceph pg repair 9.260
instructing pg 9.260 on osd.84 to repair
$ sudo ceph pg repair 9.6b9
instructing pg 9.6b9 on osd.79 to repair

$ sudo ceph -s
  cluster:
    id:     838506b7-e0c6-4022-9e17-2d1cf9458be6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum inf-ceph-mon0,inf-ceph-mon1,inf-ceph-mon2
    mgr: inf-ceph-mon0(active), standbys: inf-ceph-mon1, inf-ceph-mon2
    mds: cephfs_home-2/2/2 up {0=inf-ceph-mon1=up:active,1=inf-ceph-mon0=up:active}, 1 up:standby
    osd: 126 osds: 126 up, 126 in

  data:
    pools:   3 pools, 4224 pgs
    objects: 23.35M objects, 20.9TiB
    usage:   64.9TiB used, 136TiB / 201TiB avail
    pgs:     4224 active+clean

  io:
    client: 195KiB/s rd, 7.19MiB/s wr, 17op/s rd, 127op/s wr

On 19/12/2018 at 04:48, Frank Ritchie wrote: