Re: "ceph pg scrub" does not start

Hi Sean,

Many thanks for the suggestion, but unfortunately deep-scrub also
appears to be ignored:

# ceph pg deep-scrub 4.ff
instructing pg 4.ffs0 on osd.318 to deep-scrub

'tail -f ceph-osd.318.log' shows no new entries.
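
As a sanity check, something like the following should show whether the
noscrub / nodeep-scrub flags are set, or whether the OSD has unusual scrub
settings (the grep is only illustrative):

# ceph osd dump | grep flags
# ceph daemon osd.318 config show | grep scrub

If either flag were set, that alone would block scrubs from starting.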

To get more info, I set debug level 10 on the OSD and issued another
repair command:

# ceph daemon osd.318 config set debug_osd 10
# ceph pg repair 4.ff
instructing pg 4.ffs0 on osd.318 to repair
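
(The debug level can be confirmed with "ceph daemon osd.318 config get
debug_osd", in case the set didn't take effect.)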

Tailing the OSD log showed what might be an appropriate response:

2018-07-04 13:54:44.181 7faaaeaa8700 10 osd.318 pg_epoch: 180138
pg[4.ffs0( v 180138'5043225 (180078'5040201,180138'5043225]
local-lis/les=179843/179844 n=124423 ec=735/735 lis/c 179843/179843
les/c/f 179844/180011/0 179841/179843/174426)
[318,403,150,13,225,261,382,175,282,324]p318(0) r=0 lpr=179843
crt=180138'5043225 lcod 180138'5043224 mlcod 180138'5043224
active+clean+inconsistent MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB ps=926]
state<Started/Primary>: marking for scrub

However, the scrub still doesn't start...
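
One thing I haven't ruled out is the scrub scheduler itself. As far as I
understand it, osd_max_scrubs, the scrub time window and
osd_scrub_during_recovery can all hold a scrub back, though I'm not sure
which of them apply to an explicitly requested scrub. They can be read
from the admin socket with something like:

# ceph daemon osd.318 config get osd_max_scrubs
# ceph daemon osd.318 config get osd_scrub_during_recovery
# ceph daemon osd.318 config get osd_scrub_begin_hour
# ceph daemon osd.318 config get osd_scrub_end_hour

Given the cluster currently has misplaced objects, osd_scrub_during_recovery
could be relevant, but that is only a guess.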

# ceph pg 4.ff query
shows .....
"last_deep_scrub_stamp": "2018-07-01 18:00:41.769956",
                "last_clean_scrub_stamp": "2018-06-27 05:55:13.023760",
                    "num_scrub_errors": 23,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 23,

"scrub": {
                "scrubber.epoch_start": "178857",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []

Not sure where to go from here :(
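
About the only thing I can think of trying next is raising the debug level
further and grepping for anything scrub-related, roughly:

# ceph daemon osd.318 config set debug_osd 20
# ceph pg deep-scrub 4.ff
# grep -i scrub /var/log/ceph/ceph-osd.318.log | tail -50

(assuming the default log location under /var/log/ceph/; debug 20 is very
noisy, so I'd put it back to the default afterwards)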

Jake

On 04/07/18 01:14, Sean Redmond wrote:
> do a deep-scrub instead of just a scrub
> 
> On Tue, 3 Jul 2018, 12:37 Jake Grimmett, <jog@xxxxxxxxxxxxxxxxx> wrote:
> 
>     Dear All,
> 
>     Sorry to bump the thread, but I still can't manually repair inconsistent
>     pgs on our Mimic cluster (13.2.0, upgraded from 12.2.5)
> 
>     There are many similarities to an unresolved bug:
> 
>     http://tracker.ceph.com/issues/15781
> 
>     To give more examples of the problem:
> 
>     The following commands appear to run OK, but *nothing* appears in the
>     OSD log to indicate that the commands are running. The OSDs are
>     otherwise working & logging OK.
> 
>     # ceph pg scrub 4.e19
>     instructing pg 4.e19s0 on osd.246 to scrub
> 
>     # ceph pg repair 4.e19
>     instructing pg 4.e19s0 on osd.246 to repair
> 
>     # ceph osd scrub 246
>     instructed osd(s) 246 to scrub
> 
>     # ceph osd repair 246
>     instructed osd(s) 246 to repair
> 
>     It does not matter which OSD or pg the repair is initiated on.
> 
>     This command also fails:
>     # rados list-inconsistent-obj 4.e19
>     No scrub information available for pg 4.e19
>     error 2: (2) No such file or directory
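> 
>     Side note: as far as I know, list-inconsistent-obj only reports the
>     results of the most recent deep-scrub, so this error probably just
>     means no usable scrub results are stored for that pg. A rough way to
>     list which pgs do have inconsistency info (with <pool> standing in
>     for the name of pool 4) would be:
> 
>     # rados list-inconsistent-pg <pool>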
> 
>     From the OSD logs and 'ceph -s' I can see that the OSDs are still
>     doing automatic background pg scrubs, just not the ones I have asked
>     them to do; at the time of my request they are not scrubbing anything
>     else.
> 
>     Could it be that my commands are not being sent to the OSDs?
> 
>     Any idea on how to debug this?
> 
>     ...
> 
>     Further info:
> 
>     Output of 'ceph pg 4.e19 query' is here:
>     http://p.ip.fi/9x5v
> 
>     Output of 'ceph daemon osd.246 config show' is here
>     http://p.ip.fi/RAuk
> 
>     Cluster has 10 nodes, 128 GB RAM, dual Xeon
>     450 BlueStore SATA OSDs, EC 8:2
>     4 NVMe OSDs, replicated
>     used for CephFS (2.3 PB), daily snapshots only
> 
>     # ceph health detail
>     HEALTH_ERR 9500031/5149746146 objects misplaced (0.184%); 80 scrub
>     errors; Possible data damage: 7 pgs inconsistent
>     OBJECT_MISPLACED 9500031/5149746146 objects misplaced (0.184%)
>     OSD_SCRUB_ERRORS 80 scrub errors
>     PG_DAMAGED Possible data damage: 7 pgs inconsistent
>         pg 4.ff is active+clean+inconsistent, acting
>     [318,403,150,13,225,261,382,175,282,324]
>         pg 4.2e2 is active+clean+inconsistent, acting
>     [352,59,328,451,195,119,42,66,158,150]
>         pg 4.551 is active+clean+inconsistent, acting
>     [391,105,124,150,205,22,269,184,293,91]
>         pg 4.61c is active+clean+inconsistent, acting
>     [382,131,84,35,282,214,236,366,309,150]
>         pg 4.8cd is active+clean+inconsistent, acting
>     [353,58,5,252,187,183,323,150,387,32]
>         pg 4.a20 is active+clean+inconsistent, acting
>     [346,104,398,282,225,133,150,70,165,17]
>         pg 4.e19 is active+clean+inconsistent, acting
>     [246,447,245,98,170,348,111,155,150,295]
> 
>     again, thanks for any advice,
> 
>     Jake
> 


-- 
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539