Re: ceph pg repair fails...?

Brad Hubbard <bhubbard@xxxxxxxxxx> · Wed, 2 Oct 2019 12:13:10 +1000



On Wed, Oct 2, 2019 at 1:15 AM Mattia Belluco <mattia.belluco@xxxxxx> wrote:
>
> Hi Jake,
>
> I am curious to see if your problem is similar to ours (despite the fact
> we are still on Luminous).
>
> Could you post the output of:
>
> rados list-inconsistent-obj <PG_NUM>
>
> and
>
> rados list-inconsistent-snapset <PG_NUM>

Make sure you scrub the pg before running these commands.
Take a look at the information in http://tracker.ceph.com/issues/24994
for hints on how to proceed.
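
A minimal sketch of that sequence, in case it helps. The assumptions here are mine: I use a deep scrub, 2.2a7 (one of the PGs from Jake's report) stands in for the PG id, and the small run() wrapper is a hypothetical helper that only echoes each command unless RUN=1 is set, so the steps can be reviewed before anything touches the cluster.

```shell
#!/bin/sh
# Sketch: re-scrub a PG, then dump its inconsistency reports.
# PGID is a placeholder; 2.2a7 is one of the PGs from the report.
PGID="${PGID:-2.2a7}"

# Hypothetical helper: print the command, and execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Scrub first so the inconsistency data is fresh (deep scrub is an assumption).
run ceph pg deep-scrub "$PGID"

# The scrub runs asynchronously: wait for it to finish (for example by
# watching the primary OSD's log) before querying, or the listing may be
# stale or empty.
run rados list-inconsistent-obj "$PGID" --format=json-pretty
run rados list-inconsistent-snapset "$PGID" --format=json-pretty
```

Run it as-is to see the three commands, then again with RUN=1 and PGID set to the affected PG once the order looks right.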
>
> Thanks,
>
> Mattia
>
> On 10/1/19 1:08 PM, Jake Grimmett wrote:
> > Dear All,
> >
> > I've just found two inconsistent pg that fail to repair.
> >
> > This might be the same bug as shown here:
> >
> > <https://tracker.ceph.com/journals/145118/diff?detail_id=147260>
> >
> > Cluster is running Nautilus 14.2.2
> > OS is Scientific Linux 7.6
> > DB/WAL on NVMe, Data on 12TB HDD
> >
> > Logs below can also be seen here: <http://p.ip.fi/UOUc>
> >
> > [root@ceph-s1 ~]# ceph health detail
> > HEALTH_ERR 22 scrub errors; Possible data damage: 2 pgs inconsistent
> > OSD_SCRUB_ERRORS 22 scrub errors
> > PG_DAMAGED Possible data damage: 2 pgs inconsistent
> >     pg 2.2a7 is active+clean+inconsistent+failed_repair, acting
> > [83,60,133,326,281,162,180,172,144,219]
> >     pg 2.36b is active+clean+inconsistent+failed_repair, acting
> > [254,268,10,262,32,280,211,114,169,53]
> >
> > Issued "pg repair" commands, osd log shows:
> > [root@ceph-n10 ~]# grep "2.2a7" /var/log/ceph/ceph-osd.83.log
> > 2019-10-01 07:05:02.459 7f9adab4b700  0 log_channel(cluster) log [DBG] :
> > 2.2a7 repair starts
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 83(0) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 60(1) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 133(2) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 144(8) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 162(5) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 172(7) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 180(6) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 219(9) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 281(4) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 shard 326(3) soid 2:e5472cab:::1000702081f.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 soid 2:e5472cab:::1000702081f.00000000:head : failed to pick
> > suitable object info
> > 2019-10-01 07:11:41.589 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > repair 2.2a7s0 2:e5472cab:::1000702081f.00000000:head : on disk size
> > (4096) does not match object info size (0) adjusted for ondisk to (0)
> > 2019-10-01 07:19:47.060 7f9adab4b700 -1 log_channel(cluster) log [ERR] :
> > 2.2a7 repair 11 errors, 0 fixed
> > [root@ceph-n10 ~]#
> >
> > [root@ceph-s1 ~]#  ceph pg repair 2.36b
> > instructing pg 2.36bs0 on osd.254 to repair
> >
> > [root@ceph-n29 ~]# grep "2.36b" /var/log/ceph/ceph-osd.254.log
> > 2019-10-01 11:15:12.215 7fa01f589700  0 log_channel(cluster) log [DBG] :
> > 2.36b repair starts
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 254(0) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 10(2) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 32(4) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 53(9) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 114(7) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 169(8) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 211(6) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 262(3) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 268(1) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b shard 280(5) soid 2:d6cac754:::100070209f6.00000000:head :
> > candidate size 4096 info size 0 mismatch
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b soid 2:d6cac754:::100070209f6.00000000:head : failed to pick
> > suitable object info
> > 2019-10-01 11:25:12.241 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > repair 2.36bs0 2:d6cac754:::100070209f6.00000000:head : on disk size
> > (4096) does not match object info size (0) adjusted for ondisk to (0)
> > 2019-10-01 11:30:10.573 7fa01f589700 -1 log_channel(cluster) log [ERR] :
> > 2.36b repair 11 errors, 0 fixed
> >
> > Any advice on fixing this would be very welcome!
> >
> > Best regards,
> >
> > Jake
> >
>
>
> --
> Mattia Belluco
> S3IT Services and Support for Science IT
> Office Y11 F 52
> University of Zürich
> Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
> Tel: +41 44 635 42 22
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Cheers,
Brad
