Re: inconsistent PG -> unfound objects on an erasure coded system

On Mon, Mar 7, 2016 at 12:07 PM, Jeffrey McDonald <jmcdonal@xxxxxxx> wrote:
> Hi,
>
> For a while, we've been seeing inconsistent placement groups on our erasure
> coded system.   The placement groups go from a state of active+clean to
> active+clean+inconsistent after a deep scrub:
>
>
> 2016-03-07 13:45:42.044131 7f385d118700 -1 log_channel(cluster) log [ERR] :
> 70.320s0 deep-scrub stat mismatch, got 21446/21428 objects, 0/0 clones,
> 21446/21428 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 64682334170/64624353083 bytes,0/0 hit_set_archive bytes.
> 2016-03-07 13:45:42.044416 7f385d118700 -1 log_channel(cluster) log [ERR] :
> 70.320s0 deep-scrub 18 missing, 0 inconsistent objects
> 2016-03-07 13:45:42.044464 7f385d118700 -1 log_channel(cluster) log [ERR] :
> 70.320 deep-scrub 73 errors
>
> So I tell the placement group to perform a repair:
>
> 2016-03-07 13:49:26.047177 7f385d118700  0 log_channel(cluster) log [INF] :
> 70.320 repair starts
> 2016-03-07 13:49:57.087291 7f3858b0a700  0 -- 10.31.0.2:6874/13937 >>
> 10.31.0.6:6824/8127 pipe(0x2e578000 sd=697 :6874
>
> The repair finds missing shards and repairs them, but then I have 18
> 'unfound objects' :
>
>
> 2016-03-07 13:51:28.467590 7f385d118700 -1 log_channel(cluster) log [ERR] :
> 70.320s0 repair stat mismatch, got 21446/21428 objects, 0/0 clones,
> 21446/21428 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts,
> 64682334170/64624353083 bytes,0/0 hit_set_archive bytes.
> 2016-03-07 13:51:28.468358 7f385d118700 -1 log_channel(cluster) log [ERR] :
> 70.320s0 repair 18 missing, 0 inconsistent objects
> 2016-03-07 13:51:28.469431 7f385d118700 -1 log_channel(cluster) log [ERR] :
> 70.320 repair 73 errors, 73 fixed
>
>
> I've traced one of the unfound objects all the way through the system and
> found that it is not really lost.  I can fail over the OSD and recover the
> files.  This is happening quite regularly now, after a large migration of
> data from old hardware to new (the migration is now complete).
>
> The system sets the PG to 'recovering', but we've seen it stay in that
> state for many days.  Should we just be patient, or do we need to dig
> further into the issue?
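
(For context, a rough sketch of how an inconsistent PG like this is usually
inspected and repaired from the ceph CLI; the pg id 70.320 is taken from the
logs above, and exact output and behaviour depend on the release in use.)

    # List scrub errors and inconsistent PGs reported by the cluster
    ceph health detail

    # Re-run a deep scrub on the affected placement group
    ceph pg deep-scrub 70.320

    # Ask the primary OSD to attempt a repair of the placement group
    ceph pg repair 70.320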
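
(Likewise, when a repair leaves objects unfound, a common next step, again
release-dependent, is to ask the PG which objects it considers missing and
which OSDs it has probed for them.)

    # Show the objects this PG considers unfound/missing
    ceph pg 70.320 list_missing

    # Dump the PG's full peering and recovery state, including
    # might_have_unfound and current recovery progress
    ceph pg 70.320 query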

You may need to dig into this more, although I'm not sure what the
issue is likely to be. What version of Ceph are you running? How did
you do this hardware migration?
-Greg
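
(One straightforward way to answer the version question, sketched here, is to
check the local binaries and query every OSD daemon directly; after a hardware
migration this also shows whether mixed versions are running.)

    # Version of the locally installed ceph binaries
    ceph -v

    # Ask every running OSD daemon which version it reports
    ceph tell osd.* version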

>
>
> pg 70.320 is stuck unclean for 704.803040, current state active+recovering,
> last acting [277,101,218,49,304,412]
> pg 70.320 is active+recovering, acting [277,101,218,49,304,412], 18 unfound
>
> There is no indication of down OSDs or of network problems between OSDs.
>
> Thanks,
> Jeff
>
>
> --
>
> Jeffrey McDonald, PhD
> Assistant Director for HPC Operations
> Minnesota Supercomputing Institute
> University of Minnesota Twin Cities
> 599 Walter Library           email: jeffrey.mcdonald@xxxxxxxxxxx
> 117 Pleasant St SE           phone: +1 612 625-6905
> Minneapolis, MN 55455        fax:   +1 612 624-8861
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


