Re: inconsistent PG -> unfound objects on an erasure coded system

Hmm, instead, please rescrub the same pg on the same osds with the
same logging and send the logs again.  There are two objects
inconsistent in the last set of logs (not 18).  I bet in the next
scrub either none are inconsistent, or it's a disjoint set.
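
For reference, that would be the same sequence you ran before (osd list
taken from the acting set of 70.459 below):

for i in 307 210 273 191 132 450; do
    ceph tell osd.$i injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
done
ceph pg deep-scrub 70.459
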
-Sam

On Mon, Mar 7, 2016 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Hmm, at the end of the log, the pg is still inconsistent.  Can you
> attach the output of a ceph pg query on that pg?
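> E.g. (pg id from the thread; redirect to a file if it's long):
>
> # ceph pg 70.459 query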
> -Sam
>
> On Mon, Mar 7, 2016 at 3:05 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> If so, that strongly suggests that the pg was actually never
>> inconsistent in the first place and that the bug is in scrub itself
>> presumably getting confused about an object during a write.  The next
>> step would be to get logs like the above from a pg as it transitions
>> from clean to inconsistent during a scrub.  If it's really a race
>> between scrub and a write, it's probably non-deterministic; you could
>> set logging on a set of osds and continuously scrub any pgs which
>> only map to those osds until you reproduce the problem.
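>>
>> A rough sketch of such a loop (untested; deep-scrub is used here since
>> that's what surfaced the errors, and the pgs_brief column layout and
>> the 10-minute pause are assumptions to adjust for your setup):
>>
>> # the osds we turned logging up on
>> OSDS="307 210 273 191 132 450"
>> while true; do
>>     # deep-scrub every pg whose acting set only uses those osds
>>     ceph pg dump pgs_brief 2>/dev/null |
>>     awk -v osds=" $OSDS " '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {
>>         set = $(NF-1); gsub(/[][]/, "", set);   # acting set, e.g. 307,210,...
>>         n = split(set, a, ","); ok = 1;
>>         for (i = 1; i <= n; i++)
>>             if (index(osds, " " a[i] " ") == 0) ok = 0;
>>         if (ok) print $1;
>>     }' |
>>     while read pg; do ceph pg deep-scrub "$pg"; done
>>     sleep 600   # then check ceph health detail; stop once a pg goes inconsistent
>> done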
>> -Sam
>>
>> On Mon, Mar 7, 2016 at 2:44 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>> So after the scrub, it came up clean?  The inconsistent/missing
>>> objects reappeared?
>>> -Sam
>>>
>>> On Mon, Mar 7, 2016 at 2:33 PM, Jeffrey McDonald <jmcdonal@xxxxxxx> wrote:
>>>> Hi Sam,
>>>>
>>>> I've done as you requested:
>>>>
>>>> pg 70.459 is active+clean+inconsistent, acting [307,210,273,191,132,450]
>>>>
>>>> # for i in 307 210 273 191 132 450 ; do
>>>>> ceph tell osd.$i injectargs  '--debug-osd 20 --debug-filestore 20
>>>>> --debug-ms 1'
>>>>> done
>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>> debug_osd=20/20 debug_filestore=20/20 debug_ms=1/1
>>>>
>>>>
>>>> # date
>>>> Mon Mar  7 16:03:38 CST 2016
>>>>
>>>>
>>>> # ceph pg deep-scrub 70.459
>>>> instructing pg 70.459 on osd.307 to deep-scrub
>>>>
>>>>
>>>>
>>>> Scrub finished around
>>>>
>>>> # date
>>>> Mon Mar  7 16:13:03 CST 2016
>>>>
>>>>
>>>>
>>>>
>>>> I've tar'd and gzipped the files, which can be downloaded from the
>>>> link below.  The logs start a minute or two after 16:00 today.
>>>>
>>>> https://drive.google.com/folderview?id=0Bzz8TrxFvfema2NQUmotd1BOTnM&usp=sharing
>>>>
>>>>
>>>> Oddly (to me anyway), this pg is now active+clean:
>>>>
>>>> # ceph pg dump  | grep 70.459
>>>> dumped all in format plain
>>>> 70.459 21377 0 0 0 0 64515446306 3088 3088 active+clean 2016-03-07 16:26:57.796537 279563'212832 279602:628151 [307,210,273,191,132,450] 307 [307,210,273,191,132,450] 307 279563'212832 2016-03-07 16:12:30.741984 279563'212832 2016-03-07 16:12:30.741984
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Jeff
>>>>
>>>>
>>>> On Mon, Mar 7, 2016 at 4:11 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>>
>>>>> I think the unfound object on repair is fixed by
>>>>> d51806f5b330d5f112281fbb95ea6addf994324e (not in hammer yet).  I
>>>>> opened http://tracker.ceph.com/issues/15002 for the backport and to
>>>>> make sure it's covered in ceph-qa-suite.  No idea at this time why the
>>>>> objects are disappearing though.
>>>>> -Sam
>>>>>
>>>>> On Mon, Mar 7, 2016 at 1:57 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>> > The one just scrubbed and now inconsistent.
>>>>> > -Sam
>>>>> >
>>>>> > On Mon, Mar 7, 2016 at 1:57 PM, Jeffrey McDonald <jmcdonal@xxxxxxx>
>>>>> > wrote:
>>>>> >> Do you want me to enable this for the pg already with unfound objects
>>>>> >> or the placement group just scrubbed and now inconsistent?
>>>>> >> Jeff
>>>>> >>
>>>>> >> On Mon, Mar 7, 2016 at 3:54 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>>>> >>>
>>>>> >>> Can you enable
>>>>> >>>
>>>>> >>> debug osd = 20
>>>>> >>> debug filestore = 20
>>>>> >>> debug ms = 1
>>>>> >>>
>>>>> >>> on all osds in that PG, rescrub, and convey to us the resulting logs?
>>>>> >>> -Sam
>>>>> >>>
>>>>> >>> On Mon, Mar 7, 2016 at 1:36 PM, Jeffrey McDonald <jmcdonal@xxxxxxx>
>>>>> >>> wrote:
>>>>> >>> > Here is a PG which just went inconsistent:
>>>>> >>> >
>>>>> >>> > pg 70.459 is active+clean+inconsistent, acting [307,210,273,191,132,450]
>>>>> >>> >
>>>>> >>> > Attached is the result of a pg query on this.   I will wait for your
>>>>> >>> > feedback before issuing a repair.
>>>>> >>> >
>>>>> >>> > From what I read, the inconsistencies are more likely the result of
>>>>> >>> > ntp, but all nodes have the local ntp master and all are showing sync.
>>>>> >>> >
>>>>> >>> > Regards,
>>>>> >>> > Jeff
>>>>> >>> >
>>>>> >>> > On Mon, Mar 7, 2016 at 3:15 PM, Gregory Farnum <gfarnum@xxxxxxxxxx>
>>>>> >>> > wrote:
>>>>> >>> >>
>>>>> >>> >> [ Keeping this on the users list. ]
>>>>> >>> >>
>>>>> >>> >> Okay, so next time this happens you probably want to do a pg query
>>>>> >>> >> on the PG which has been reported as dirty. I can't help much beyond
>>>>> >>> >> that, but hopefully Kefu or David will chime in once there's a little
>>>>> >>> >> more for them to look at.
>>>>> >>> >> -Greg
>>>>> >>> >>
>>>>> >>> >> On Mon, Mar 7, 2016 at 1:00 PM, Jeffrey McDonald <jmcdonal@xxxxxxx>
>>>>> >>> >> wrote:
>>>>> >>> >> > Hi Greg,
>>>>> >>> >> >
>>>>> >>> >> > I'm running the hammer release of ceph:
>>>>> >>> >> > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>>>>> >>> >> >
>>>>> >>> >> > The hardware migration was performed by just setting the crush weight
>>>>> >>> >> > to zero for the OSDs we wanted to retire.  The system was performing
>>>>> >>> >> > poorly with these older OSDs and we had a difficult time maintaining
>>>>> >>> >> > stability.  The old OSDs are still there, but all of the data has now
>>>>> >>> >> > been migrated to new and/or existing hardware.
>>>>> >>> >> >
>>>>> >>> >> > Thanks,
>>>>> >>> >> > Jeff
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > On Mon, Mar 7, 2016 at 2:56 PM, Gregory Farnum
>>>>> >>> >> > <gfarnum@xxxxxxxxxx>
>>>>> >>> >> > wrote:
>>>>> >>> >> >>
>>>>> >>> >> >> On Mon, Mar 7, 2016 at 12:07 PM, Jeffrey McDonald
>>>>> >>> >> >> <jmcdonal@xxxxxxx>
>>>>> >>> >> >> wrote:
>>>>> >>> >> >> > Hi,
>>>>> >>> >> >> >
>>>>> >>> >> >> > For a while, we've been seeing inconsistent placement groups on our
>>>>> >>> >> >> > erasure coded system.  The placement groups go from a state of
>>>>> >>> >> >> > active+clean to active+clean+inconsistent after a deep scrub:
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> > 2016-03-07 13:45:42.044131 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 deep-scrub stat mismatch, got 21446/21428 objects, 0/0 clones, 21446/21428 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 64682334170/64624353083 bytes,0/0 hit_set_archive bytes.
>>>>> >>> >> >> > 2016-03-07 13:45:42.044416 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 deep-scrub 18 missing, 0 inconsistent objects
>>>>> >>> >> >> > 2016-03-07 13:45:42.044464 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320 deep-scrub 73 errors
>>>>> >>> >> >> >
>>>>> >>> >> >> > So I tell the placement group to perform a repair:
>>>>> >>> >> >> >
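>>>>> >>> >> >> > (the repair itself was issued with something like:
>>>>> >>> >> >> >
>>>>> >>> >> >> > # ceph pg repair 70.320
>>>>> >>> >> >> >
>>>>> >>> >> >> > and the osd then logs the following)
>>>>> >>> >> >> >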
>>>>> >>> >> >> > 2016-03-07 13:49:26.047177 7f385d118700  0 log_channel(cluster) log [INF] : 70.320 repair starts
>>>>> >>> >> >> > 2016-03-07 13:49:57.087291 7f3858b0a700  0 -- 10.31.0.2:6874/13937 >> 10.31.0.6:6824/8127 pipe(0x2e578000 sd=697 :6874
>>>>> >>> >> >> >
>>>>> >>> >> >> > The repair finds missing shards and repairs them, but then I have
>>>>> >>> >> >> > 18 'unfound objects':
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> > 2016-03-07 13:51:28.467590 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 repair stat mismatch, got 21446/21428 objects, 0/0 clones, 21446/21428 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 64682334170/64624353083 bytes,0/0 hit_set_archive bytes.
>>>>> >>> >> >> > 2016-03-07 13:51:28.468358 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320s0 repair 18 missing, 0 inconsistent objects
>>>>> >>> >> >> > 2016-03-07 13:51:28.469431 7f385d118700 -1 log_channel(cluster) log [ERR] : 70.320 repair 73 errors, 73 fixed
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> > I've traced one of the unfound objects all the way through the system
>>>>> >>> >> >> > and I've found that they are not really lost.  I can fail over the osd
>>>>> >>> >> >> > and recover the files.  This is happening quite regularly now after a
>>>>> >>> >> >> > large migration of data from old hardware to new (the migration is now
>>>>> >>> >> >> > complete).
>>>>> >>> >> >> >
>>>>> >>> >> >> > The system sets the PG into 'recovery', but we've seen the system in a
>>>>> >>> >> >> > recovering state for many days.  Should we just be patient or do we
>>>>> >>> >> >> > need to dig further into the issue?
>>>>> >>> >> >>
>>>>> >>> >> >> You may need to dig into this more, although I'm not sure what the
>>>>> >>> >> >> issue is likely to be. What version of Ceph are you running? How did
>>>>> >>> >> >> you do this hardware migration?
>>>>> >>> >> >> -Greg
>>>>> >>> >> >>
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> > pg 70.320 is stuck unclean for 704.803040, current state active+recovering, last acting [277,101,218,49,304,412]
>>>>> >>> >> >> > pg 70.320 is active+recovering, acting [277,101,218,49,304,412], 18 unfound
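>>>>> >>> >> >> >
>>>>> >>> >> >> > (the individual unfound objects can be listed with, e.g.,
>>>>> >>> >> >> > "ceph pg 70.320 list_missing")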
>>>>> >>> >> >> >
>>>>> >>> >> >> > There is no indication of any problems with down OSDs or network
>>>>> >>> >> >> > issues with OSDs.
>>>>> >>> >> >> >
>>>>> >>> >> >> > Thanks,
>>>>> >>> >> >> > Jeff
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> > --
>>>>> >>> >> >> >
>>>>> >>> >> >> > Jeffrey McDonald, PhD
>>>>> >>> >> >> > Assistant Director for HPC Operations
>>>>> >>> >> >> > Minnesota Supercomputing Institute
>>>>> >>> >> >> > University of Minnesota Twin Cities
>>>>> >>> >> >> > 599 Walter Library           email:
>>>>> >>> >> >> > jeffrey.mcdonald@xxxxxxxxxxx
>>>>> >>> >> >> > 117 Pleasant St SE           phone: +1 612 625-6905
>>>>> >>> >> >> > Minneapolis, MN 55455        fax:   +1 612 624-8861
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> >
>>>>> >>> >> >> > _______________________________________________
>>>>> >>> >> >> > ceph-users mailing list
>>>>> >>> >> >> > ceph-users@xxxxxxxxxxxxxx
>>>>> >>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> >>> >> >> >
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > --
>>>>> >>> >> >
>>>>> >>> >> > Jeffrey McDonald, PhD
>>>>> >>> >> > Assistant Director for HPC Operations
>>>>> >>> >> > Minnesota Supercomputing Institute
>>>>> >>> >> > University of Minnesota Twin Cities
>>>>> >>> >> > 599 Walter Library           email: jeffrey.mcdonald@xxxxxxxxxxx
>>>>> >>> >> > 117 Pleasant St SE           phone: +1 612 625-6905
>>>>> >>> >> > Minneapolis, MN 55455        fax:   +1 612 624-8861
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >
>>>>> >>> >
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > --
>>>>> >>> >
>>>>> >>> > Jeffrey McDonald, PhD
>>>>> >>> > Assistant Director for HPC Operations
>>>>> >>> > Minnesota Supercomputing Institute
>>>>> >>> > University of Minnesota Twin Cities
>>>>> >>> > 599 Walter Library           email: jeffrey.mcdonald@xxxxxxxxxxx
>>>>> >>> > 117 Pleasant St SE           phone: +1 612 625-6905
>>>>> >>> > Minneapolis, MN 55455        fax:   +1 612 624-8861
>>>>> >>> >
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > _______________________________________________
>>>>> >>> > ceph-users mailing list
>>>>> >>> > ceph-users@xxxxxxxxxxxxxx
>>>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> >>> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >>
>>>>> >> Jeffrey McDonald, PhD
>>>>> >> Assistant Director for HPC Operations
>>>>> >> Minnesota Supercomputing Institute
>>>>> >> University of Minnesota Twin Cities
>>>>> >> 599 Walter Library           email: jeffrey.mcdonald@xxxxxxxxxxx
>>>>> >> 117 Pleasant St SE           phone: +1 612 625-6905
>>>>> >> Minneapolis, MN 55455        fax:   +1 612 624-8861
>>>>> >>
>>>>> >>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Jeffrey McDonald, PhD
>>>> Assistant Director for HPC Operations
>>>> Minnesota Supercomputing Institute
>>>> University of Minnesota Twin Cities
>>>> 599 Walter Library           email: jeffrey.mcdonald@xxxxxxxxxxx
>>>> 117 Pleasant St SE           phone: +1 612 625-6905
>>>> Minneapolis, MN 55455        fax:   +1 612 624-8861
>>>>
>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


