Hi Sam,
I was prepared to write in and say that the problem had gone away. I tried
restarting several OSDs last night in the hopes of capturing the problem on an
OSD that hadn't failed yet, but didn't have any luck. So I did indeed re-create
the cluster from scratch (using mkcephfs), and what do you know -- everything
worked. I got everything into a nice stable state, then decided to do a full
cluster restart, just to be sure. Sure enough, one OSD failed to come up, and
it has the same stack trace. So I believe I have the log you want -- just from
the OSD that failed, right?

On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
Hi Sam,
No problem, I'll leave that debugging turned up high, and do a mkcephfs from scratch and see what happens. Not sure if it will happen again or not. =)
Thanks again.
- Travis
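For reference, a from-scratch rebuild of the kind described above might look
roughly like the following. This is only a sketch: it assumes the stock
/etc/ceph/ceph.conf and keyring paths, and that mkcephfs can reach every host
over ssh (the -a flag).

    # stop every daemon in the cluster
    service ceph -a stop
    # re-create the cluster's daemons and data stores from the cluster-wide
    # conf (this destroys any existing data)
    mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.keyring
    # bring the whole cluster back up
    service ceph -a start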
On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
Hmm, I need logging from when the corruption happened. If this is
reproducible, can you enable that logging on a clean osd (or better, a
clean cluster) until the assert occurs?
-Sam
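A minimal sketch of what enabling that logging could look like, assuming the
OSD hosts share the stock /etc/ceph/ceph.conf and use the default log
locations. The three settings are the ones Sam lists further down in the
thread:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # restart one clean OSD and watch its log until the assert appears,
    # e.g. for osd.1:
    service ceph restart osd.1
    tail -f /var/log/ceph/ceph-osd.1.log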
On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> Also, I can note that it does not take a full cluster restart to trigger
> this. If I just restart an OSD that was up/in previously, the same error
> can happen (though not every time). So restarting OSDs for me is a bit
> like Russian roulette. =) Even though restarting an OSD may not always
> result in the error, it seems that once it happens, that OSD is gone for
> good. No amount of restarting has brought any of the dead ones back.
>
> I'd really like to get to the bottom of it. Let me know if I can do
> anything to help.
>
> I may also have to try completely wiping/rebuilding to see if I can make
> this thing usable.
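A quick way to see which OSDs failed to come back after a restart (a sketch
using standard commands; exact output varies by version):

    ceph osd tree    # shows each OSD and whether it is up or down
    ceph -s          # cluster summary, including how many OSDs are up/in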
>
>
> On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>
>> Hi Sam,
>>
>> Thanks for being willing to take a look.
>>
>> I applied the debug settings on one host where all 3 of its OSDs have this
>> problem, then tried to start them up. Here are the resulting logs:
>>
>> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>
>> - Travis
>>
>>
>> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>
>>> You appear to be missing pg metadata for some reason. If you can
>>> reproduce it with
>>> debug osd = 20
>>> debug filestore = 20
>>> debug ms = 1
>>> on all of the OSDs, I should be able to track it down.
>>>
>>> I created a bug: #4855.
>>>
>>> Thanks!
>>> -Sam
>>>
>>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>> > Thanks Greg.
>>> >
>>> > I quit playing with it because every time I restarted the cluster
>>> > (service ceph -a restart), I lost more OSDs. The first time it was 1,
>>> > the second time 10, the third time 13. All 13 down OSDs show the same
>>> > stack trace.
>>> >
>>> > - Travis
>>> >
>>> >
>>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <greg@xxxxxxxxxxx>
>>> > wrote:
>>> >>
>>> >> This sounds vaguely familiar to me, and I see
>>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>>> >> reproduce" — I think maybe this is fixed in "next" and "master", but
>>> >> I'm not sure. For more than that I'd have to defer to Sage or Sam.
>>> >> -Greg
>>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>> >>
>>> >>
>>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <trhoden@xxxxxxxxx>
>>> >> wrote:
>>> >> > Hey folks,
>>> >> >
>>> >> > I'm helping put together a new test/experimental cluster, and hit this
>>> >> > today when bringing the cluster up for the first time (using mkcephfs).
>>> >> >
>>> >> > After doing the normal "service ceph -a start", I noticed one OSD was
>>> >> > down, and a lot of PGs were stuck creating. I tried restarting the down
>>> >> > OSD, but it wouldn't come up. It always had this error:
>>> >> >
>>> >> > -1> 2013-04-27 18:11:56.179804 b6fcd000 2 osd.1 0 boot
>>> >> > 0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
>>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>>> >> >
>>> >> > ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>>> >> > 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>>> >> > 2: (OSD::load_pgs()+0x357) [0x28cba0]
>>> >> > 3: (OSD::init()+0x741) [0x290a16]
>>> >> > 4: (main()+0x1427) [0x2155c0]
>>> >> > 5: (__libc_start_main()+0x99) [0xb69bcf42]
>>> >> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>>> >> > to interpret this.
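A sketch of producing the dump that the NOTE above asks for, assuming ceph-osd
is on the PATH of the host whose OSD crashed:

    # disassemble the exact binary that produced the backtrace so the
    # addresses in it (e.g. [0x2c3c0a]) can be interpreted
    objdump -rdS $(which ceph-osd) > ceph-osd-objdump.txt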
>>> >> >
>>> >> >
>>> >> > I then did a full cluster restart, and now I have ten OSDs down --
>>> >> > each showing the same exception/failed assert.
>>> >> >
>>> >> > Anybody seen this?
>>> >> >
>>> >> > I know I'm running a weird version -- it's compiled from source, and
>>> >> > was provided to me. The OSDs are all on ARM, and the mon is x86_64.
>>> >> > Just looking to see if anyone has seen this particular stack trace of
>>> >> > load_pgs()/peek_map_epoch() before....
>>> >> >
>>> >> > - Travis
>>> >> >
>>> >> > _______________________________________________
>>> >> > ceph-users mailing list
>>> >> > ceph-users@xxxxxxxxxxxxxx
>>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >> >
>>> >
>>> >
>>
>>
>