Re: OSDs crashing with Operation Not Permitted on reading PGLog

On 10/27/2014 10:48 PM, Samuel Just wrote:
> Different nodes?

No, they are both from osd.25

I re-ran the strace with an empty logfile since the old logfile had become
pretty big.
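
For anyone who wants to pull the failing calls out of a large strace like
this one, here is a minimal sketch in Python. It assumes plain `strace` or
`strace -f` output without timestamp prefixes, already decompressed. The
function name `failed_syscalls` and the regular expression are illustrative
only and are not part of Ceph or strace.

```python
import re
import sys

# Matches failed calls in strace output, for example:
#   stat("/some/path", 0x...) = -1 ENOENT (No such file or directory)
# An optional "[pid N] " or "N " prefix (as written by strace -f) is skipped.
FAILED_CALL = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?'
    r'(?P<call>\w+)\((?P<args>.*)\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)'
)


def failed_syscalls(lines, errnos=None):
    """Yield (syscall, errno, args) for every failed call in strace output.

    If errnos is given, only calls failing with one of those names are kept.
    """
    for line in lines:
        match = FAILED_CALL.match(line.strip())
        if match and (errnos is None or match.group("errno") in errnos):
            yield match.group("call"), match.group("errno"), match.group("args")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python failed_calls.py ceph-osd.25.strace
    with open(sys.argv[1], errors="replace") as handle:
        for call, errno, args in failed_syscalls(handle, {"EPERM", "ENOENT"}):
            print(f"{errno:8} {call}({args})")
```

Filtering on EPERM and ENOENT is just a starting point chosen to match the
errors seen in this thread; pass `None` to list every failed call.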

Wido

> -Sam
> 
> On Mon, Oct 27, 2014 at 2:43 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> On 10/27/2014 10:35 PM, Samuel Just wrote:
>>> The file is supposed to be 0 bytes, can you attach the log which went
>>> with that strace?
>>
>> Yes, two URLs:
>>
>> * http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
>> * http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz
>>
>> It was with debug_filestore on 20.
>>
>> Wido
>>
>>> -Sam
>>>
>>> On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>> On 10/27/2014 10:05 PM, Samuel Just wrote:
>>>>> Try reproducing with an strace.
>>>>
>>>> I did so and this is the result:
>>>> http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz
>>>>
>>>> It does this stat:
>>>>
>>>> stat("/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C"
>>>>
>>>> That fails with: -1 ENOENT (No such file or directory)
>>>>
>>>> Afterwards it opens this pglog:
>>>> /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none
>>>>
>>>> That file is 0 bytes, however, as are all other files in the same directory.
>>>>
>>>> Afterwards the OSD asserts and writes to the log.
>>>>
>>>> Wido
>>>>
>>>>> -Sam
>>>>>
>>>>> On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to start
>>>>>> due to a crash they encounter when reading the PGLog.
>>>>>>
>>>>>> A snippet of the log:
>>>>>>
>>>>>>    -11> 2014-10-27 21:56:04.690046 7f034a006800 10
>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
>>>>>>    -10> 2014-10-27 21:56:04.690078 7f034a006800 20
>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no xattr
>>>>>>     -9> 2014-10-27 21:56:04.690140 7f034a006800 20
>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
>>>>>>     -8> 2014-10-27 21:56:04.690150 7f034a006800 15
>>>>>> filestore(/var/lib/ceph/osd/ceph-25) touch meta/a1630ecd/pglog_14.1a56/0//-1
>>>>>>     -7> 2014-10-27 21:56:04.690184 7f034a006800 10
>>>>>> filestore(/var/lib/ceph/osd/ceph-25) touch
>>>>>> meta/a1630ecd/pglog_14.1a56/0//-1 = 0
>>>>>>     -6> 2014-10-27 21:56:04.690196 7f034a006800 15
>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
>>>>>> meta/a1630ecd/pglog_14.1a56/0//-1
>>>>>>     -5> 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
>>>>>> a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
>>>>>>     -4> 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  >
>>>>>> header.spos 0.0.0
>>>>>>     -3> 2014-10-27 21:56:04.690314 7f034a006800  0
>>>>>> filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not permitted
>>>>>> not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
>>>>>>     -2> 2014-10-27 21:56:04.690325 7f034a006800  0
>>>>>> filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
>>>>>>     -1> 2014-10-27 21:56:04.690327 7f034a006800  0
>>>>>> filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
>>>>>> { "ops": [
>>>>>>         { "op_num": 0,
>>>>>>           "op_name": "nop"},
>>>>>>         { "op_num": 1,
>>>>>>           "op_name": "touch",
>>>>>>           "collection": "meta",
>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
>>>>>>         { "op_num": 2,
>>>>>>           "op_name": "omap_rmkeys",
>>>>>>           "collection": "meta",
>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
>>>>>>         { "op_num": 3,
>>>>>>           "op_name": "omap_setkeys",
>>>>>>           "collection": "meta",
>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1",
>>>>>>           "attr_lens": { "can_rollback_to": 12}}]}
>>>>>>      0> 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: In
>>>>>> function 'unsigned int
>>>>>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
>>>>>> ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 21:56:04.690368
>>>>>> os/FileStore.cc: 2559: FAILED assert(0 == "unexpected error")
>>>>>>
>>>>>>
>>>>>> The backing XFS filesystem seems to be OK, but isn't this a leveldb
>>>>>> issue, since that is where the omap information is stored?
>>>>>>
>>>>>> Anyone seen this before? I have about 5 OSDs (out of the 336) which are
>>>>>> showing this problem when booting.
>>>>>>
>>>>>> --
>>>>>> Wido den Hollander
>>>>>> 42on B.V.
>>>>>> Ceph trainer and consultant
>>>>>>
>>>>>> Phone: +31 (0)20 700 9902
>>>>>> Skype: contact42on
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on