Re: OSDs crashing with Operation Not Permitted on reading PGLog

If it's a small number of down pgs though... I think
ceph_objectstore_tool might be able to extract the object data from
the down pgs and re-upload it using rados into newly created pgs, if
it comes to that.
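
For what it's worth, a minimal sketch of that recovery path (hedged:
whether these exact ops made it into firefly is the open question
below; the OSD must be stopped first, and osd.30 and the pool/object
names are only illustrative):

    # export the down pg from the stopped osd's data directory
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-25 \
        --journal-path /var/lib/ceph/osd/ceph-25/journal \
        --pgid 14.1a56 --op export --file /tmp/pg.14.1a56.export

    # then either import it into a healthy (stopped) osd ...
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 \
        --journal-path /var/lib/ceph/osd/ceph-30/journal \
        --op import --file /tmp/pg.14.1a56.export

    # ... or extract the objects and re-upload them into newly
    # created pgs with something like:
    rados -p <pool> put <object-name> <object-data-file>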

David: Was that feature in firefly?  If so, did the feature allowing
you to not mount the leveldb store also make it into firefly?  Are
there instructions somewhere on how to do it?
-Sam

On Mon, Oct 27, 2014 at 3:11 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
> Also, if an IPMI reset borked the machines, there is something wrong
> with your hardware or (I'm assuming) your XFS version.
> -Sam
>
> On Mon, Oct 27, 2014 at 3:10 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>> You might try asking the leveldb folks about possibly repairing it.
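>>
>> A hedged sketch of what a repair attempt could look like (assuming
>> the omap leveldb lives in current/omap on a firefly filestore OSD,
>> and that a leveldb binding such as plyvel is available; keep the
>> osd stopped and back the directory up first):
>>
>>     cp -a /var/lib/ceph/osd/ceph-25/current/omap \
>>         /var/lib/ceph/osd/ceph-25/current/omap.bak
>>     python -c 'import plyvel; plyvel.repair_db("/var/lib/ceph/osd/ceph-25/current/omap")'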
>> -Sam
>>
>> On Mon, Oct 27, 2014 at 3:09 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> On 10/27/2014 11:00 PM, Samuel Just wrote:
>>>> Try running with osd_leveldb_paranoid=true and
>>>> osd_leveldb_log=/var/log/ceph/osd/ceph-osd.<id>.log.leveldb on that
>>>> osd.
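>>>>
>>>> For example, in ceph.conf on that node (illustrative; restart the
>>>> osd to pick it up):
>>>>
>>>>     [osd.25]
>>>>         osd leveldb paranoid = true
>>>>         osd leveldb log = /var/log/ceph/osd/ceph-osd.25.log.leveldb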
>>>
>>> Done, and leveldb gave quite a clear message:
>>>
>>> 2014/10/27-23:06:56.525355 7f14d0ea9800 Recovering log #164296
>>> 2014/10/27-23:06:56.554527 7f14d0ea9800 Delete type=0 #164296
>>> 2014/10/27-23:06:56.554644 7f14d0ea9800 Delete type=2 #164297
>>> 2014/10/27-23:06:56.555415 7f14d0ea9800 Delete type=2 #164298
>>> 2014/10/27-23:06:56.555709 7f14d0ea9800 Delete type=3 #164295
>>> 2014/10/27-23:06:56.556116 7f14cbc45700 Compacting 1@1 + 2@2 files
>>> 2014/10/27-23:06:56.626336 7f14cbc45700 Generated table #164299: 57
>>> keys, 2193624 bytes
>>> 2014/10/27-23:06:56.642292 7f14cbc45700 compacted to: files[ 10 15 32 0
>>> 0 0 0 ]
>>> 2014/10/27-23:06:56.642310 7f14cbc45700 Compaction error: Corruption:
>>> block checksum mismatch
>>>
>>> What happened on this cluster is that an admin made a mistake and
>>> accidentally reset all the machines via IPMI, so all the
>>> filesystems (and thus leveldb) were not closed cleanly.
>>>
>>> However, 5 OSDs don't seem to have survived, which now causes 4 PGs
>>> to be down.
>>>
>>> Wido
>>>
>>>> -Sam
>>>>
>>>> On Mon, Oct 27, 2014 at 2:56 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>> On 10/27/2014 10:55 PM, Samuel Just wrote:
>>>>>> There is nothing in dmesg?
>>>>>
>>>>> No. The filesystem mounts cleanly, and I even ran xfs_repair to see
>>>>> if anything was wrong with it.
>>>>>
>>>>> Everything checks out fine. It's only the OSD that is crashing.
>>>>>
>>>>> Wido
>>>>>
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Oct 27, 2014 at 2:53 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>> On 10/27/2014 10:52 PM, Samuel Just wrote:
>>>>>>>> I mean, are the 5 OSDs on different nodes?
>>>>>>>
>>>>>>> Yes. The cluster consists of 16 nodes, and all of these OSDs are on
>>>>>>> different nodes.
>>>>>>>
>>>>>>> All running Ubuntu 12.04 with Ceph 0.80.7
>>>>>>>
>>>>>>> Wido
>>>>>>>
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Mon, Oct 27, 2014 at 2:50 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>>>> On 10/27/2014 10:48 PM, Samuel Just wrote:
>>>>>>>>>> Different nodes?
>>>>>>>>>
>>>>>>>>> No, they are both from osd.25.
>>>>>>>>>
>>>>>>>>> I re-ran the strace with an empty logfile, since the old logfile had
>>>>>>>>> become pretty big.
>>>>>>>>>
>>>>>>>>> Wido
>>>>>>>>>
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 27, 2014 at 2:43 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>>>>>> On 10/27/2014 10:35 PM, Samuel Just wrote:
>>>>>>>>>>>> The file is supposed to be 0 bytes. Can you attach the log that went
>>>>>>>>>>>> with that strace?
>>>>>>>>>>>
>>>>>>>>>>> Yes, two URLs:
>>>>>>>>>>>
>>>>>>>>>>> * http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
>>>>>>>>>>> * http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz
>>>>>>>>>>>
>>>>>>>>>>> It was with debug_filestore set to 20.
>>>>>>>>>>>
>>>>>>>>>>> Wido
>>>>>>>>>>>
>>>>>>>>>>>> -Sam
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>>>>>>>> On 10/27/2014 10:05 PM, Samuel Just wrote:
>>>>>>>>>>>>>> Try reproducing with an strace.
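>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Something along these lines, for example (a sketch: -f follows
>>>>>>>>>>>>>> forked children/threads, -o writes the trace to a file, and the
>>>>>>>>>>>>>> osd runs in the foreground):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     strace -f -o /tmp/ceph-osd.25.strace ceph-osd -i 25 -f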
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did so and this is the result:
>>>>>>>>>>>>> http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz
>>>>>>>>>>>>>
>>>>>>>>>>>>> It does this stat:
>>>>>>>>>>>>>
>>>>>>>>>>>>> stat("/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C"
>>>>>>>>>>>>>
>>>>>>>>>>>>> That fails with: -1 ENOENT (No such file or directory)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Afterwards it opens this pglog:
>>>>>>>>>>>>> /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none
>>>>>>>>>>>>>
>>>>>>>>>>>>> That file, however, is 0 bytes, as are all the other files in the
>>>>>>>>>>>>> same directory.
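>>>>>>>>>>>>>
>>>>>>>>>>>>> For what it's worth, the zero-length files can be listed with
>>>>>>>>>>>>> something like this (the glob is a guess that matches the pglog
>>>>>>>>>>>>> names seen in the strace):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     find /var/lib/ceph/osd/ceph-25/current/meta -name 'pglog*' -size 0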
>>>>>>>>>>>>>
>>>>>>>>>>>>> Afterwards the OSD asserts and writes to the log.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wido
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to start
>>>>>>>>>>>>>>> due to a crash they encounter when reading the PGLog.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A snippet of the log:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    -11> 2014-10-27 21:56:04.690046 7f034a006800 10
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
>>>>>>>>>>>>>>>    -10> 2014-10-27 21:56:04.690078 7f034a006800 20
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no xattr
>>>>>>>>>>>>>>>     -9> 2014-10-27 21:56:04.690140 7f034a006800 20
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
>>>>>>>>>>>>>>>     -8> 2014-10-27 21:56:04.690150 7f034a006800 15
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) touch meta/a1630ecd/pglog_14.1a56/0//-1
>>>>>>>>>>>>>>>     -7> 2014-10-27 21:56:04.690184 7f034a006800 10
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) touch
>>>>>>>>>>>>>>> meta/a1630ecd/pglog_14.1a56/0//-1 = 0
>>>>>>>>>>>>>>>     -6> 2014-10-27 21:56:04.690196 7f034a006800 15
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
>>>>>>>>>>>>>>> meta/a1630ecd/pglog_14.1a56/0//-1
>>>>>>>>>>>>>>>     -5> 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
>>>>>>>>>>>>>>> a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
>>>>>>>>>>>>>>>     -4> 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  >
>>>>>>>>>>>>>>> header.spos 0.0.0
>>>>>>>>>>>>>>>     -3> 2014-10-27 21:56:04.690314 7f034a006800  0
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not permitted
>>>>>>>>>>>>>>> not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
>>>>>>>>>>>>>>>     -2> 2014-10-27 21:56:04.690325 7f034a006800  0
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
>>>>>>>>>>>>>>>     -1> 2014-10-27 21:56:04.690327 7f034a006800  0
>>>>>>>>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
>>>>>>>>>>>>>>> { "ops": [
>>>>>>>>>>>>>>>         { "op_num": 0,
>>>>>>>>>>>>>>>           "op_name": "nop"},
>>>>>>>>>>>>>>>         { "op_num": 1,
>>>>>>>>>>>>>>>           "op_name": "touch",
>>>>>>>>>>>>>>>           "collection": "meta",
>>>>>>>>>>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
>>>>>>>>>>>>>>>         { "op_num": 2,
>>>>>>>>>>>>>>>           "op_name": "omap_rmkeys",
>>>>>>>>>>>>>>>           "collection": "meta",
>>>>>>>>>>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
>>>>>>>>>>>>>>>         { "op_num": 3,
>>>>>>>>>>>>>>>           "op_name": "omap_setkeys",
>>>>>>>>>>>>>>>           "collection": "meta",
>>>>>>>>>>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1",
>>>>>>>>>>>>>>>           "attr_lens": { "can_rollback_to": 12}}]}
>>>>>>>>>>>>>>>      0> 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: In
>>>>>>>>>>>>>>> function 'unsigned int
>>>>>>>>>>>>>>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
>>>>>>>>>>>>>>> ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 21:56:04.690368
>>>>>>>>>>>>>>> os/FileStore.cc: 2559: FAILED assert(0 == "unexpected error")
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The backing XFS filesystem seems to be OK, but isn't this an issue in
>>>>>>>>>>>>>>> leveldb, where the omap information is stored?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Has anyone seen this before? I have about 5 OSDs (out of 336) that are
>>>>>>>>>>>>>>> showing this problem when booting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Wido den Hollander
>>>>>>>>>>>>>>> 42on B.V.
>>>>>>>>>>>>>>> Ceph trainer and consultant
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Phone: +31 (0)20 700 9902
>>>>>>>>>>>>>>> Skype: contact42on