On 10/27/2014 10:52 PM, Samuel Just wrote:
> I mean, the 5 osds, different nodes?

Yes. The cluster consists of 16 nodes and all these OSDs are on
different nodes.

All running Ubuntu 12.04 with Ceph 0.80.7.

Wido

> -Sam
>
> On Mon, Oct 27, 2014 at 2:50 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> On 10/27/2014 10:48 PM, Samuel Just wrote:
>>> Different nodes?
>>
>> No, they are both from osd.25
>>
>> I re-ran the strace with an empty logfile since the old logfile became
>> pretty big.
>>
>> Wido
>>
>>> -Sam
>>>
>>> On Mon, Oct 27, 2014 at 2:43 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>> On 10/27/2014 10:35 PM, Samuel Just wrote:
>>>>> The file is supposed to be 0 bytes, can you attach the log which went
>>>>> with that strace?
>>>>
>>>> Yes, two URLs:
>>>>
>>>> * http://ceph.o.auroraobjects.eu/ceph-osd.25.log.gz
>>>> * http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz
>>>>
>>>> It was with debug_filestore on 20.
>>>>
>>>> Wido
>>>>
>>>>> -Sam
>>>>>
>>>>> On Mon, Oct 27, 2014 at 2:16 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>> On 10/27/2014 10:05 PM, Samuel Just wrote:
>>>>>>> Try reproducing with an strace.
>>>>>>
>>>>>> I did so and this is the result:
>>>>>> http://ceph.o.auroraobjects.eu/ceph-osd.25.strace.gz
>>>>>>
>>>>>> It does this stat:
>>>>>>
>>>>>> stat("/var/lib/ceph/osd/ceph-25/current/meta/DIR_D/DIR_C"
>>>>>>
>>>>>> That fails with: -1 ENOENT (No such file or directory)
>>>>>>
>>>>>> Afterwards it opens this pglog:
>>>>>> /var/lib/ceph/osd/ceph-25/current/meta/DIR_D/pglog\\u14.1a56__0_A1630ECD__none
>>>>>>
>>>>>> That file is, however, 0 bytes (as are all other files in the same directory).
>>>>>>
>>>>>> Afterwards the OSD asserts and writes to the log.
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Mon, Oct 27, 2014 at 2:02 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On a 0.80.7 cluster I'm experiencing a couple of OSDs refusing to start
>>>>>>>> due to a crash they encounter when reading the PGLog.
>>>>>>>>
>>>>>>>> A snippet of the log:
>>>>>>>>
>>>>>>>>    -11> 2014-10-27 21:56:04.690046 7f034a006800 10
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _do_transaction on 0x392e600
>>>>>>>>    -10> 2014-10-27 21:56:04.690078 7f034a006800 20
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _check_global_replay_guard no xattr
>>>>>>>>     -9> 2014-10-27 21:56:04.690140 7f034a006800 20
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _check_replay_guard no xattr
>>>>>>>>     -8> 2014-10-27 21:56:04.690150 7f034a006800 15
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) touch meta/a1630ecd/pglog_14.1a56/0//-1
>>>>>>>>     -7> 2014-10-27 21:56:04.690184 7f034a006800 10
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) touch
>>>>>>>> meta/a1630ecd/pglog_14.1a56/0//-1 = 0
>>>>>>>>     -6> 2014-10-27 21:56:04.690196 7f034a006800 15
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) _omap_rmkeys
>>>>>>>> meta/a1630ecd/pglog_14.1a56/0//-1
>>>>>>>>     -5> 2014-10-27 21:56:04.690290 7f034a006800 10 filestore oid:
>>>>>>>> a1630ecd/pglog_14.1a56/0//-1 not skipping op, *spos 1435883.0.2
>>>>>>>>     -4> 2014-10-27 21:56:04.690295 7f034a006800 10 filestore  >
>>>>>>>> header.spos 0.0.0
>>>>>>>>     -3> 2014-10-27 21:56:04.690314 7f034a006800  0
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25)  error (1) Operation not permitted
>>>>>>>> not handled on operation 33 (1435883.0.2, or op 2, counting from 0)
>>>>>>>>     -2> 2014-10-27 21:56:04.690325 7f034a006800  0
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25) unexpected error code
>>>>>>>>     -1> 2014-10-27 21:56:04.690327 7f034a006800  0
>>>>>>>> filestore(/var/lib/ceph/osd/ceph-25)  transaction dump:
>>>>>>>> { "ops": [
>>>>>>>>         { "op_num": 0,
>>>>>>>>           "op_name": "nop"},
>>>>>>>>         { "op_num": 1,
>>>>>>>>           "op_name": "touch",
>>>>>>>>           "collection": "meta",
>>>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
>>>>>>>>         { "op_num": 2,
>>>>>>>>           "op_name": "omap_rmkeys",
>>>>>>>>           "collection": "meta",
>>>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
>>>>>>>>         { "op_num": 3,
>>>>>>>>           "op_name": "omap_setkeys",
>>>>>>>>           "collection": "meta",
>>>>>>>>           "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1",
>>>>>>>>           "attr_lens": { "can_rollback_to": 12}}]}
>>>>>>>>      0> 2014-10-27 21:56:04.691992 7f034a006800 -1 os/FileStore.cc: In
>>>>>>>> function 'unsigned int
>>>>>>>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
>>>>>>>> ThreadPool::TPHandle*)' thread 7f034a006800 time 2014-10-27 21:56:04.690368
>>>>>>>> os/FileStore.cc: 2559: FAILED assert(0 == "unexpected error")
>>>>>>>>
>>>>>>>>
>>>>>>>> The backing XFS filesystem seems to be OK, but isn't this a leveldb
>>>>>>>> issue where the omap information is stored?
>>>>>>>>
>>>>>>>> Anyone seen this before? I have about 5 OSDs (out of the 336) which are
>>>>>>>> showing this problem when booting.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Wido den Hollander
>>>>>>>> 42on B.V.
>>>>>>>> Ceph trainer and consultant
>>>>>>>>
>>>>>>>> Phone: +31 (0)20 700 9902
>>>>>>>> Skype: contact42on
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
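
As a rough aid for reading the transaction dump quoted in this thread: the error line reports the failure on "op 2, counting from 0", so it can help to look that entry up in the dump rather than count by eye. The sketch below does only that. The helper name `find_op` is hypothetical, the dump text is copied from the log excerpt above, and nothing here talks to Ceph itself.

```python
import json

# Transaction dump copied from the quoted log excerpt (whitespace tidied).
# The "\/" sequences are ordinary JSON escapes for "/".
TRANSACTION_DUMP = r'''
{ "ops": [
    { "op_num": 0, "op_name": "nop"},
    { "op_num": 1, "op_name": "touch", "collection": "meta",
      "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
    { "op_num": 2, "op_name": "omap_rmkeys", "collection": "meta",
      "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1"},
    { "op_num": 3, "op_name": "omap_setkeys", "collection": "meta",
      "oid": "a1630ecd\/pglog_14.1a56\/0\/\/-1",
      "attr_lens": { "can_rollback_to": 12}}]}
'''


def find_op(dump_text, op_num):
    """Return the op entry whose "op_num" equals op_num, or None if absent.

    Hypothetical helper for illustration; not part of Ceph.
    """
    for op in json.loads(dump_text)["ops"]:
        if op["op_num"] == op_num:
            return op
    return None


if __name__ == "__main__":
    # The log says the unhandled error was on op 2 (counting from 0).
    failing = find_op(TRANSACTION_DUMP, 2)
    print(failing["op_name"], failing["oid"])
```

With the dump above this prints `omap_rmkeys a1630ecd/pglog_14.1a56/0//-1`, i.e. the op that failed is the omap key removal on the pglog object, which matches the `_omap_rmkeys` line at -6 in the log.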