Re: OSD crash when one of the PGs have problem

Evgeniy Firsov <Evgeniy.Firsov@xxxxxxxxxxx> · Mon, 1 Feb 2016 23:55:31 +0000

I filed a PR which fixes the bug: https://github.com/ceph/ceph/pull/7470

But there are at least two questions which need clarification:
1. I expect, and my tests confirm so far, that FileStore::_touch call is
NOT used on fast path. But there is an opinion that it might.
2. Does the fix need write_if_dirty part, which writes metadata besides
version info, such as epoch? The question is because PG::_init doesn't
have this part and writes version info only.

Thank you.

On 1/20/16, 5:07 PM, "Evgeniy Firsov" <Evgeniy.Firsov@xxxxxxxxxxx> wrote:

>The same test case fail with stock Jewel branch too.
>
>On 1/20/16, 5:02 PM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>Evgeniy Firsov" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>Evgeniy.Firsov@xxxxxxxxxxx> wrote:
>
>>I am able to reproduce the problem even with current, unmodified master.
>>
>>Test case:
>>1. Start 2 OSDs, 8 PGs, pool size = 2
>>2. Run write workload for some time.
>>3. Stop workload
>>4. rm -rf dev/osd0/current/0.2_head/*
>>5. ceph osd scrub 0
>>6. ceph pg repair 0.2
>>7. Restart OSD.
>>8. Get "error (17) File exists not handled on operation"
>>
>>On 1/20/16, 9:30 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>>Varada
>>Kari" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>>Varada.Kari@xxxxxxxxxxx> wrote:
>>
>>>Somnath,
>>>
>>>There are no xfs errors here.
>>>
>>>Varada
>>>
>>>> -----Original Message-----
>>>> From: Somnath Roy
>>>> Sent: Wednesday, January 20, 2016 10:34 PM
>>>> To: Varada Kari <Varada.Kari@xxxxxxxxxxx>; Sage Weil
>>>> <sage@xxxxxxxxxxxx>
>>>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>>>> Subject: RE: OSD crash when one of the PGs have problem
>>>> 
>>>> You are saying xfs is throwing errors in the dmesg, so, this could be
>>>>a
>>>>reason
>>>> why ceph is not able to read PG metadata for leveldb->xfs, isn't it ?
>>>> 
>>>> Thanks & Regards
>>>> Somnath
>>>> 
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
>>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Varada Kari
>>>> Sent: Wednesday, January 20, 2016 8:02 AM
>>>> To: Sage Weil
>>>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>>>> Subject: RE: OSD crash when one of the PGs have problem
>>>> 
>>>> Yes Sage. Applied Somnath's write path changes on top the jewel branch
>>>> (two weeks old) and jemalloc. Haven't tried with latest changes
>>>>though.
>>>>With
>>>> the same code, we are not able hit this issue again, when retried.
>>>>Not
>>>>sure if
>>>> we are hitting any race in committing change here.
>>>> 
>>>> Varada
>>>> 
>>>> > -----Original Message-----
>>>> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
>>>> > Sent: Wednesday, January 20, 2016 9:04 PM
>>>> > To: Varada Kari <Varada.Kari@xxxxxxxxxxx>
>>>> > Cc: ceph-devel@xxxxxxxxxxxxxxx
>>>> > Subject: Re: OSD crash when one of the PGs have problem
>>>> >
>>>> > On Wed, 20 Jan 2016, Varada Kari wrote:
>>>> > > Hi all,
>>>> > >
>>>> > >
>>>> > > Sorry for the long mail. Wanted to give more information in the
>>>>mail.
>>>> > >
>>>> > > We are seeing an issue in our test cluster when an osd is stopped
>>>> > > gracefully
>>>> > and restarted after the maintenance.
>>>> > > Please note, haven't used any no out option for the cluster.
>>>> > > After the reboot, not able to one of the PG, and no drive problems
>>>> > > are
>>>> > read errors are logged in dmesg/kernel logs. But we are not able to
>>>> > read the head object of PG for some reason, sorry of lack of logs we
>>>> > are running with debugs with 0/0 and build with jemalloc enabled for
>>>> > testing. Hence a customized version.
>>>> > >
>>>> > > <snip>
>>>> > > 2015-12-27 13:14:39.684307 7f5ae47ff700 -1 osd.1 405 *** Got
>>>>signal
>>>> > > Terminated ***
>>>> > > 2015-12-27 13:14:39.779307 7f5ae47ff700  0 osd.1 405
>>>>prepare_to_stop
>>>> > > telling mon we are shutting down
>>>> > > 2016-01-04 10:46:13.966959 7fad39d1e800  0 ceph version
>>>> > > 10.0.0(7eb76e73fd5b6e990ceba9d45e5a1fec92145fa5), process
>>>>ceph-osd,
>>>> > > pid 6975
>>>> > > 2016-01-04 10:46:14.007516 7fad39d1e800  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
>>>> > > 2016-01-04 10:46:14.008565 7fad39d1e800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > FIEMAP ioctl is disabled via 'filestore fiemap' config option
>>>> > > 2016-01-04 10:46:14.008573 7fad39d1e800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole'
>>>> > > config option
>>>> > > 2016-01-04 10:46:14.008601 7fad39d1e800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > splice is supported
>>>> > > 2016-01-04 10:46:14.008882 7fad39d1e800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > syncfs(2) syscall fully supported (by glibc and kernel)
>>>> > > 2016-01-04 10:46:14.008921 7fad39d1e800  0
>>>> > > xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > extsize is supported and your kernel >= 3.5
>>>> > > 2016-01-04 10:46:14.339253 7fad39d1e800  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD
>>>> > > journal
>>>> > > mode: checkpoint is not enabled
>>>> > > 2016-01-04 10:46:14.532510 7fad39d1e800  0 <cls>
>>>> > > cls/hello/cls_hello.cc:305: loading cls_hello
>>>> > > 2016-01-04 10:46:14.534168 7fad39d1e800  0 <cls>
>>>> > > cls/cephfs/cls_cephfs.cc:136: loading cephfs_size_scan
>>>> > > 2016-01-04 10:46:14.535626 7fad39d1e800  0 osd.1 405 crush map has
>>>> > > features 1107558400, adjusting msgr requires for clients
>>>> > > 2016-01-04 10:46:14.535634 7fad39d1e800  0 osd.1 405 crush map has
>>>> > > features 1107558400 was 8705, adjusting msgr requires for mons
>>>> > > 2016-01-04 10:46:14.535639 7fad39d1e800  0 osd.1 405 crush map has
>>>> > > features 1107558400, adjusting msgr requires for osds
>>>> > > 2016-01-04 10:46:14.954060 7fad39d1e800  0 osd.1 405 load_pgs
>>>> > > 2016-01-04 10:46:15.634892 7fad39d1e800 -1 osd.1 405 load_pgs
>>>>unable
>>>> > > to peek at 1.29e metadata, skipping
>>>> >
>>>> > It looks like the pg's metadata didn't commit properly.  It doesn't
>>>> > find it on startup, skips the pg, and then crashes below when it
>>>>tries
>>>> > to create it and the collection already exists.
>>>> >
>>>> > Are there any changes applied on top of upstream filestore?  Also,
>>>>is
>>>> > this reproducible on a later version of the code?
>>>> >
>>>> > > 2016-01-04 10:46:16.334253 7fad39d1e800  0 osd.1 405 load_pgs
>>>>opened
>>>> > > 158 pgs
>>>> > > 2016-01-04 10:46:16.336960 7fad39d1e800 -1 osd.1 405
>>>>log_to_monitors
>>>> > > {default=true}
>>>> > > 2016-01-04 10:46:16.702999 7fad273be700  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1)  error (17) File exists not
>>>> > > handled on operation 0x7fad2c83f1c0 (201318686.0.0, or op 0,
>>>> > > counting from 0)
>>>> > > 2016-01-04 10:46:16.703014 7fad273be700  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1) unexpected error code
>>>> > > 2016-01-04 10:46:16.703017 7fad273be700  0
>>>> > filestore(/var/lib/ceph/osd/ceph-1)  transaction dump:
>>>> > > {
>>>> > >     "ops": [
>>>> > >         {
>>>> > >             "op_num": 0,
>>>> > >             "op_name": "mkcoll",
>>>> > >             "collection": "1.29e_head"
>>>> > >         },
>>>> > >         {
>>>> > >             "op_num": 1,
>>>> > >             "op_name": "coll_hint",
>>>> > >             "collection": "1.29e_head",
>>>> > >             "type": 1,
>>>> > >             "pg_num": 800,
>>>> > >             "expected_num_objects": 0
>>>> > >         },
>>>> > >         {
>>>> > >             "op_num": 2,
>>>> > >             "op_name": "touch",
>>>> > >             "collection": "1.29e_head",
>>>> > >             "oid": "1\/0000029e\/\/head"
>>>> > >         },
>>>> > >         {
>>>> > >             "op_num": 3,
>>>> > >             "op_name": "omap_setkeys",
>>>> > >             "collection": "1.29e_head",
>>>> > >             "oid": "1\/0000029e\/\/head",
>>>> > >             "attr_lens": {
>>>> > >                 "_infover": 1
>>>> > >             }
>>>> > >         },
>>>> > >         {
>>>> > >             "op_num": 4,
>>>> > >             "op_name": "touch",
>>>> > >             "collection": "1.29e_head",
>>>> > >             "oid": "1\/0000029e\/\/head"
>>>> > >         },
>>>> > >         {
>>>> > >             "op_num": 5,
>>>> > >             "op_name": "omap_setkeys",
>>>> > >             "collection": "1.29e_head",
>>>> > >             "oid": "1\/0000029e\/\/head",
>>>> > >             "attr_lens": {
>>>> > >                 "_biginfo": 196,
>>>> > >                 "_epoch": 4,
>>>> > >                 "_info": 847,
>>>> > >                 "can_rollback_to": 12,
>>>> > >                 "rollback_info_trimmed_to": 12
>>>> > >             }
>>>> > >         }
>>>> > >     ]
>>>> > > }
>>>> > >
>>>> > > 2016-01-04 10:46:16.731323 7fad273be700 -1 os/FileStore.cc: In
>>>> > > function 'unsigned int
>>>> > > FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t,
>>>>int,
>>>> > > ThreadPool::TPHandle*)' thread 7fad273be700 time 2016-01-04
>>>> > > 10:46:16.703103
>>>> > > os/FileStore.cc: 3017: FAILED assert(0 == "unexpected error")
>>>> > >
>>>> > >  ceph version 10.0.0 (7eb76e73fd5b6e990ceba9d45e5a1fec92145fa5)
>>>> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> > > const*)+0x8b) [0x7fad39846a6b]
>>>> > >  2: (FileStore::_do_transaction(ObjectStore::Transaction&,
>>>>unsigned
>>>> > > long, int, ThreadPool::TPHandle*)+0xa48) [0x7fad3968bf58]
>>>> > >  3:
>>>> > > (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
>>>> > > std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>>>> > > ThreadPool::TPHandle*)+0x64) [0x7fad396932a4]
>>>> > >  4: (FileStore::_do_op(FileStore::OpSequencer*,
>>>> > > ThreadPool::TPHandle&)+0x2c3) [0x7fad39693583]
>>>> > >  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56)
>>>> > > [0x7fad398384e6]
>>>> > >  6: (ThreadPool::WorkThread::entry()+0x10) [0x7fad398393b0]
>>>> > >  7: (()+0x8182) [0x7fad38609182]
>>>> > >  8: (clone()+0x6d) [0x7fad367fc47d]
>>>> > > <snip>
>>>> > >
>>>> > >  On the reboot, because of the journal reply, we didn't see any
>>>> > > transaction
>>>> > dump issues this time, but we are seeing the following problem.
>>>> > >
>>>> > >  <snip>
>>>> > > 2016-01-04 10:47:09.302400 7f8190a3c800  0 ceph version 10.0.0
>>>> > > (7eb76e73fd5b6e990ceba9d45e5a1fec92145fa5), process ceph-osd, pid
>>>> > 9488
>>>> > > 2016-01-04 10:47:09.310055 7f8190a3c800  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
>>>> > > 2016-01-04 10:47:09.310379 7f8190a3c800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > FIEMAP ioctl is disabled via 'filestore fiemap' config option
>>>> > > 2016-01-04 10:47:09.310384 7f8190a3c800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole'
>>>> > > config option
>>>> > > 2016-01-04 10:47:09.310398 7f8190a3c800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > splice is supported
>>>> > > 2016-01-04 10:47:09.310869 7f8190a3c800  0
>>>> > > genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > syncfs(2) syscall fully supported (by glibc and kernel)
>>>> > > 2016-01-04 10:47:09.310935 7f8190a3c800  0
>>>> > > xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
>>>> > > extsize is supported and your kernel >= 3.5
>>>> > > 2016-01-04 10:47:09.322795 7f8190a3c800  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD
>>>> > > journal
>>>> > > mode: checkpoint is not enabled
>>>> > > 2016-01-04 10:47:09.322828 7f8190a3c800  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1) disabling wbthrottle(
>>>> > > filestore_wbthrottle_enable = 0 ) since do_fast_sync is enabled
>>>> > > 2016-01-04 10:47:09.327861 7f8183ee0700  0
>>>> > > filestore(/var/lib/ceph/osd/ceph-1) ## Fast sync is enabled ##
>>>> > > 2016-01-04 10:47:09.392333 7f8190a3c800  0 <cls>
>>>> > > cls/hello/cls_hello.cc:305: loading cls_hello
>>>> > > 2016-01-04 10:47:09.393968 7f8190a3c800  0 <cls>
>>>> > > cls/cephfs/cls_cephfs.cc:136: loading cephfs_size_scan
>>>> > > 2016-01-04 10:47:09.394528 7f8190a3c800  0 osd.1 410 crush map has
>>>> > > features 1107558400, adjusting msgr requires for clients
>>>> > > 2016-01-04 10:47:09.394536 7f8190a3c800  0 osd.1 410 crush map has
>>>> > > features 1107558400 was 8705, adjusting msgr requires for mons
>>>> > > 2016-01-04 10:47:09.394552 7f8190a3c800  0 osd.1 410 crush map has
>>>> > > features 1107558400, adjusting msgr requires for osds
>>>> > > 2016-01-04 10:47:09.458215 7f8190a3c800  0 osd.1 410 load_pgs
>>>> > > 2016-01-04 10:47:10.935841 7f8190a3c800  0 osd.1 410 load_pgs
>>>>opened
>>>> > > 159 pgs
>>>> > > 2016-01-04 10:47:10.936696 7f8190a3c800 -1 osd.1 410
>>>>log_to_monitors
>>>> > > {default=true}
>>>> > > 2016-01-04 10:47:13.942139 7f8190a3c800  0 osd.1 410 done with
>>>>init,
>>>> > > starting boot process
>>>> > > 2016-01-04 10:47:14.117463 7f817d1f3700  0 --
>>>> > > 10.242.43.141:6825/9488
>>>> > > >> 10.242.43.143:6821/40041 pipe(0x7f8141c0e000 sd=83 :6825 s=0
>>>> > > >> pgs=0
>>>> > > cs=0 l=0 c=0x7f8141c14180).accept connect_seq 0 vs existing 0
>>>>state
>>>> > > connecting
>>>> > > 2016-01-04 10:47:15.649791 7f812eefe700  0 --
>>>> > > 10.242.43.141:6825/9488
>>>> > > >> 10.242.43.141:6801/6106 pipe(0x7f8141c0e000 sd=93 :6825 s=0
>>>>pgs=0
>>>> > > cs=0 l=0 c=0x7f8141c14480).accept connect_seq 0 vs existing 0
>>>>state
>>>> > > connecting
>>>> > > 2016-01-04 10:47:15.649826 7f812edfd700  0 --
>>>> > > 10.242.43.141:6825/9488
>>>> > > >> 10.242.43.141:6821/6925 pipe(0x7f8141c1c000 sd=137 :6825 s=0
>>>> > > >> pgs=0
>>>> > > cs=0 l=0 c=0x7f8141c14780).accept connect_seq 0 vs existing 0
>>>>state
>>>> > > connecting
>>>> > > 2016-01-04 10:47:15.924059 7f81527ed700 -1 osd/PGLog.h: In
>>>>function
>>>> > > 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const
>>>> > > pg_log_t&)' thread 7f81527ed700 time 2016-01-04 10:47:15.906256
>>>> > > osd/PGLog.h: 89: FAILED assert(rollback_info_trimmed_to_riter ==
>>>> > > log.rbegin())
>>>> > >
>>>> > >  ceph version 10.0.0 (7eb76e73fd5b6e990ceba9d45e5a1fec92145fa5)
>>>> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> > > const*)+0x8b) [0x7f8190564a6b]
>>>> > >  2: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0xa21)
>>>> > > [0x7f8190173971]
>>>> > >  3: (boost::statechart::simple_state<PG::RecoveryState::Stray,
>>>> > > PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
>>>> > > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>>>> > > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>>>> > > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
>>>> > > 
>>>>(boost::statechart::history_mode)0>::react_impl(boost::statechart::e
>>>> > > ve nt_base const&, void const*)+0x1f4) [0x7f81901ad584]
>>>> > >  4:
>>>> > >
>>>> (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine
>>>> > > , PG::RecoveryState::Initial, std::allocator<void>,
>>>> > > 
>>>>boost::statechart::null_exception_translator>::send_event(boost::sta
>>>> > > te chart::event_base const&)+0x5b) [0x7f8190197bab]
>>>> > >  5: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>,
>>>> > > PG::RecoveryCtx*)+0x1ce) [0x7f81901412ce]
>>>> > >  6: (OSD::process_peering_events(std::list<PG*,
>>>>std::allocator<PG*>
>>>> > > > const&, ThreadPool::TPHandle&)+0x267) [0x7f81900cb677]
>>>> > >  7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
>>>> > > const&, ThreadPool::TPHandle&)+0x18) [0x7f819010fed8]
>>>> > >  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56)
>>>> > > [0x7f81905564e6]
>>>> > >  9: (ThreadPool::WorkThread::entry()+0x10) [0x7f81905573b0]
>>>> > >  10: (()+0x8182) [0x7f818f327182]
>>>> > >  11: (clone()+0x6d) [0x7f818d51a47d]
>>>> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>> > > needed
>>>> > to interpret this.
>>>> > > <snip>
>>>> > >
>>>> > >  Whenever we restart the osd, we are hitting the problem and not
>>>> > > able to
>>>> > get the osd up. Tried pg repair, which bails saying primary osd
>>>>down.
>>>> > >  Are there any repair techniques available to get this corrected?
>>>> >
>>>> > Most likely deleting the pg directory will be sufficient to clear it
>>>>up.
>>>> >
>>>> > sage
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>in the
>>>> body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at
>>>> http://vger.kernel.org/majordomo-info.html
>>>--
>>>To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>--
>>To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>the body of a message to majordomo@xxxxxxxxxxxxxxx
>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html