Re: OSD stops and fails

This is a new bug AFAICT, but it looks to be an issue in soft
state/logic, not disk state.

If you are seeing it frequently and can reproduce it with "debug osd =
20" set, that log would help track down what's going on. I've flagged
it for the team but can't investigate it myself.
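
One way to turn that logging on (just a sketch, adjust for how your
OSDs are deployed) is in ceph.conf on the affected node:

    [osd]
    debug osd = 20

or at runtime via the config database:

    ceph config set osd debug_osd 20
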
-Greg

On Wed, Sep 1, 2021 at 6:22 AM Amudhan P <amudhan83@xxxxxxxxx> wrote:
>
> Hi Greg,
>
> Since this cluster has been down for a long time, I am planning to destroy it and re-create it this weekend.
>
> Is this bug fixable in the reported Ceph version? And is there any data I can provide that would help fix this issue?
>
> Amudhan
>
> On Tue, Aug 31, 2021 at 9:20 AM Amudhan P <amudhan83@xxxxxxxxx> wrote:
>>
>> Gregory,
>>
>> I have raised a ticket already.
>> https://tracker.ceph.com/issues/52445
>>
>> Amudhan
>>
>> On Tue, Aug 31, 2021 at 12:00 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>
>>> Hmm, this ceph_assert hasn't shown up in my email before. It looks
>>> like there may be a soft-state bug in Octopus. Can you file a ticket
>>> at tracker.ceph.com with the backtrace and the OSD log file? We can
>>> direct that to the RADOS team to check out.
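>>> (On a package-based install the OSD log is usually at a path like
>>>     /var/log/ceph/ceph-osd.<id>.log
>>> though under cephadm it may end up in the journal instead.)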
>>> -Greg
>>>
>>> On Sat, Aug 28, 2021 at 7:13 AM Amudhan P <amudhan83@xxxxxxxxx> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I am having a peculiar problem with my Ceph Octopus cluster. Two weeks
>>> > ago an issue started with too many scrub errors, and later random OSDs
>>> > stopped, which led to corrupt PGs and missing replicas. Since it's a
>>> > testing cluster, I wanted to understand the issue.
>>> > I tried to recover the PGs, but it didn't help. When I set the
>>> > `norecover, norebalance, nodown` flags (commands below), the OSD
>>> > services keep running without stopping.
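>>> >
>>> > (For reference, that was along the lines of:
>>> >
>>> >     ceph osd set norecover
>>> >     ceph osd set norebalance
>>> >     ceph osd set nodown
>>> >
>>> > with matching `ceph osd unset ...` to clear them afterwards.)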
>>> >
>>> > I have gone through the steps in the Ceph OSD troubleshooting guide,
>>> > but nothing helped or led me to the cause.
>>> >
>>> > I mailed the list about this earlier but couldn't get a solution.
>>> >
>>> > Any help would be appreciated to find out the issue.
>>> >
>>> > *Error message from one of the OSDs that failed:*
>>> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/
>>> > ceph-15.2.7/src/osd/OSD.cc: 9521: FAILED ceph_assert(started <=
>>> > reserved_pushes)
>>> >
>>> >  ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus
>>> > (stable)
>>> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> > const*)+0x158) [0x55fcb6621dbe]
>>> >  2: (()+0x504fd8) [0x55fcb6621fd8]
>>> >  3: (OSD::do_recovery(PG*, unsigned int, unsigned long,
>>> > ThreadPool::TPHandle&)+0x5f5) [0x55fcb6704c25]
>>> >  4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*,
>>> > boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x55fcb6960a3d]
>>> >  5: (OSD::ShardedOpWQ::_process(unsigned int,
>>> > ceph::heartbeat_handle_d*)+0x12ef) [0x55fcb67224df]
>>> >  6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
>>> > [0x55fcb6d5b224]
>>> >  7: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55fcb6d5de84]
>>> >  8: (()+0x82de) [0x7f04c1b1c2de]
>>> >  9: (clone()+0x43) [0x7f04c0853e83]
>>> >
>>> >      0> 2021-08-28T13:53:37.444+0000 7f04a128d700 -1 *** Caught signal
>>> > (Aborted) **
>>> >  in thread 7f04a128d700 thread_name:tp_osd_tp
>>> >
>>> >  ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus
>>> > (stable)
>>> >  1: (()+0x12dd0) [0x7f04c1b26dd0]
>>> >  2: (gsignal()+0x10f) [0x7f04c078f70f]
>>> >  3: (abort()+0x127) [0x7f04c0779b25]
>>> >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> > const*)+0x1a9) [0x55fcb6621e0f]
>>> >  5: (()+0x504fd8) [0x55fcb6621fd8]
>>> >  6: (OSD::do_recovery(PG*, unsigned int, unsigned long,
>>> > ThreadPool::TPHandle&)+0x5f5) [0x55fcb6704c25]
>>> >  7: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*,
>>> > boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x55fcb6960a3d]
>>> >  8: (OSD::ShardedOpWQ::_process(unsigned int,
>>> > ceph::heartbeat_handle_d*)+0x12ef) [0x55fcb67224df]
>>> >  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
>>> > [0x55fcb6d5b224]
>>> >  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55fcb6d5de84]
>>> >  11: (()+0x82de) [0x7f04c1b1c2de]
>>> >  12: (clone()+0x43) [0x7f04c0853e83]
>>> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>>> > to interpret this.
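>>> >
>>> > (If a disassembly would help, I can run something like
>>> >
>>> >     objdump -rdS /usr/bin/ceph-osd
>>> >
>>> > on one of these nodes, assuming that is where the CentOS 8 package
>>> > puts the ceph-osd binary.)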
>>> >
>>> >
>>> > Thanks
>>> > Amudhan
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>> >
>>>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


