Re: OSDs are crashing during PG replication

Just out of curiosity, what did you find out about this file? That would
probably help anyone who runs into a similar issue.

/var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
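
For anyone who hits this later: that path encodes pool id 3, PG 3.2 and the
RBD block prefix rb.0.19f2e.238e1f29 (the last component, 000000000728, is
the block number). As a rough sketch, with the pool and image names below
being placeholders, something like this should show which image the object
belonged to and whether the cluster still knows about it:

# ceph osd lspools                      # map pool id 3 to its name
# rados -p <pool-name> stat rb.0.19f2e.238e1f29.000000000728
# rbd -p <pool-name> info <image-name>  # block_name_prefix should match
                                        # rb.0.19f2e.238e1f29

Also, rather than deleting such a file outright, moving it out of the OSD's
current/ directory while the OSD is stopped keeps a copy around in case it
turns out to be needed later.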

Cheers,
S

On Thu, Mar 3, 2016 at 3:58 PM, Alexander Gubanov <shtnik@xxxxxxxxx> wrote:
> None of that happened. After the OSDs went down I found this file:
> /var/lib/ceph/osd/ceph-4/current/3.2_head/rb.0.19f2e.238e1f29.000000000728__head_813E90A3__3
> The location of this file seemed very strange to me, so I simply removed it,
> and then all the OSDs started up again.
>
>
> On Fri, Feb 26, 2016 at 7:03 PM, Alexey Sheplyakov
> <asheplyakov@xxxxxxxxxxxx> wrote:
>>
>> Alexander,
>>
>> > # ceph osd pool get-quota cache
>> > quotas for pool 'cache':
>> > max objects: N/A
>> > max bytes  : N/A
>> > But I set target_max_bytes:
>> > # ceph osd pool set cache target_max_bytes 1000000000000
>> > Could that be the reason?
>>
>> I've been unable to reproduce http://tracker.ceph.com/issues/13098
>> without setting max_bytes.
>> Perhaps you hit a different bug.
>>
>> > Every time, 2 of the 18 OSDs crash.
>>
>> How did they get into that state? Were some of the OSDs full or nearly full?
>> Has the cache pool ever reached its target_max_bytes? Anything else that
>> might be relevant?
>>
>> Best regards,
>>       Alexey
>>
>>
>> On Wed, Feb 24, 2016 at 7:36 PM, Alexander Gubanov <shtnik@xxxxxxxxx>
>> wrote:
>> > Hmm. It seems that the cache pool quotas have not been set. At least I'm
>> > sure I didn't set them; maybe they are at their default values.
>> >
>> > # ceph osd pool get-quota cache
>> > quotas for pool 'cache':
>> >   max objects: N/A
>> >   max bytes  : N/A
>> >
>> > But I set target_max_bytes:
>> >
>> > # ceph osd pool set cache target_max_bytes 1000000000000
>> >
>> > Could that be the reason?
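>> >
>> > (On reasonably recent releases the configured value can be read back with,
>> > e.g.:
>> >
>> > # ceph osd pool get cache target_max_bytes
>> >
>> > which makes it easy to double-check what the cache tier is actually set
>> > to.)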
>> >
>> > On Wed, Feb 24, 2016 at 4:08 PM, Alexey Sheplyakov
>> > <asheplyakov@xxxxxxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >>
>> >> > 0> 2016-02-24 04:51:45.884445 7fd994825700 -1 osd/ReplicatedPG.cc: In
>> >> > function 'int
>> >> > ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*,
>> >> > ceph::buffer::list::iterator&, OSDOp&, ObjectContextRef&, bool)'
>> >> > thread
>> >> > 7fd994825700 time 2016-02-24 04:51:45.870995
>> >> osd/ReplicatedPG.cc: 5558: FAILED assert(cursor.data_complete)
>> >> > ceph version 0.80.11-8-g95c4287
>> >> > (95c4287b5d24b762bc8538633c5bb2918ecfe4dd)
>> >>
>> >> This one looks familiar: http://tracker.ceph.com/issues/13098
>> >>
>> >> A quick workaround is to unset the cache pool quota:
>> >>
>> >> ceph osd pool set-quota $cache_pool_name max_bytes 0
>> >> ceph osd pool set-quota $cache_pool_name max_objects 0
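>> >>
>> >> To confirm the quotas are really cleared afterwards, get-quota should
>> >> show N/A for both limits again, e.g.:
>> >>
>> >> ceph osd pool get-quota $cache_pool_name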
>> >>
>> >> The problem has been properly fixed in infernalis v9.1.0, and
>> >> (partially) in hammer (v0.94.6 which will be released soon).
>> >>
>> >>  Best regards,
>> >>       Alexey
>> >>
>> >>
>> >> On Wed, Feb 24, 2016 at 5:37 AM, Alexander Gubanov <shtnik@xxxxxxxxx>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Every time, 2 of the 18 OSDs crash. I think it happens during PG
>> >> > replication, because only 2 OSDs crash and they are the same ones every
>> >> > time.
>> >> >
>> >> > 0> 2016-02-24 04:51:45.884445 7fd994825700 -1 osd/ReplicatedPG.cc: In
>> >> > function 'int
>> >> > ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*,
>> >> > ceph::buffer::list::iterator&, OSDOp&, ObjectContextRef&, bool)'
>> >> > thread
>> >> > 7fd994825700 time 2016-02-24 04:51:45.870995
>> >> > osd/ReplicatedPG.cc: 5558: FAILED assert(cursor.data_complete)
>> >> >
>> >> >  ceph version 0.80.11-8-g95c4287
>> >> > (95c4287b5d24b762bc8538633c5bb2918ecfe4dd)
>> >> >  1: (ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*,
>> >> > ceph::buffer::list::iterator&, OSDOp&,
>> >> > std::tr1::shared_ptr<ObjectContext>&,
>> >> > bool)+0xffc) [0x7c1f7c]
>> >> >  2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*,
>> >> > std::vector<OSDOp,
>> >> > std::allocator<OSDOp> >&)+0x4171) [0x809f21]
>> >> >  3:
>> >> > (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x62)
>> >> > [0x814622]
>> >> >  4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x5f8)
>> >> > [0x815098]
>> >> >  5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3dd4)
>> >> > [0x81a3f4]
>> >> >  6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
>> >> > ThreadPool::TPHandle&)+0x66d) [0x7b4ecd]
>> >> >  7: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> >> > std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a5)
>> >> > [0x600ee5]
>> >> >  8: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>,
>> >> > ThreadPool::TPHandle&)+0x203) [0x61cba3]
>> >> >  9: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>,
>> >> > std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG>
>> >> >>::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x660f2c]
>> >> >  10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb20) [0xa7def0]
>> >> >  11: (ThreadPool::WorkThread::entry()+0x10) [0xa7ede0]
>> >> >  12: (()+0x7dc5) [0x7fd9ad03edc5]
>> >> >  13: (clone()+0x6d) [0x7fd9abd2828d]
>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >> > needed to
>> >> > interpret this.
>> >> >
>> >> > --- logging levels ---
>> >> >    0/ 5 none
>> >> >    0/ 1 lockdep
>> >> >    0/ 1 context
>> >> >    1/ 1 crush
>> >> >    1/ 5 mds
>> >> >    1/ 5 mds_balancer
>> >> >    1/ 5 mds_locker
>> >> >    1/ 5 mds_log
>> >> >    1/ 5 mds_log_expire
>> >> >    1/ 5 mds_migrator
>> >> >    0/ 1 buffer
>> >> >    0/ 1 timer
>> >> >    0/ 1 filer
>> >> >    0/ 1 striper
>> >> >    0/ 1 objecter
>> >> >    0/ 5 rados
>> >> >    0/ 5 rbd
>> >> >    0/ 5 journaler
>> >> >    0/ 5 objectcacher
>> >> >    0/ 5 client
>> >> >    0/ 5 osd
>> >> >    0/ 5 optracker
>> >> >    0/ 5 objclass
>> >> >    1/ 3 filestore
>> >> >    1/ 3 keyvaluestore
>> >> >    1/ 3 journal
>> >> >    0/ 5 ms
>> >> >    1/ 5 mon
>> >> >    0/10 monc
>> >> >    1/ 5 paxos
>> >> >    0/ 5 tp
>> >> >    1/ 5 auth
>> >> >    1/ 5 crypto
>> >> >    1/ 1 finisher
>> >> >    1/ 5 heartbeatmap
>> >> >    1/ 5 perfcounter
>> >> >    1/ 5 rgw
>> >> >    1/10 civetweb
>> >> >    1/ 5 javaclient
>> >> >    1/ 5 asok
>> >> >    1/ 1 throttle
>> >> >   -2/-2 (syslog threshold)
>> >> >   -1/-1 (stderr threshold)
>> >> >   max_recent     10000
>> >> >   max_new         1000
>> >> >   log_file /var/log/ceph/ceph-osd.3.log
>> >> > --- end dump of recent events ---
>> >> > 2016-02-24 04:51:45.944447 7fd994825700 -1 *** Caught signal
>> >> > (Aborted)
>> >> > **
>> >> >  in thread 7fd994825700
>> >> >
>> >> >  ceph version 0.80.11-8-g95c4287
>> >> > (95c4287b5d24b762bc8538633c5bb2918ecfe4dd)
>> >> >  1: /usr/bin/ceph-osd() [0x9a24f6]
>> >> >  2: (()+0xf100) [0x7fd9ad046100]
>> >> >  3: (gsignal()+0x37) [0x7fd9abc675f7]
>> >> >  4: (abort()+0x148) [0x7fd9abc68ce8]
>> >> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fd9ac56b9d5]
>> >> >  6: (()+0x5e946) [0x7fd9ac569946]
>> >> >  7: (()+0x5e973) [0x7fd9ac569973]
>> >> >  8: (()+0x5eb93) [0x7fd9ac569b93]
>> >> >  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> > const*)+0x1ef) [0xa8d9df]
>> >> >  10: (ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*,
>> >> > ceph::buffer::list::iterator&, OSDOp&,
>> >> > std::tr1::shared_ptr<ObjectContext>&,
>> >> > bool)+0xffc) [0x7c1f7c]
>> >> >  11: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*,
>> >> > std::vector<OSDOp,
>> >> > std::allocator<OSDOp> >&)+0x4171) [0x809f21]
>> >> >  12:
>> >> > (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x62)
>> >> > [0x814622]
>> >> >  13: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x5f8)
>> >> > [0x815098]
>> >> >  14: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3dd4)
>> >> > [0x81a3f4]
>> >> >  15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
>> >> > ThreadPool::TPHandle&)+0x66d) [0x7b4ecd]
>> >> >  16: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> >> > std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a5)
>> >> > [0x600ee5]
>> >> >  17: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>,
>> >> > ThreadPool::TPHandle&)+0x203) [0x61cba3]
>> >> >  18: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>,
>> >> > std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG>
>> >> >>::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x660f2c]
>> >> >  19: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb20) [0xa7def0]
>> >> >  20: (ThreadPool::WorkThread::entry()+0x10) [0xa7ede0]
>> >> >  21: (()+0x7dc5) [0x7fd9ad03edc5]
>> >> >  22: (clone()+0x6d) [0x7fd9abd2828d]
>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >> > needed to
>> >> > interpret this.
>> >> >
>> >> > --- begin dump of recent events ---
>> >> >     -5> 2016-02-24 04:51:45.904559 7fd995026700  5 -- op tracker -- ,
>> >> > seq:
>> >> > 19230, time: 2016-02-24 04:51:45.904559, event: started, request:
>> >> > osd_op(osd.13.12097:806246 rb.0.218d6.238e1f29.000000010db3@snapdir
>> >> > [list-snaps] 3.94c2bed2
>> >> > ack+read+ignore_cache+ignore_overlay+map_snap_clone
>> >> > e13252) v4
>> >> >     -4> 2016-02-24 04:51:45.904598 7fd995026700  1 --
>> >> > 172.16.0.1:6801/419703
>> >> > --> 172.16.0.3:6844/12260 -- osd_op_reply(806246
>> >> > rb.0.218d6.238e1f29.000000010db3 [list-snaps] v0'0 uv27683057 ondisk
>> >> > =
>> >> > 0) v6
>> >> > -- ?+0 0x9f90800 con 0x1b7838c0
>> >> >     -3> 2016-02-24 04:51:45.904616 7fd995026700  5 -- op tracker -- ,
>> >> > seq:
>> >> > 19230, time: 2016-02-24 04:51:45.904616, event: done, request:
>> >> > osd_op(osd.13.12097:806246 rb.0.218d6.238e1f29.000000010db3@snapdir
>> >> > [list-snaps] 3.94c2bed2
>> >> > ack+read+ignore_cache+ignore_overlay+map_snap_clone
>> >> > e13252) v4
>> >> >     -2> 2016-02-24 04:51:45.904637 7fd995026700  5 -- op tracker -- ,
>> >> > seq:
>> >> > 19231, time: 2016-02-24 04:51:45.904637, event: reached_pg, request:
>> >> > osd_op(osd.13.12097:806247 rb.0.218d6.238e1f29.000000010db3 [copy-get
>> >> > max
>> >> > 8388608] 3.94c2bed2
>> >> > ack+read+ignore_cache+ignore_overlay+map_snap_clone
>> >> > e13252) v4
>> >> >     -1> 2016-02-24 04:51:45.904673 7fd995026700  5 -- op tracker -- ,
>> >> > seq:
>> >> > 19231, time: 2016-02-24 04:51:45.904673, event: started, request:
>> >> > osd_op(osd.13.12097:806247 rb.0.218d6.238e1f29.000000010db3 [copy-get
>> >> > max
>> >> > 8388608] 3.94c2bed2
>> >> > ack+read+ignore_cache+ignore_overlay+map_snap_clone
>> >> > e13252) v4
>> >> >      0> 2016-02-24 04:51:45.944447 7fd994825700 -1 *** Caught signal
>> >> > (Aborted) **
>> >> >  in thread 7fd994825700
>> >> >
>> >> >  ceph version 0.80.11-8-g95c4287
>> >> > (95c4287b5d24b762bc8538633c5bb2918ecfe4dd)
>> >> >  1: /usr/bin/ceph-osd() [0x9a24f6]
>> >> >  2: (()+0xf100) [0x7fd9ad046100]
>> >> >  3: (gsignal()+0x37) [0x7fd9abc675f7]
>> >> >  4: (abort()+0x148) [0x7fd9abc68ce8]
>> >> >  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fd9ac56b9d5]
>> >> >  6: (()+0x5e946) [0x7fd9ac569946]
>> >> >  7: (()+0x5e973) [0x7fd9ac569973]
>> >> >  8: (()+0x5eb93) [0x7fd9ac569b93]
>> >> >  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> >> > const*)+0x1ef) [0xa8d9df]
>> >> >  10: (ReplicatedPG::fill_in_copy_get(ReplicatedPG::OpContext*,
>> >> > ceph::buffer::list::iterator&, OSDOp&,
>> >> > std::tr1::shared_ptr<ObjectContext>&,
>> >> > bool)+0xffc) [0x7c1f7c]
>> >> >  11: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*,
>> >> > std::vector<OSDOp,
>> >> > std::allocator<OSDOp> >&)+0x4171) [0x809f21]
>> >> >  12:
>> >> > (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x62)
>> >> > [0x814622]
>> >> >  13: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x5f8)
>> >> > [0x815098]
>> >> >  14: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3dd4)
>> >> > [0x81a3f4]
>> >> >  15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
>> >> > ThreadPool::TPHandle&)+0x66d) [0x7b4ecd]
>> >> >  16: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> >> > std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a5)
>> >> > [0x600ee5]
>> >> >  17: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>,
>> >> > ThreadPool::TPHandle&)+0x203) [0x61cba3]
>> >> >  18: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>,
>> >> > std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG>
>> >> >>::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x660f2c]
>> >> >  19: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb20) [0xa7def0]
>> >> >  20: (ThreadPool::WorkThread::entry()+0x10) [0xa7ede0]
>> >> >  21: (()+0x7dc5) [0x7fd9ad03edc5]
>> >> >  22: (clone()+0x6d) [0x7fd9abd2828d]
>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >> > needed to
>> >> > interpret this.
>> >> >
>> >> > --- logging levels ---
>> >> >    0/ 5 none
>> >> >    0/ 1 lockdep
>> >> >    0/ 1 context
>> >> >    1/ 1 crush
>> >> >    1/ 5 mds
>> >> >    1/ 5 mds_balancer
>> >> >    1/ 5 mds_locker
>> >> >    1/ 5 mds_log
>> >> >    1/ 5 mds_log_expire
>> >> >    1/ 5 mds_migrator
>> >> >    0/ 1 buffer
>> >> >    0/ 1 timer
>> >> >    0/ 1 filer
>> >> >    0/ 1 striper
>> >> >    0/ 1 objecter
>> >> >    0/ 5 rados
>> >> >    0/ 5 rbd
>> >> >    0/ 5 journaler
>> >> >    0/ 5 objectcacher
>> >> >    0/ 5 client
>> >> >    0/ 5 osd
>> >> >    0/ 5 optracker
>> >> >    0/ 5 objclass
>> >> >    1/ 3 filestore
>> >> >    1/ 3 keyvaluestore
>> >> >    1/ 3 journal
>> >> >    0/ 5 ms
>> >> >    1/ 5 mon
>> >> >    0/10 monc
>> >> >    1/ 5 paxos
>> >> >    0/ 5 tp
>> >> >    1/ 5 auth
>> >> >    1/ 5 crypto
>> >> >    1/ 1 finisher
>> >> >    1/ 5 heartbeatmap
>> >> >    1/ 5 perfcounter
>> >> >    1/ 5 rgw
>> >> >    1/10 civetweb
>> >> >    1/ 5 javaclient
>> >> >    1/ 5 asok
>> >> >    1/ 1 throttle
>> >> >   -2/-2 (syslog threshold)
>> >> >   -1/-1 (stderr threshold)
>> >> >   max_recent     10000
>> >> >   max_new         1000
>> >> >   log_file /var/log/ceph/ceph-osd.3.log
>> >> > --- end dump of recent events ---
>> >> >
>> >> > --
>> >> > Alexander Gubanov
>> >> >
>> >> > _______________________________________________
>> >> > ceph-users mailing list
>> >> > ceph-users@xxxxxxxxxxxxxx
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Alexander Gubanov
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
>
>
> --
> Alexander Gubanov
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Email:
shinobu@xxxxxxxxx
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


