Can't start OSD: one OSD is always down

Dear everyone,

I can't start osd.555, and some PGs can't be repaired. I'm using 3x
replication for my data pool. I suspect that some objects in those PGs
are damaged.

I tried deleting some of the data related to those objects, but osd.555
still won't start.

I then removed osd.555, but other OSDs now go down during object
recovery (e.g. osd.532 went down and won't start again).


I found suspicious messages in which the copy_subset size is huge:


MOSDPGPull(4.193c 295912
[PullOp(6e17193c/rbd_data.7acbea7d555b97.0000000000006b69/head//4,
recovery_info: ObjectRecoveryInfo(6e17193c/rbd_data.7acbea7d555b97.0000000000006b69/head//4@293776'14739967,
copy_subset: [0~18446744073709551615], clone_subset: {}),
recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:,
omap_complete:false)),PullOp(f159993c/rbd_data.1bece115bba4049.00000000001fa2cc/head//4,
recovery_info: ObjectRecoveryInfo(f159993c/rbd_data.1bece115bba4049.00000000001fa2cc/head//4@294153'14739969,
copy_subset: [0~18446744073709551615], clone_subset: {}),
recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 --
?+0 0x34a72760 con 0x330b2c00

# grep  18446744073709551615 /var/log/ceph/ceph-osd.532.log  --color |wc -l
122
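
The same check can be done with a short script that also lists which objects carry the huge range. This is only a sketch: the function name and the regular expression are illustrative, written against the PullOp dump shown above, and may need adjusting for other log formats. Note that it counts recovery infos per object, whereas grep | wc -l counts matching lines.

```python
import re
from collections import Counter

HUGE = "18446744073709551615"

# Matches "ObjectRecoveryInfo(<object>@<version>, copy_subset: [<ranges>]"
# as printed in the MOSDPGPull dump above (assumed format).
INFO_RE = re.compile(
    r"ObjectRecoveryInfo\(([^@\s]+)@[^,]+,\s*copy_subset:\s*\[([^\]]*)\]"
)

def objects_with_huge_subset(text):
    """Count, per object, the recovery infos whose copy_subset contains HUGE."""
    # Pasted log lines may be wrapped, so collapse whitespace before matching.
    flat = re.sub(r"\s+", " ", text)
    hits = Counter()
    for obj, ranges in INFO_RE.findall(flat):
        if HUGE in ranges:
            hits[obj] += 1
    return hits
```

To run it against a log, pass the file contents, for example objects_with_huge_subset(open("/var/log/ceph/ceph-osd.532.log").read()).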


In the source, copy_subset is declared as:

    interval_set<uint64_t> copy_subset;


Could this be caused by an overflow in that data structure?
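
One observation that may help narrow this down: 18446744073709551615 is exactly 2^64 - 1, the largest value a uint64_t can hold, which is also what (uint64_t)-1 evaluates to. So the range [0~18446744073709551615] might simply be a "whole object, length not yet known" placeholder rather than a corrupted value. That interpretation is an assumption on my part and would need to be checked against the 0.80.5 source. The sketch below (helper names are illustrative) parses an offset~length range as printed in the log and flags that value.

```python
UINT64_MAX = 2**64 - 1  # largest value representable in a uint64_t

def parse_interval(text):
    """Parse one 'offset~length' pair, e.g. '[0~4194304]', into two ints."""
    offset, length = text.strip("[] ").split("~")
    return int(offset), int(length)

def is_max_length(text):
    """True if the interval's length equals UINT64_MAX, i.e. (uint64_t)-1."""
    _, length = parse_interval(text)
    return length == UINT64_MAX

print(UINT64_MAX)                                 # 18446744073709551615
print(is_max_length("[0~18446744073709551615]"))  # True
print(is_max_length("[0~4194304]"))               # False
```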


Could someone please guide me on how to debug this? Thanks!


Relevant info below:
    ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)

    cluster ffa2090c-1bd0-4f8e-973d-b0d4ddf9c2d8
     health HEALTH_WARN 680 pgs degraded; 680 pgs stuck unclean; 42
requests are blocked > 32 sec; recovery 150458/29608490 objects
degraded (0.508%); 1/690 in osds are down; noout,noscrub,nodeep-scrub
flag(s) set
     monmap e11: 5 mons at {BJ-M1-Cloud71=XXX}, election epoch 39138,
quorum 0,1,2,3,4
BJ-M1-Cloud71,BJ-M1-Cloud73,BJ-M2-Cloud80,BJ-M2-Cloud81,BJ-M3-Cloud85
     osdmap e296072: 719 osds: 689 up, 690 in
            flags noout,noscrub,nodeep-scrub
      pgmap v127916572: 71504 pgs, 8 pools, 71837 GB data, 14457 kobjects
            140 TB used, 721 TB / 862 TB avail
            150458/29608490 objects degraded (0.508%)
               70824 active+clean
                 680 active+degraded
  client io 100 kB/s rd, 4203 kB/s wr, 652 op/s

ceph-osd.532.log:

  -10> 2018-08-22 12:32:26.771645 7f451d916700  5 -- op tracker -- ,
seq: 2846, time: 2018-08-22 12:32:26.771591, event: reached_pg,
request: MOSDPGPush(4.5d2e 295912
[PushOp(bfc1dd2e/rbd_data.183b85e601d1026.0000000000000212/head//4,
version: 293767'23828530, data_included: [], data_size: 0,
omap_header_size: 0, omap_entries_size: 0, attrset_size: 2,
recovery_info: ObjectRecoveryInfo(bfc1dd2e/rbd_data.183b85e601d1026.0000000000000212/head//4@293767'23828530,
copy_subset: [], clone_subset: {}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
    -9> 2018-08-22 12:32:26.771672 7f451b112700  5 -- op tracker -- ,
seq: 2848, time: 2018-08-22 12:32:26.771645, event: reached_pg,
request: MOSDPGPush(4.1fa 295912
[PushOp(66881fa/rbd_data.320a2da4111f069.00000000000074cd/head//4,
version: 289249'30699865, data_included: [0~4194304], data_size:
4194304, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2,
recovery_info: ObjectRecoveryInfo(66881fa/rbd_data.320a2da4111f069.00000000000074cd/head//4@289249'30699865,
copy_subset: [0~4194304], clone_subset: {}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:4194304,
data_complete:true, omap_recovered_to:, omap_complete:true),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
    -8> 2018-08-22 12:32:26.771735 7f451d916700  5 -- op tracker -- ,
seq: 2846, time: 2018-08-22 12:32:26.771735, event: done, request:
MOSDPGPush(4.5d2e 295912
[PushOp(bfc1dd2e/rbd_data.183b85e601d1026.0000000000000212/head//4,
version: 293767'23828530, data_included: [], data_size: 0,
omap_header_size: 0, omap_entries_size: 0, attrset_size: 2,
recovery_info: ObjectRecoveryInfo(bfc1dd2e/rbd_data.183b85e601d1026.0000000000000212/head//4@293767'23828530,
copy_subset: [], clone_subset: {}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
    -7> 2018-08-22 12:32:26.771769 7f451ed18700  5 -- op tracker -- ,
seq: 2847, time: 2018-08-22 12:32:26.771740, event: reached_pg,
request: MOSDPGPush(4.5d2e 295912
[PushOp(ff9f5d2e/rbd_data.183b85e601d1026.0000000000004a8d/head//4,
version: 293767'23828531, data_included: [], data_size: 0,
omap_header_size: 0, omap_entries_size: 0, attrset_size: 2,
recovery_info: ObjectRecoveryInfo(ff9f5d2e/rbd_data.183b85e601d1026.0000000000004a8d/head//4@293767'23828531,
copy_subset: [], clone_subset: {}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
    -6> 2018-08-22 12:32:26.771798 7f451b112700  5 -- op tracker -- ,
seq: 2848, time: 2018-08-22 12:32:26.771774, event: done, request:
MOSDPGPush(4.1fa 295912
[PushOp(66881fa/rbd_data.320a2da4111f069.00000000000074cd/head//4,
version: 289249'30699865, data_included: [0~4194304], data_size: 0,
omap_header_size: 0, omap_entries_size: 0, attrset_size: 2,
recovery_info: ObjectRecoveryInfo(66881fa/rbd_data.320a2da4111f069.00000000000074cd/head//4@289249'30699865,
copy_subset: [0~4194304], clone_subset: {}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:4194304,
data_complete:true, omap_recovered_to:, omap_complete:true),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
    -5> 2018-08-22 12:32:26.771867 7f451ed18700  5 -- op tracker -- ,
seq: 2847, time: 2018-08-22 12:32:26.771867, event: done, request:
MOSDPGPush(4.5d2e 295912
[PushOp(ff9f5d2e/rbd_data.183b85e601d1026.0000000000004a8d/head//4,
version: 293767'23828531, data_included: [], data_size: 0,
omap_header_size: 0, omap_entries_size: 0, attrset_size: 2,
recovery_info: ObjectRecoveryInfo(ff9f5d2e/rbd_data.183b85e601d1026.0000000000004a8d/head//4@293767'23828531,
copy_subset: [], clone_subset: {}), after_progress:
ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false))]) v2
    -4> 2018-08-22 12:32:26.773676 7f4525b23700  1 -- xxx<== osd.495
XXX 12 ==== osd_ping(ping e295912 stamp 2018-08-22 12:32:26.762150) v2
==== 47+0+0 (2232059652 0 0) 0x357ae040 con 0x34b05ee0
    -
     0> 2018-08-22 12:32:26.774276 7f451610a700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f451610a700


 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
 1: /usr/bin/ceph-osd() [0x9acb51]
 2: /lib64/libpthread.so.0() [0x3ee680f7e0]
 3: (ReplicatedPG::trim_object(hobject_t const&)+0x2e6) [0x854196]
 4: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim
const&)+0x73c) [0x856b5c]
 5: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects,
ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xa8) [0x8b4f38]
 6: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer,
ReplicatedPG::NotTrimming, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x12f) [0x8a3ebf]
 7: (ReplicatedPG::snap_trimmer()+0x5b0) [0x80f6b0]
 8: (OSD::SnapTrimWQ::_process(PG*)+0x1d) [0x664dbd]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0x9beac1]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x9c1b00]
 11: /lib64/libpthread.so.0() [0x3ee6807aa1]
 12: (clone()+0x6d) [0x3ee5ce893d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
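
Looking only at the named frames, the crashing thread appears to be in the snap trimming path (ReplicatedPG::trim_object, reached from ReplicatedPG::snap_trimmer via OSD::SnapTrimWQ), not in the pull/push recovery code. A minimal sketch for pulling the symbol names out of a backtrace like the one above follows. The regular expression and function name are illustrative and assume the "N: (symbol+0xoff) [0xaddr]" layout shown here.

```python
import re

# A named frame looks like:
#   " 3: (ReplicatedPG::trim_object(hobject_t const&)+0x2e6) [0x854196]"
# Frames without a symbol, e.g. " 1: /usr/bin/ceph-osd() [0x9acb51]", are skipped.
FRAME_RE = re.compile(r"(\d+): \((.+?)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]")

def named_frames(backtrace):
    """Return (frame number, symbol) pairs for frames that carry a symbol."""
    # Pasted backtraces may be wrapped, so collapse whitespace before matching.
    flat = re.sub(r"\s+", " ", backtrace)
    return [(int(n), sym) for n, sym in FRAME_RE.findall(flat)]
```

Filtering the result for symbols that start with "ReplicatedPG::" or "OSD::" gives a quick view of which OSD code path the thread was in when it crashed.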



