Hello,

I could use some advice. One week ago (19 Feb), I upgraded my stable Ceph Jewel cluster from 10.2.3 to 10.2.5 (yes, maybe that was a bad idea). I never had a problem with Ceph 10.2.3 since the previous upgrade on 23 September. Since the upgrade to 10.2.5, every two days the first OSD server freezes completely: the load climbs above 500 and the machine only comes back after a few minutes. During the incident I lose all 12 OSDs on that server (12 of 36). It's very strange.

Some information:

Infrastructure: 3 OSD servers with 12 OSD disks each plus SSD journals, 3 mon servers, and 3 Ceph RBD clients. A dedicated 10G network for clients and a dedicated 10G network for the OSDs. So 36 OSDs in total. Each server has 16 CPU cores (2 x E5-2630v3) and 32 GB of RAM, so resources are not a problem. Performance is good for 36 x 4 TB NL-SAS disks plus one write-intensive SSD per OSD server.

Issue: This morning (the previous incident was two days ago). See screenshot:
http://www.performance-conseil-informatique.net/wp-content/uploads/2017/03/screenshot_LOAD-1.png

As you can see, there is little I/O (just 2 clients, sometimes writing 150 MB/s for a few minutes); it's a big NAS for cold data. So during the incident there was no I/O at all, which is strange, and the same was true for the previous incidents. See screenshot:

http://www.performance-conseil-informatique.net/wp-content/uploads/2017/03/screenshot_OSD_IO.png

Before the incident: no activity. You can see all the OSD reads, OSD writes, journal (SSD) and I/O wait. From 07:07 to 07:09, two minutes with 12 of 36 OSDs completely lost. They come back afterwards, but I need to fix this. During the incident scrubbing was stopped as well, and the nightly trim had finished... no I/O. No other cron jobs on the servers, nothing. All servers have the same configuration.
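For the load spike itself I am digging through the sysstat history around the incident window, roughly like this; a sketch assuming sysstat's default daily data files under /var/log/sa (sa02 for 2 March):

  # load average / run queue around the 07:07-07:09 window
  sar -q -f /var/log/sa/sa02 -s 07:00:00 -e 07:15:00

  # per-disk utilisation and await over the same window
  sar -d -p -f /var/log/sa/sa02 -s 07:00:00 -e 07:15:00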
Logs: a lot of these:

  ceph-osd.3.log:2017-03-02 07:09:32.061754 7f6d501e4700 -1 osd.3 14557 heartbeat_check: no reply from 0x7f6dadb48c10 osd.19 since back 2017-03-02 07:07:53.286880 front 2017-03-02 07:07:53.286880 (cutoff 2017-03-02 07:09:12.061690)
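To see which peer OSDs stop replying, something like this should tally the heartbeat failures per peer; a rough sketch, assuming the default log location /var/log/ceph on the OSD host:

  # count "no reply" heartbeat failures by peer OSD
  grep -h 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log \
    | grep -oE 'osd\.[0-9]+ since' \
    | awk '{print $1}' | sort | uniq -c | sort -rn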
Sometimes I also see this in the logs:

  common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fc38a5e9425]
  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7fc38a528de1]
  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7fc38a52963e]
  4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7fc38a529e1c]
  5: (CephContextServiceThread::entry()+0x15b) [0x7fc38a6011ab]
  6: (()+0x7dc5) [0x7fc388304dc5]
  7: (clone()+0x6d) [0x7fc38698f73d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

That makes four freezes since the upgrade. Today I will raise the log level and restart everything to get more detailed logs.
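What I have in mind is roughly the following; a minimal sketch using runtime injection, with debug levels that are just my guess at a useful starting point:

  # raise OSD and messenger debug levels on all OSDs at runtime
  ceph tell osd.* injectargs '--debug_osd 10 --debug_ms 1'

  # or persistently in ceph.conf under [osd], followed by an OSD restart:
  #   debug osd = 10
  #   debug ms = 1

  # drop the levels back down once the next freeze has been captured
  ceph tell osd.* injectargs '--debug_osd 0/5 --debug_ms 0/5'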
Any ideas on how to troubleshoot this? (I am already digging through the sar statistics to find something...) Did something change around the heartbeat handling between 10.2.3 and 10.2.5? Should I consider downgrading to 10.2.3, or upgrading to Kraken?
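To rule out a changed default, I also plan to dump the heartbeat and suicide-timeout values the OSDs are actually running with; a minimal sketch via the admin socket, assuming the default socket path on the OSD host and using osd.0 as an example:

  # effective heartbeat / suicide-timeout settings of a running OSD
  ceph daemon osd.0 config show | grep -E 'heartbeat|suicide'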
Thanks for your help,
Regards,

rpm -qa | grep ceph
libcephfs1-10.2.5-0.el7.x86_64
ceph-common-10.2.5-0.el7.x86_64
ceph-mon-10.2.5-0.el7.x86_64
ceph-release-1-1.el7.noarch
ceph-10.2.5-0.el7.x86_64
ceph-radosgw-10.2.5-0.el7.x86_64
ceph-selinux-10.2.5-0.el7.x86_64
ceph-mds-10.2.5-0.el7.x86_64
python-cephfs-10.2.5-0.el7.x86_64
ceph-base-10.2.5-0.el7.x86_64
ceph-osd-10.2.5-0.el7.x86_64

uname -a
Linux ceph-osd-03 3.10.0-514.6.2.el7.x86_64 #1 SMP Thu Feb 23 03:04:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Ceph conf:

[global]
fsid = d26f269b-852f-4181-821d-756f213ae155
mon_initial_members = ceph-mon-01, ceph-mon-02, ceph-mon-03
mon_host = 192.168.43.147,192.168.43.148,192.168.43.149
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
max_open_files = 131072
public_network = 192.168.43.0/24
cluster_network = 192.168.44.0/24
osd_journal_size = 13000
osd_pool_default_size = 2        # Write an object n times.
osd_pool_default_min_size = 2    # Allow writing n copies in a degraded state.
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_crush_chooseleaf_type = 8
cephx_cluster_require_signatures = true
cephx_service_require_signatures = false
mon_pg_warn_max_object_skew = 0
mon_pg_warn_max_per_osd = 0

[mon]

[osd]
osd_max_backfills = 1
osd_recovery_priority = 3
osd_recovery_max_active = 3
osd_recovery_max_start = 3
filestore merge threshold = 40
filestore split multiple = 8
filestore xattr use omap = true
osd op threads = 8
osd disk threads = 4
osd op num threads per shard = 3
osd op num shards = 10
osd map cache size = 1024
osd_enable_op_tracker = false
osd_scrub_begin_hour = 20
osd_scrub_end_hour = 6

[client]
rbd_cache = true
rbd cache size = 67108864
rbd cache max dirty = 50331648
rbd cache target dirty = 33554432
rbd cache max dirty age = 2
rbd cache writethrough until flush = true
rbd readahead trigger requests = 10          # number of sequential requests necessary to trigger readahead
rbd readahead max bytes = 524288             # maximum size of a readahead request, in bytes
rbd readahead disable after bytes = 52428800

--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com