Hi all,

I deployed a multi-site configuration to sync objects from an old cluster to a brand-new cluster. It seems to be working, since I can see the data syncing. However, when I check the cluster health, it shows the warning "2 daemons have recently crashed". I got the crash info with 'sudo ceph crash info $id':

{
    "os_version_id": "7",
    "utsname_release": "3.10.0-957.27.2.el7.x86_64",
    "os_name": "CentOS Linux",
    "entity_name": "client.rgw.ceph-node7",
    "timestamp": "2020-05-09 15:17:59.482502Z",
    "process_name": "radosgw",
    "utsname_machine": "x86_64",
    "utsname_sysname": "Linux",
    "os_version": "7 (Core)",
    "os_id": "centos",
    "utsname_version": "#1 SMP Mon Jul 29 17:46:05 UTC 2019",
    "backtrace": [
        "(()+0xf5f0) [0x7f32b1bdf5f0]",
        "(RGWCoroutine::set_sleeping(bool)+0xc) [0x555eeb1351ac]",
        "(RGWOmapAppend::flush_pending()+0x2d) [0x555eeb13acad]",
        "(RGWOmapAppend::finish()+0x10) [0x555eeb13acd0]",
        "(RGWDataSyncShardCR::stop_spawned_services()+0x2b) [0x555eeb0a185b]",
        "(RGWDataSyncShardCR::incremental_sync()+0x72a) [0x555eeb0a9baa]",
        "(RGWDataSyncShardCR::operate()+0x9d) [0x555eeb0ab33d]",
        "(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x60) [0x555eeb136520]",
        "(RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x236) [0x555eeb137196]",
        "(RGWCoroutinesManager::run(RGWCoroutine*)+0x78) [0x555eeb138098]",
        "(RGWRemoteDataLog::run_sync(int)+0x1cf) [0x555eeb08851f]",
        "(RGWDataSyncProcessorThread::process()+0x46) [0x555eeb1e71a6]",
        "(RGWRadosThread::Worker::entry()+0x115) [0x555eeb1b6195]",
        "(()+0x7e65) [0x7f32b1bd7e65]",
        "(clone()+0x6d) [0x7f32b10e188d]"
    ],
    "utsname_hostname": "ceph-node7",
    "crash_id": "2020-05-09_15:17:59.482502Z_b80d7bee-faa0-4d2f-9d86-a1b3f4d4802e",
    "ceph_version": "14.2.8"
}

and

{
    "os_version_id": "7",
    "utsname_release": "3.10.0-957.27.2.el7.x86_64",
    "os_name": "CentOS Linux",
    "entity_name": "client.rgw.ceph-node7",
    "timestamp": "2020-05-10 16:23:13.375063Z",
    "process_name": "radosgw",
    "utsname_machine": "x86_64",
    "utsname_sysname": "Linux",
    "os_version": "7 (Core)",
    "os_id": "centos",
    "utsname_version": "#1 SMP Mon Jul 29 17:46:05 UTC 2019",
    "backtrace": [
        "(()+0xf5f0) [0x7f409f42e5f0]",
        "(RGWCoroutine::set_sleeping(bool)+0xc) [0x55e3f45e01ac]",
        "(RGWOmapAppend::flush_pending()+0x2d) [0x55e3f45e5cad]",
        "(RGWOmapAppend::finish()+0x10) [0x55e3f45e5cd0]",
        "(RGWDataSyncShardCR::stop_spawned_services()+0x2b) [0x55e3f454c85b]",
        "(RGWDataSyncShardCR::incremental_sync()+0x72a) [0x55e3f4554baa]",
        "(RGWDataSyncShardCR::operate()+0x9d) [0x55e3f455633d]",
        "(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x60) [0x55e3f45e1520]",
        "(RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x236) [0x55e3f45e2196]",
        "(RGWCoroutinesManager::run(RGWCoroutine*)+0x78) [0x55e3f45e3098]",
        "(RGWRemoteDataLog::run_sync(int)+0x1cf) [0x55e3f453351f]",
        "(RGWDataSyncProcessorThread::process()+0x46) [0x55e3f46921a6]",
        "(RGWRadosThread::Worker::entry()+0x115) [0x55e3f4661195]",
        "(()+0x7e65) [0x7f409f426e65]",
        "(clone()+0x6d) [0x7f409e93088d]"
    ],
    "utsname_hostname": "ceph-node7",
    "crash_id": "2020-05-10_16:23:13.375063Z_9e70a0c0-929e-445f-b4cd-8d29e909fe2f",
    "ceph_version": "14.2.8"
}
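Both backtraces look identical to me: they end in the data sync coroutine path (RGWDataSyncShardCR::incremental_sync -> stop_spawned_services -> RGWOmapAppend::finish -> flush_pending -> RGWCoroutine::set_sleeping), so it seems to be the same crash happening twice. For reference, this is roughly how I pulled the reports, and how I assume I can acknowledge them later so the health warning clears:

    # list the crashes collected by the crash module (gives the crash ids)
    sudo ceph crash ls
    # full JSON report for one crash (this is what is pasted above)
    sudo ceph crash info <crash_id>
    # acknowledge a crash once investigated so the HEALTH_WARN goes away
    sudo ceph crash archive <crash_id>
    # or acknowledge all of them at once
    sudo ceph crash archive-all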
So I fetched and checked the file "ceph-client.rgw.ceph-node7.log". The log has a huge number of errors like:

    -732> 2020-05-09 23:17:53.476 7f328b7ff700 0 RGW-SYNC:data:sync:shard[98]:entry[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]:bucket[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]:inc_sync[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:23]: ERROR: lease is not taken, abort

and

    -723> 2020-05-09 23:17:56.388 7f328b7ff700 5 RGW-SYNC:data:sync:shard[88]:entry[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:13]:bucket[harbor-registry:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4620.94:13]: incremental sync on bucket failed, retcode=-125

and

    -215> 2020-05-09 23:17:58.809 7f328b7ff700 5 RGW-SYNC:data:sync:shard[10]:entry[pf2-harbor-swift:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4608.101:113]:bucket[pf2-harbor-swift:f70a5eb9-d88d-42fd-ab4e-d300e97094de.4608.101:113]: full sync on bucket failed, retcode=-125

and

    2020-05-09 23:18:24.048 7f4085867700 1 robust_notify: If at first you don't succeed: (110) Connection timed out
    2020-05-09 23:18:24.048 7f4083863700 0 ERROR: failed to distribute cache for shubei.rgw.log:datalog.sync-status.shard.f70a5eb9-d88d-42fd-ab4e-d300e97094de.5
    2020-05-09 23:28:49.181 7f407e859700 1 heartbeat_map reset_timeout 'RGWAsyncRadosProcessor::m_tp thread 0x7f407e859700' had timed out after 600
    2020-05-10 03:12:01.905 7f409708a700 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror

And finally it crashed. I'm not sure where the problem is. Were the crashes caused by the network?

Thanks
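P.S. A couple of details in case they help with diagnosis: retcode=-125 appears to be -ECANCELED, and the (110) in the robust_notify line is ETIMEDOUT, which is part of why I suspect the network between the two sites. If more sync state would be useful, I can run commands along these lines on the new zone and post the output (the source zone and bucket below are just examples taken from the log):

    # overall multisite sync status for this zone
    sudo radosgw-admin sync status
    # data sync status against the old (source) zone
    sudo radosgw-admin data sync status --source-zone=<old-zone>
    # sync status of one of the buckets named in the errors
    sudo radosgw-admin bucket sync status --bucket=harbor-registry
    # metadata sync status and any recorded sync errors
    sudo radosgw-admin metadata sync status
    sudo radosgw-admin sync error list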