radosgw multisite replication segfaults on init in 13.2.6

Hi,

We have a standalone Ceph cluster running v13.2.6 and wanted to replicate it to another DC. After going through "Migrating a Single Site
System to Multi-Site" and "Configure a Secondary Zone" from http://docs.ceph.com/docs/master/radosgw/multisite/, we set all buckets to
"disable replication" and started replication. To our surprise, a few minutes after the start, new pools named
default.rgw.buckets.{index,data} appeared and started receiving data.
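For reference, these are roughly the commands involved (a sketch from memory following the docs; the realm/zonegroup names, endpoints and keys below are placeholders, not our real ones):

    # on the existing site: promote the single-site setup to a multi-site master zone
    radosgw-admin realm create --rgw-realm=myrealm --default
    radosgw-admin zonegroup rename --rgw-zonegroup default --zonegroup-new-name=main
    radosgw-admin zone rename --rgw-zone default --zone-new-name=dc1_zone --rgw-zonegroup=main
    radosgw-admin period update --commit

    # on the secondary site: pull the realm and create the secondary zone
    radosgw-admin realm pull --url=http://dc1-rgw:8080 --access-key=KEY --secret=SECRET
    radosgw-admin zone create --rgw-zonegroup=main --rgw-zone=dc2_zone \
        --endpoints=http://dc2-rgw:8080 --access-key=KEY --secret=SECRET
    radosgw-admin period update --commit

    # per bucket: keep sync disabled until we enable it explicitly
    radosgw-admin bucket sync disable --bucket=<bucket-name>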

The index data ended up split between two pools, as shown below:
    NAME                           ID        USED     %USED     MAX AVAIL     OBJECTS
    dc2_zone.rgw.control           35         0 B         0       118 TiB           8
    dc2_zone.rgw.meta              36     714 KiB         0       118 TiB        2895
    dc2_zone.rgw.log               37      14 KiB         0       118 TiB         734
    dc2_zone.rgw.buckets.index     38         0 B         0       565 GiB        7203
    default.rgw.buckets.index      39         0 B         0       565 GiB        4204
    dc2_zone.rgw.buckets.data      40     933 MiB         0       118 TiB        2605
The bucket indexes on the secondary site were inconsistent as a result.
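The pool names a zone writes to come from its placement configuration, so data landing in default.rgw.buckets.* suggests some requests were handled with the default zone settings. A generic way to check (not a capture from our cluster):

    # show the secondary zone's config; placement_pools should point at
    # dc2_zone.rgw.buckets.index and dc2_zone.rgw.buckets.data
    radosgw-admin zone get --rgw-zone=dc2_zone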

In the logs from the radosgw instance set up as the endpoint for the secondary zone, we found these lines:

-10001> 2019-06-14 11:41:45.701 7f46f0959700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f46f0959700 thread_name:data-sync
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0xf5d0) [0x7f4739e1c5d0]
2: (RGWCoroutine::set_sleeping(bool)+0xc) [0x5561c28ffe0c]
3: (RGWOmapAppend::flush_pending()+0x2d) [0x5561c2904e1d]
4: (RGWOmapAppend::finish()+0x10) [0x5561c2904f00]
5: (RGWDataSyncShardCR::stop_spawned_services()+0x30) [0x5561c2b44320]
6: (RGWDataSyncShardCR::incremental_sync()+0x4c6) [0x5561c2b5d736]
7: (RGWDataSyncShardCR::operate()+0x75) [0x5561c2b5f0e5]
8: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x46) [0x5561c28fd566]
9: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x293) [0x5561c2900233]
10: (RGWCoroutinesManager::run(RGWCoroutine*)+0x78) [0x5561c2901108]
11: (RGWRemoteDataLog::run_sync(int)+0x1e7) [0x5561c2b36d37]
12: (RGWDataSyncProcessorThread::process()+0x46) [0x5561c29bacb6]
13: (RGWRadosThread::Worker::entry()+0x22b) [0x5561c295c4cb]
14: (()+0x7dd5) [0x7f4739e14dd5]
15: (clone()+0x6d) [0x7f472e306ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

For now, we have worked around this by setting the pool names in the secondary zone to default.*. Everything looks fine so far, so we are
gradually enabling replication for the remaining buckets and observing the situation.
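In case it helps, the workaround boiled down to editing the zone's placement pools (a sketch; the JSON itself is edited by hand):

    # dump the secondary zone config, point its pools at default.*,
    # load it back and commit the period
    radosgw-admin zone get --rgw-zone=dc2_zone > zone.json
    # in zone.json: index_pool -> default.rgw.buckets.index,
    #               data_pool  -> default.rgw.buckets.data
    radosgw-admin zone set --rgw-zone=dc2_zone --infile zone.json
    radosgw-admin period update --commit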

Has anyone seen similar behaviour?

Best Regards,
Tomasz Płaza