Ceph S3 multisite replication issue

Hello,

I'm using Ceph 12.2.10 (Luminous) on Debian Stretch.

I have two clusters in two different datacenters, interconnected by a link with ~7 ms latency.

I set up S3 multisite replication between those datacenters, and it works fine except when I enable SSL.

My setup is the following:

- 2 radosgw instances on each site
- Nginx in front of each radosgw to handle SSL termination (I also use Nginx when the replication traffic is not encrypted); see the sketch just after this list
- 3 GSLB names: storage.mydomain.local, storage-dc1.mydomain.local, storage-dc2.mydomain.local
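
For context, the SSL termination in front of each radosgw is a plain Nginx reverse proxy, roughly like the minimal sketch below (not my exact config: the certificate paths, and the assumption that radosgw listens on plain HTTP via civetweb on port 7480, are placeholders):

server {
    listen 443 ssl;
    server_name storage-dc2.mydomain.local;

    # placeholder certificate paths
    ssl_certificate     /etc/nginx/ssl/storage-dc2.crt;
    ssl_certificate_key /etc/nginx/ssl/storage-dc2.key;

    # allow large S3 uploads through the proxy
    client_max_body_size 0;

    location / {
        # radosgw assumed to listen on plain HTTP (civetweb) on 7480
        proxy_pass http://127.0.0.1:7480;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

Enabling SSL for the replication flow then just means pointing the zone endpoints at the https GSLB names, along these lines (the zone name here is only an example):

radosgw-admin zone modify --rgw-zone=dc2 --endpoints=https://storage-dc2.mydomain.local
radosgw-admin period update --commit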

 

 

 

Replication itself works with or without SSL, but with SSL enabled, after some time (about one hour on average) the radosgw processes on the replicated (secondary) site see their CPU usage climb to 100% within a minute or so.

Looking at the logs, it seems to be looping, waiting for some operations to complete:

2018-12-06 10:25:36.743088 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bfa943800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:36.743108 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.743109 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.743119 7f48f30bb700 20 enqueued request req=0x563bf638c600
2018-12-06 10:25:36.743120 7f48f30bb700 20 RGWWQ:
2018-12-06 10:25:36.743121 7f48f30bb700 20 req: 0x563bf638c600
2018-12-06 10:25:36.743124 7f48f30bb700 20 run: stack=0x563c186c9590 is io blocked
2018-12-06 10:25:36.743173 7f48f92cd700 20 dequeued request req=0x563bf638c600
2018-12-06 10:25:36.743176 7f48f92cd700 20 RGWWQ: empty
2018-12-06 10:25:36.748138 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748154 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748155 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748156 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748161 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bfa943800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:36.748169 7f48f30bb700 20 run: stack=0x563c186c9590 is io blocked
2018-12-06 10:25:37.824409 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf72d3000:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.824425 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.824427 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.824440 7f48f30bb700 20 enqueued request req=0x563bf638c600
2018-12-06 10:25:37.824442 7f48f30bb700 20 RGWWQ:
2018-12-06 10:25:37.824442 7f48f30bb700 20 req: 0x563bf638c600
2018-12-06 10:25:37.824447 7f48f30bb700 20 run: stack=0x563c0f3a2690 is io blocked
2018-12-06 10:25:37.824528 7f48fead8700 20 dequeued request req=0x563bf638c600
2018-12-06 10:25:37.824531 7f48fead8700 20 RGWWQ: empty
2018-12-06 10:25:37.826461 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf78d9800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.826474 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.826476 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.826485 7f48f30bb700 20 enqueued request req=0x563bf28d6a00
2018-12-06 10:25:37.826487 7f48f30bb700 20 RGWWQ:
2018-12-06 10:25:37.826487 7f48f30bb700 20 req: 0x563bf28d6a00
2018-12-06 10:25:37.826492 7f48f30bb700 20 run: stack=0x563c0f3a4ee0 is io blocked
2018-12-06 10:25:37.826569 7f48ffada700 20 dequeued request req=0x563bf28d6a00
2018-12-06 10:25:37.826574 7f48ffada700 20 RGWWQ: empty
2018-12-06 10:25:37.827819 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827826 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827827 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827828 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827837 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf72d3000:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.827844 7f48f30bb700 20 run: stack=0x563c0f3a2690 is io blocked
2018-12-06 10:25:37.829124 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829132 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829134 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829134 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829141 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf78d9800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.829147 7f48f30bb700 20 run: stack=0x563c0f3a4ee0 is io blocked
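
(For anyone trying to reproduce this: the sync state on the secondary can be inspected with the standard radosgw-admin commands below; the source zone name is only a placeholder.)

radosgw-admin sync status
radosgw-admin data sync status --source-zone=dc1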

 

 

 

I see the same behavior both on a freshly built multisite cluster and on a single cluster migrated to multisite.

I have tested several radosgw configuration options (the rgw_curl* settings), but without conclusive results; the sketch below shows the kind of thing I tried.
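
(By rgw_curl* I mean radosgw options of this kind in ceph.conf; the section name and the value are only examples, not a recommendation:)

[client.rgw.dc2-rgw1]
    # example only: raise the libcurl wait timeout used by radosgw's HTTP client
    rgw curl wait timeout ms = 60000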

 

Any thoughts?

Thanks in advance.

Rémi

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
