Hi!
I created a tracker ticket:
http://tracker.ceph.com/issues/22721
It also happens without a lifecycle rule (only versioning).
I collected a log from the resharding process and canceled it after 10 minutes. The log is 500 MB (still 20 MB gzipped), so I cannot upload it to the bug tracker.
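In case it helps with reproducing: cancelling the pending reshard job can be done roughly like this (just a sketch, the bucket name is a placeholder):

    # list pending/ongoing reshard jobs, then cancel the one for the affected bucket
    radosgw-admin reshard list
    radosgw-admin reshard cancel --bucket=mybucket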
Regards,
Martin
From: Orit Wasserman <owasserm@xxxxxxxxxx>
Date: Wednesday, 17 January 2018, 11:57
To: Martin Emrich <martin.emrich@xxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Bug in RadosGW resharding? Hangs again...
On Wed, Jan 17, 2018 at 11:45 AM, Martin Emrich <martin.emrich@xxxxxxxxxxx>
wrote:
Hi Orit!
I did some tests, and indeed the combination of Versioning/Lifecycle with Resharding is the problem:
- If I do not enable Versioning/Lifecycle, auto-resharding works fine.
- If I disable auto-resharding but enable Versioning+Lifecycle, pushing data works fine until I manually reshard; the manual reshard then also hangs (see the sketch below).
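(For reference, by "manually reshard" I mean something along these lines; the bucket name and shard count are placeholders:)

    # trigger an immediate manual reshard of the bucket index
    radosgw-admin bucket reshard --bucket=mybucket --num-shards=128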
Thanks for testing :) This is very helpful!
My lifecycle rule (which should expire all noncurrent versions older than 60 days):
{
    "Rules": [{
        "Status": "Enabled",
        "Prefix": "",
        "NoncurrentVersionExpiration": {
            "NoncurrentDays": 60
        },
        "Expiration": {
            "ExpiredObjectDeleteMarker": true
        },
        "ID": "expire-60days"
    }]
}
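(For completeness, I apply the rule against the RGW S3 endpoint roughly like this; endpoint URL, bucket name and file name are placeholders:)

    # apply the lifecycle rule from the JSON file above to the bucket
    aws --endpoint-url http://rgw.example.com:7480 s3api put-bucket-lifecycle-configuration \
        --bucket mybucket --lifecycle-configuration file://lifecycle.json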
I am currently testing with an application containing customer data, but I am also generating some random test data so that I can produce logs I can share.
I will also test whether the versioning itself is the culprit, or if it is the lifecycle rule.
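(For that test I would enable versioning by itself, roughly like this; again, endpoint URL and bucket name are placeholders:)

    # enable versioning on the bucket without attaching any lifecycle rule
    aws --endpoint-url http://rgw.example.com:7480 s3api put-bucket-versioning \
        --bucket mybucket --versioning-configuration Status=Enabled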
I suspect versioning (I have never tried it together with resharding).
Can you open a tracker issue with all the information?
Regards,
Martin
Hi Martin,
On Mon, Jan 15, 2018 at 6:04 PM, Martin Emrich <martin.emrich@xxxxxxxxxxx>
wrote:
Hi!
After ending up with a completely broken radosgw setup due to damaged buckets, I deleted all rgw pools and started from scratch.
But my problem is reproducible: after I push about 100,000 objects into a bucket, the resharding process appears to start, and the bucket becomes unresponsive.
Can you share radosgw logs with --debug_rgw=20 --debug_ms=1?
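(For example by setting something like the following in ceph.conf for the gateway and restarting it; the section name is a placeholder for your rgw instance:)

    # raise rgw and messenger log levels for the gateway daemon
    [client.rgw.myhost]
        debug_rgw = 20
        debug_ms = 1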
I just see lots of these messages in all rgw logs:
2018-01-15 16:57:45.108826 7fd1779b1700 0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.119184 7fd1779b1700 0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.260751 7fd1120e6700 0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.280410 7fd1120e6700 0 NOTICE: resharding operation on bucket index detected, blocking
2018-01-15 16:57:45.300775 7fd15b979700 0 block_while_resharding ERROR: bucket is still resharding, please retry
2018-01-15 16:57:45.300971 7fd15b979700 0 WARNING: set_req_state_err err_no=2300 resorting to 500
2018-01-15 16:57:45.301042 7fd15b979700 0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error
One radosgw process and two OSDs housing the bucket index/metadata are still busy, but it seems to be stuck again.
How long is this resharding process supposed to take? I cannot believe that an application is supposed to block for more than half an hour...
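(For what it's worth, the state of the reshard job can be inspected roughly like this; the bucket name is a placeholder:)

    # show progress/state of the resharding operation for the bucket
    radosgw-admin reshard status --bucket=mybucket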
I feel inclined to open a bug report, but I am still unsure where the problem lies.
Some information:
* 3 RGW processes, 3 OSD hosts with 12 HDD OSDs and 6 SSD OSDs
* Ceph 12.2.2
* Auto-Resharding on, Bucket Versioning & Lifecycle rule enabled.
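(The relevant gateway settings look roughly like this; the section name is a placeholder and rgw_max_objs_per_shard is at its default of 100000, which matches the point at which the hang appears:)

    # dynamic resharding enabled, default objects-per-shard threshold
    [client.rgw.myhost]
        rgw_dynamic_resharding = true
        rgw_max_objs_per_shard = 100000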
What life cycle rules do you use?
Thanks,
Martin
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com