S3 object appears in ListObjects but returns 404 on GET

We recently received a problem report from an internal customer of our S3
service. Our setup consists of roughly 10 servers with 140 OSDs. Our 3 RGWs are
collocated with the monitors on dedicated servers in an HA setup with HAProxy
in front. We are running Ceph 16.2.14 on Podman with Cephadm.

Our S3 service handles sustained traffic averaging 500 req/s per RGW instance.

The problem is described in this issue: https://tracker.ceph.com/issues/63935.

This customer runs a Grafana Mimir instance that writes to our S3, and during
compaction it issues the following request pattern:

29/Dec/2023:17:13:28.961 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/127/127 200 228 - - ---- 132/132/70/67/0 0/0 "PUT /1234/object HTTP/1.1" 
29/Dec/2023:17:13:29.101 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/76/71/0 0/0 "GET /1234/object HTTP/1.1" 
29/Dec/2023:17:13:29.121 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/71/59/0 0/0 "GET /1234/object HTTP/1.1" 
29/Dec/2023:17:13:29.137 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/4/4 204 153 - - ---- 132/132/71/6/0 0/0 "DELETE /1234/object HTTP/1.1" 
29/Dec/2023:19:03:21.671 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/1/1 404 472 - - ---- 55/55/26/0/0 0/0 "GET /1234/object HTTP/1.1" 
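
To make the timing in this excerpt explicit, here is a minimal Python sketch
that parses the five lines above and reports the gap between the PUT and the
DELETE, plus which RGW backend served each. The field positions are an
assumption based only on the log layout shown here, and the helper name
parse_line is ours, not part of any tool.

```python
from datetime import datetime

# The five HAProxy log lines from the excerpt above, copied as-is.
LOG_LINES = [
    '29/Dec/2023:17:13:28.961 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/127/127 200 228 - - ---- 132/132/70/67/0 0/0 "PUT /1234/object HTTP/1.1"',
    '29/Dec/2023:17:13:29.101 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/76/71/0 0/0 "GET /1234/object HTTP/1.1"',
    '29/Dec/2023:17:13:29.121 rgw-frontend~ rgw-backend/server-mon-01-rgw0 0/0/0/1/1 200 381 - - ---- 132/132/71/59/0 0/0 "GET /1234/object HTTP/1.1"',
    '29/Dec/2023:17:13:29.137 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/4/4 204 153 - - ---- 132/132/71/6/0 0/0 "DELETE /1234/object HTTP/1.1"',
    '29/Dec/2023:19:03:21.671 rgw-frontend~ rgw-backend/server-mon-03-rgw0 0/0/0/1/1 404 472 - - ---- 55/55/26/0/0 0/0 "GET /1234/object HTTP/1.1"',
]


def parse_line(line):
    """Return (timestamp, backend, status, method, path) for one log line.

    Assumes the layout seen above: timestamp, frontend, backend/server,
    timers, status, ... and the quoted request at the end.
    """
    head, _, request = line.partition('"')
    fields = head.split()
    method, path, _version = request.rstrip().rstrip('"').split()
    ts = datetime.strptime(fields[0], "%d/%b/%Y:%H:%M:%S.%f")
    return ts, fields[2], int(fields[4]), method, path


events = [parse_line(line) for line in LOG_LINES]
put = next(e for e in events if e[3] == "PUT")
delete = next(e for e in events if e[3] == "DELETE")

# Elapsed time between the PUT and the DELETE, in milliseconds.
gap_ms = (delete[0] - put[0]).total_seconds() * 1000
print(f"PUT -> DELETE gap: {gap_ms:.0f} ms")
print(f"PUT served by {put[1]}, DELETE served by {delete[1]}")
```

In this excerpt the PUT and both successful GETs went to server-mon-01-rgw0,
while the DELETE and the later 404 went to server-mon-03-rgw0. We do not know
whether that matters; it is only what the log shows.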

The client issues a PUT, two GETs and a DELETE on the same key within the same
second. Afterwards the customer still sees the deleted object when listing the
bucket with ListObjects, but any attempt to access it makes RGW return a 404.
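
For anyone who wants to check their own buckets for the same symptom, below is
a rough Python sketch that lists a bucket and issues a HEAD on every key,
collecting the keys that are listed but answer 404. It assumes a boto3-style S3
client (list_objects_v2 and head_object are the boto3 method names); the
function name find_ghost_keys, the endpoint URL and the bucket name in the
comment are placeholders of ours.

```python
def find_ghost_keys(client, bucket, prefix=""):
    """Return keys that the listing reports but that answer 404 on HEAD.

    `client` is expected to behave like a boto3 S3 client.
    """
    ghosts = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for entry in page.get("Contents", []):
            key = entry["Key"]
            try:
                client.head_object(Bucket=bucket, Key=key)
            except Exception as exc:
                # botocore's ClientError exposes the parsed reply as .response;
                # anything that is not a plain 404 is re-raised untouched.
                meta = getattr(exc, "response", {}).get("ResponseMetadata", {})
                if meta.get("HTTPStatusCode") == 404:
                    ghosts.append(key)
                else:
                    raise
        if not page.get("IsTruncated"):
            return ghosts
        # Continue with the next page of the listing.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


# Example usage (placeholder endpoint and bucket, requires boto3):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")
#   print(find_ghost_keys(s3, "mimir-blocks"))
```

This only confirms the symptom from the client side. It issues one HEAD per
listed key, so on a large bucket it is worth restricting it with a prefix.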

Looking at the Ceph side, the object still has a bucket index entry but the
associated RADOS object no longer exists. The bucket has neither versioning
nor object locking enabled.

Has anyone encountered something similar? Thank you!


Mathias Chapelain
Storage Engineer
Proton AG
