Boris,

To check if your issue is related to Rafael's, could you check your
access logs for requests on the missing objects which lasted longer
than one hour? I ask because Nautilus also has rgw_gc_obj_min_wait
(2hr by default), which is the main config option related to
https://tracker.ceph.com/issues/47866

-- Dan

On Thu, Jul 22, 2021 at 11:12 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi Rafael,
>
> AFAIU, that gc issue was not relevant for N -- the bug is in the new
> rgw_gc code which landed in Octopus and was not backported to N.
>
> Well, RHCEPH had the new rgw_gc cls backported to it, and RHCEPH has
> the bugfix you refer to:
> * Wed Dec 02 2020 Ceph Jenkins <ceph-jenkins@xxxxxxxxxx> 2:14.2.11-86
> - rgw: during GC defer, prevent new GC enqueue (rhbz#1892644)
> https://bugzilla.redhat.com/show_bug.cgi?id=1892644
>
> But still, I think it shouldn't apply to the upstream community
> Nautilus that we run.
>
> That said, this indeed looks really similar so perhaps Nautilus has
> similar faulty gc logic.
>
> Cheers, Dan
>
> On Thu, Jul 22, 2021 at 6:47 AM Rafael Lopez <rafael.lopez@xxxxxxxxxx> wrote:
> >
> > hi boris,
> >
> > We hit an issue late last year that sounds similar to what you are experiencing. I am not sure if the fix was backported to nautilus, I can't see any reference to a nautilus backport so it's possible it was only backported to octopus (15.x), exception being red hat ceph nautilus.
> >
> > https://tracker.ceph.com/issues/47866?next_issue_id=48255#note-59
> > https://www.mail-archive.com/ceph-users@xxxxxxx/msg05312.html
> >
> > Basically, a read request on a s3/swift object that took a very long time to complete would cause the associated rados data objects to be put in the GC queue, but the head object would still be present. So the s3 object would still show as present, `rados bi list` would show it (since head object was present) but the data objects would be gone, resulting in 404 NoSuchKey when retrieving the object.
> >
> > raf
> >
> > On Wed, 21 Jul 2021 at 18:12, Boris Behrens <bb@xxxxxxxxx> wrote:
> >>
> >> Good morning everybody,
> >>
> >> we've dug further into it but still don't know how this could happen.
> >> What we ruled out for now:
> >> * Orphan objects cleanup process.
> >> ** There is only one bucket with missing data (I checked all other
> >> buckets yesterday)
> >> ** The "keep these files" list is generated by radosgw-admin bucket
> >> rados list. I would doubt that there were files listed, that are
> >> accessible via radosgw
> >> ** The deleted files are somewhat random, but always with their
> >> corresponding counterparts (per folder there are 2-3 files that belong
> >> together)
> >>
> >> * Customer removed his data, but radosgw didn't clean up the bucket index
> >> ** there are no delete requests in the buckets usage log.
> >> ** customer told us, that they do not have a delete job for this bucket
> >>
> >> So I am lost with ideas that I could check, and hope that you people
> >> might be able to help with further ideas.
> >>
> >> --
> >> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend
> >> im groüen Saal.
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> > --
> > Rafael Lopez
> > Devops Systems Engineer
> > Monash University eResearch Centre
> >
> > E: rafael.lopez@xxxxxxxxxx
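
For the access-log check Dan asks about at the top of the thread (requests that lasted longer than one hour), something like the following could be a starting point. It is only a sketch under assumptions: it assumes the radosgw log contains "req done" lines that end in a `latency=<seconds>s` field, which may not match your version, log level, or frontend. The function name `slow_requests` and the regular expression are illustrative and not part of any Ceph tooling. If your durations come from a proxy log (haproxy, nginx) instead, the pattern would need to be adapted.

```python
import re
import sys
from typing import Iterable, Iterator, Tuple

# ASSUMPTION: log lines look roughly like
#   "... ====== req done req=0x55aa op status=0 http_status=200 latency=4012.5s ======"
# Adjust the pattern if your radosgw or proxy log is formatted differently.
REQ_DONE_RE = re.compile(
    r"req done req=(0x[0-9a-fA-F]+)\b.*?\blatency=([0-9]+(?:\.[0-9]+)?)s?"
)

ONE_HOUR_SECONDS = 3600.0


def slow_requests(
    lines: Iterable[str], threshold: float = ONE_HOUR_SECONDS
) -> Iterator[Tuple[str, float, str]]:
    """Yield (request id, latency in seconds, raw line) for slow requests."""
    for line in lines:
        match = REQ_DONE_RE.search(line)
        if match is None:
            continue  # not a "req done" line in the assumed format
        latency = float(match.group(2))
        if latency > threshold:
            yield match.group(1), latency, line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python slow_requests.py /path/to/radosgw.log
    with open(sys.argv[1], errors="replace") as log_file:
        for req_id, latency, raw in slow_requests(log_file):
            print(f"{req_id}\t{latency:.1f}s\t{raw}")
```

The request ids it prints still have to be matched by hand against the log lines that name the bucket and object, since the assumed "req done" line carries only the request id and the latency.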