Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

David Monschein <monschein@xxxxxxxxx> · Fri, 22 Nov 2019 11:50:31 -0500

Hi all. We are running an object storage cluster on Ceph Nautilus 14.2.4.
We are running into what appears to be a serious bug affecting our fairly new object storage cluster. While investigating some performance issues (abnormally high IOPS and extremely slow bucket stat listings, taking over 3 minutes), we noticed some dynamic bucket resharding jobs running. Strangely enough, they were resharding buckets that had very few objects. Even more worrying was the number of new shards Ceph was planning: 65521.

[root@os1 ~]# radosgw-admin reshard list
[
    {
        "time": "2019-11-22 00:12:40.192886Z",
        "tenant": "",
        "bucket_name": "redacted",
        "bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
        "new_instance_id": "redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
        "old_num_shards": 1,
        "new_num_shards": 65521
    }
]

Upon further inspection we noticed a seemingly impossible number of objects (18446744073709551603) in rgw.none for the same bucket:
[root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
{
    "bucket": "redacted",
    "tenant": "",
    "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": ""
    },
    "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "index_type": "Normal",
    "owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
    "ver": "0#12623",
    "master_ver": "0#0",
    "mtime": "2019-11-22 00:18:41.753188Z",
    "max_marker": "0#",
    "usage": {
        "rgw.none": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 18446744073709551603
        },
        "rgw.main": {
            "size": 63410030,
            "size_actual": 63516672,
            "size_utilized": 63410030,
            "size_kb": 61924,
            "size_kb_actual": 62028,
            "size_kb_utilized": 61924,
            "num_objects": 27
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}
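
A quick sanity check on that rgw.none value: it is exactly 2^64 - 13, which is what -13 looks like when stored in an unsigned 64-bit integer. The sketch below only does the arithmetic. That the counter is an unsigned 64-bit value that was decremented below zero is an assumption on my part, not something confirmed from the RGW source, and the helper name is made up for illustration.

```python
# Reinterpret the reported rgw.none count as a signed 64-bit integer.
# Assumption: num_objects is stored as an unsigned 64-bit counter.

U64_MODULUS = 2 ** 64
I64_LIMIT = 2 ** 63


def as_signed_64(value: int) -> int:
    """Return the signed 64-bit interpretation of an unsigned 64-bit value."""
    if not 0 <= value < U64_MODULUS:
        raise ValueError("value does not fit in an unsigned 64-bit integer")
    # Values with the top bit set represent negative numbers.
    return value - U64_MODULUS if value >= I64_LIMIT else value


reported = 18446744073709551603  # num_objects from the rgw.none section above
print(as_signed_64(reported))    # -13
```

If that reading is right, the bucket does not hold an enormous number of objects; the counter was simply decremented 13 more times than it was incremented.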

It would seem that the impossible object count in rgw.none is driving the resharding process, making Ceph try to reshard the bucket into 65521 shards. I am assuming 65521 is the limit.
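
To check whether other buckets are affected, something like the following could scan saved bucket stats output for implausible counts. This is a minimal sketch: it works on JSON shaped like the output above (a single object or a list of them), the 2^63 threshold is an arbitrary cutoff I picked to catch wrapped counters, and the function name is hypothetical.

```python
import json

# Assumption: any count at or above 2**63 is a wrapped (negative) counter.
SUSPICIOUS_THRESHOLD = 2 ** 63


def find_wrapped_counters(stats):
    """Return (bucket, category, num_objects) for implausibly large counts."""
    # Accept either one bucket's stats or a list of bucket stats.
    if isinstance(stats, dict):
        stats = [stats]
    hits = []
    for bucket in stats:
        for category, usage in bucket.get("usage", {}).items():
            count = usage.get("num_objects", 0)
            if count >= SUSPICIOUS_THRESHOLD:
                hits.append((bucket.get("bucket"), category, count))
    return hits


# Trimmed-down copy of the stats shown above.
sample = json.loads("""
{
    "bucket": "redacted",
    "usage": {
        "rgw.none": {"size": 0, "num_objects": 18446744073709551603},
        "rgw.main": {"size": 63410030, "num_objects": 27},
        "rgw.multimeta": {"size": 0, "num_objects": 0}
    }
}
""")

for name, category, count in find_wrapped_counters(sample):
    print(f"{name}: {category} num_objects={count}")
```

Python's json module parses integers with arbitrary precision, so the oversized value survives loading intact; other JSON parsers may round it to a float.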

I have seen only a couple of references to this issue, none of which had a resolution or much of a conversation around them:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
https://tracker.ceph.com/issues/37942

For now we are cancelling these resharding jobs, since they seem to be causing performance issues with the cluster, but this is not a sustainable solution. Does anyone know what is causing this, or how to prevent or fix it?
Thanks,
Dave Monschein

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com