Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

Hi all. We are running an object storage cluster with Ceph Nautilus 14.2.4.

We are running into what appears to be a serious bug affecting our fairly new object storage cluster. While investigating some performance issues (abnormally high IOPS and extremely slow bucket stat listings, taking over 3 minutes) we noticed some dynamic bucket resharding jobs running. Strangely enough, they were resharding buckets that had very few objects. Even more worrying was the number of new shards Ceph was planning: 65521.

[root@os1 ~]# radosgw-admin reshard list
[
    {
        "time": "2019-11-22 00:12:40.192886Z",
        "tenant": "",
        "bucket_name": "redacted",
        "bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
        "new_instance_id": "redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
        "old_num_shards": 1,
        "new_num_shards": 65521
    }
]

Upon further inspection we noticed a seemingly impossible number of objects (18446744073709551603) in rgw.none for the same bucket:

[root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
{
    "bucket": "redacted",
    "tenant": "",
    "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": ""
    },
    "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
    "index_type": "Normal",
    "owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
    "ver": "0#12623",
    "master_ver": "0#0",
    "mtime": "2019-11-22 00:18:41.753188Z",
    "max_marker": "0#",
    "usage": {
        "rgw.none": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 18446744073709551603
        },
        "rgw.main": {
            "size": 63410030,
            "size_actual": 63516672,
            "size_utilized": 63410030,
            "size_kb": 61924,
            "size_kb_actual": 62028,
            "size_kb_utilized": 61924,
            "num_objects": 27
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}

It would seem that the unreal number of objects in rgw.none is driving the resharding process, making Ceph try to reshard the bucket into 65521 shards. I am assuming 65521 is the maximum shard count.
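
One way to read the rgw.none figure: it is exactly 2^64 - 13, which is what a small negative count looks like when it is printed as an unsigned 64-bit integer. The sketch below only does that arithmetic. That the counter is an unsigned 64-bit value that dropped below zero is an assumption, not something confirmed from the Ceph source, and the helper name as_signed_64 is made up for illustration.

```python
# Assumption: the rgw.none counter is stored as an unsigned 64-bit integer.
U64_MODULUS = 2 ** 64


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit value."""
    if not 0 <= value < U64_MODULUS:
        raise ValueError("value does not fit in 64 bits")
    # Values in the upper half of the range map to negative numbers.
    return value - U64_MODULUS if value >= 2 ** 63 else value


reported = 18446744073709551603  # rgw.none num_objects from the bucket stats
print(as_signed_64(reported))    # -13
```

If that reading is right, the counter was decremented 13 more times than it was incremented, and the resharding logic then sees an enormous object count.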

I have seen only a couple of references to this issue, none of which had a resolution or much of a conversation around them:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
https://tracker.ceph.com/issues/37942

For now we are cancelling these resharding jobs since they seem to be causing performance issues with the cluster, but this is an untenable solution. Does anyone know what is causing this, or how to prevent or fix it?
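
For anyone wanting to check whether other buckets are affected, here is a minimal sketch that scans bucket stats JSON for counters too large to be real. It assumes the JSON has the same shape as the `radosgw-admin bucket stats` output above. The function name and the threshold (anything above the signed 64-bit maximum) are illustrative choices, not part of any Ceph tooling.

```python
import json

# Assumption: a count above the signed 64-bit maximum is a wrapped negative value.
SIGNED_64_MAX = 2 ** 63 - 1


def suspicious_counters(stats):
    """Return (bucket, category, num_objects) for implausibly large counters."""
    # Accept either a single stats object or a list of them.
    if isinstance(stats, dict):
        stats = [stats]
    found = []
    for entry in stats:
        for category, usage in entry.get("usage", {}).items():
            count = usage.get("num_objects", 0)
            if count > SIGNED_64_MAX:
                found.append((entry.get("bucket"), category, count))
    return found


# Trimmed copy of the stats shown earlier in this post.
sample = json.loads("""
{
    "bucket": "redacted",
    "usage": {
        "rgw.none": {"size": 0, "num_objects": 18446744073709551603},
        "rgw.main": {"size": 63410030, "num_objects": 27},
        "rgw.multimeta": {"size": 0, "num_objects": 0}
    }
}
""")

for bucket, category, count in suspicious_counters(sample):
    print(f"{bucket}: {category} num_objects={count}")
```

In practice the input could be the saved output of `radosgw-admin bucket stats --bucket=redacted` loaded with json.load instead of the inline sample.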

Thanks,
Dave Monschein
