Re: "Lost" buckets on radosgw

On Mon, Nov 21, 2016 at 3:14 PM, Graham Allan <gta@xxxxxxx> wrote:
>
>
> On 11/21/2016 04:44 PM, Yehuda Sadeh-Weinraub wrote:
>>
>> On Mon, Nov 21, 2016 at 2:42 PM, Graham Allan <gta@xxxxxxx> wrote:
>>>
>>> Following up to this (same problem, looking at it with Jeff)...
>>>
>>> There was definite confusion with the zone/zonegroup/realm/period changes
>>> during the hammer->jewel upgrade. It's possible that our placement settings
>>> got misconfigured at that point.
>>>
>>> However, what I find puzzling is that only some buckets from the same pool
>>> seem affected - if this were placement-related, I'd expect all buckets in
>>> one pool to be affected and those in another pool not to be. Am I
>>> interpreting this wrongly?
>>>
>>> For example here is one bucket which remains accessible:
>>>
>>>> # radosgw-admin metadata get bucket.instance:gta:default.691974.1
>>>> {
>>>>     "key": "bucket.instance:gta:default.691974.1",
>>>>     "ver": {
>>>>         "tag": "_3Z9nfFjZn97aV2YJ4nFhVuk",
>>>>         "ver": 85
>>>>     },
>>>>     "mtime": "2016-11-11 16:48:02.950760Z",
>>>>     "data": {
>>>>         "bucket_info": {
>>>>             "bucket": {
>>>>                 "name": "gta",
>>>>                 "pool": ".rgw.buckets.ec42",
>>>>                 "data_extra_pool": ".rgw.buckets.extra",
>>>>                 "index_pool": ".rgw.buckets.index",
>>>>                 "marker": "default.691974.1",
>>>>                 "bucket_id": "default.691974.1",
>>>>                 "tenant": ""
>>>>             },
>>>>             "creation_time": "2015-11-13 20:05:26.000000Z",
>>>>             "owner": "gta",
>>>>             "flags": 0,
>>>>             "zonegroup": "default",
>>>>             "placement_rule": "ec42-placement",
>>>>             "has_instance_obj": "true",
>>>>             "quota": {
>>>>                 "enabled": false,
>>>>                 "max_size_kb": -1,
>>>>                 "max_objects": -1
>>>>             },
>>>>             "num_shards": 32,
>>>>             "bi_shard_hash_type": 0,
>>>>             "requester_pays": "false",
>>>>             "has_website": "false",
>>>>             "swift_versioning": "false",
>>>>             "swift_ver_location": ""
>>>>         },
>>>>         "attrs": [
>>>>             {
>>>>                 "key": "user.rgw.acl",
>>>>                 "val":
>>>>
>>>> "AgJ\/AAAAAgIXAAAAAwAAAGd0YQwAAABHcmFoYW0gQWxsYW4DA1wAAAABAQAAAAMAAABndGEPAAAAAQAAAAMAAABndGEDAzcAAAACAgQAAAAAAAAAAwAAAGd0YQAAAAAAAAAAAgIEAAAADwAAAAwAAABHcmFoYW0gQWxsYW4AAAAAAAAAAA=="
>>>>             },
>>>>             {
>>>>                 "key": "user.rgw.idtag",
>>>>                 "val": ""
>>>>             },
>>>>             {
>>>>                 "key": "user.rgw.manifest",
>>>>                 "val": ""
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>
>>>
>>>
>>> while here is another, located in the same pool, which is not accessible:
>>>
>>>> # radosgw-admin metadata get bucket.instance:tcga:default.712449.19
>>>> {
>>>>     "key": "bucket.instance:tcga:default.712449.19",
>>>>     "ver": {
>>>>         "tag": "_vm0Og31XbhhtmnuQVZ6cYJP",
>>>>         "ver": 2010
>>>>     },
>>>>     "mtime": "2016-11-19 03:49:03.406938Z",
>>>>     "data": {
>>>>         "bucket_info": {
>>>>             "bucket": {
>>>>                 "name": "tcga",
>>>>                 "pool": ".rgw.buckets.ec42",
>>>>                 "data_extra_pool": ".rgw.buckets.extra",
>>>>                 "index_pool": ".rgw.buckets.index",
>>>>                 "marker": "default.712449.19",
>>>>                 "bucket_id": "default.712449.19",
>>>>                 "tenant": ""
>>>>             },
>>>>             "creation_time": "2016-01-21 20:51:21.000000Z",
>>>>             "owner": "jmcdonal",
>>>>             "flags": 0,
>>>>             "zonegroup": "default",
>>>>             "placement_rule": "ec42-placement",
>>>>             "has_instance_obj": "true",
>>>>             "quota": {
>>>>                 "enabled": false,
>>>>                 "max_size_kb": -1,
>>>>                 "max_objects": -1
>>>>             },
>>>>             "num_shards": 0,
>>>>             "bi_shard_hash_type": 0,
>>>>             "requester_pays": "false",
>>>>             "has_website": "false",
>>>>             "swift_versioning": "false",
>>>>             "swift_ver_location": ""
>>>>         },
>>>>         "attrs": [
>>>>             {
>>>>                 "key": "user.rgw.acl",
>>>>                 "val":
>>>>
>>>> "AgKbAAAAAgIgAAAACAAAAGptY2RvbmFsEAAAAEplZmZyZXkgTWNEb25hbGQDA28AAAABAQAAAAgAAABqbWNkb25hbA8AAAABAAAACAAAAGptY2RvbmFsAwNAAAAAAgIEAAAAAAAAAAgAAABqbWNkb25hbAAAAAAAAAAAAgIEAAAADwAAABAAAABKZWZmcmV5IE1jRG9uYWxkAAAAAAAAAAA="
>>>>             },
>>>>             {
>>>>                 "key": "user.rgw.idtag",
>>>>                 "val": ""
>>>>             },
>>>>             {
>>>>                 "key": "user.rgw.manifest",
>>>>                 "val": ""
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>
>>>
>>>
>>> If I do "rados ls --pool .rgw.buckets.ec42 | grep default.712449.19" I can
>>> see objects with the above bucket ID, and can fetch them, so I know the
>>> data is there...
>>>
>>> Does this seem like a placement_pool issue, or maybe some other unrelated
>>> issue?
>>>
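(A quick way to sanity-check the placement side, if it helps - command shown
as a suggestion only: dump the zone config and compare its placement targets
against the pools recorded in the bucket instances above, e.g.

  radosgw-admin zone get

and look at the ec42-placement entry under placement_pools. If it still points
at .rgw.buckets.ec42 for data and .rgw.buckets.index for the index, matching
the bucket_info shown above, placement is probably not what's breaking these
buckets.)
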
>>
>> Could be another semi-related issue. Can you provide output of the
>> commands that fail with 'debug rgw = 20' and 'debug ms = 1'?
>>
>> Thanks,
>> Yehuda
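(One way to capture that, in case it's useful - names here are only examples,
adjust for your deployment. For the admin tool the debug levels can be passed
per-invocation on the command line:

  radosgw-admin bucket stats --bucket=tcga --debug-rgw=20 --debug-ms=1 \
      2>&1 | tee bucket-stats.log

For the long-running radosgw, set them in its ceph.conf section, e.g.

  [client.rgw.gateway1]        # hypothetical instance name
      debug rgw = 20
      debug ms = 1

or at runtime through its admin socket with
"ceph daemon <rgw-client-name> config set debug_rgw 20/20".)
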
>
>
> I captured some radosgw log output while trying to access the failing bucket
> above (bucket.instance:tcga:default.712449.19); it was a bit too long to
> include inline, so I put it online here as well as attaching it:
>
> http://pastebin.com/F5HJ9EeQ
>
> I also ran "radosgw-admin bucket stats --bucket=tcga", which errors as well
> but produces even more output - I won't attach that unless you think it
> would be useful.
>
> In both cases it seems related to a directory lookup failure, e.g.:
>
>> 2016-11-21 16:48:43.489328 7faeea8f1900  1 -- 10.32.16.93:0/3999194905 -->
>> 10.31.0.70:6851/1771318 -- osd_op(client.2588139.0:160 100.4d86b68f
>> .dir.default.712449.19 [call rgw.bucket_list] snapc 0=[]
>> ack+read+known_if_redirected e455622) v7 -- ?+0 0x7faeeb5acfa0 con
>> 0x7faeeb54baa0
>> 2016-11-21 16:48:43.490411 7fae547ce700  1 -- 10.32.16.93:0/3999194905 <==
>> osd.114 10.31.0.70:6851/1771318 5 ==== osd_op_reply(160
>> .dir.default.712449.19 [call] v0'0 uv0 ack = -2 ((2) No such file or
>> directory)) v7 ==== 142+0+0 (2988442841 0 0) 0x7fadfc000c20 con
>> 0x7faeeb54baa0
>> error getting bucket stats ret=-2
>
>
> Would that be related to these objects in the index pool?
>
>> # rados ls --pool .rgw.buckets.index |grep default.712449.19
>> .dir.default.712449.19.25
>> .dir.default.712449.19.4
>> .dir.default.712449.19.10
>> .dir.default.712449.19.15
>> .dir.default.712449.19.28
>> .dir.default.712449.19.8
>> .dir.default.712449.19.26
>> .dir.default.712449.19.22
>> .dir.default.712449.19.11
>> .dir.default.712449.19.6
>> .dir.default.712449.19.20
>> .dir.default.712449.19.7
>> .dir.default.712449.19.14
>> .dir.default.712449.19.2
>> .dir.default.712449.19.21
>> .dir.default.712449.19.30
>> .dir.default.712449.19.12
>> .dir.default.712449.19.27
>> .dir.default.712449.19.0
>> .dir.default.712449.19.1
>> .dir.default.712449.19.31
>> .dir.default.712449.19.29
>> .dir.default.712449.19.13
>> .dir.default.712449.19.16
>> .dir.default.712449.19.24
>> .dir.default.712449.19.18
>> .dir.default.712449.19.19
>> .dir.default.712449.19.23
>> .dir.default.712449.19.9
>> .dir.default.712449.19.5
>> .dir.default.712449.19.17
>> .dir.default.712449.19.3
>
>
>

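(Note the mismatch there: the failing osd_op asks for the unsharded index
object .dir.default.712449.19, which doesn't exist, while the index pool only
holds the 32 per-shard objects .dir.default.712449.19.0 through .31. If you
want to double-check from the rados side, something like

  rados -p .rgw.buckets.index stat .dir.default.712449.19     # expect ENOENT
  rados -p .rgw.buckets.index stat .dir.default.712449.19.0   # should succeed

should show exactly that.)
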
It seems like the bucket was sharded, but for some reason the bucket instance
info doesn't reflect that. I don't know why that would happen, but maybe
running mixed versions could be the culprit. You can try modifying the bucket
instance info for this bucket, changing the num_shards param to 32 to match
the shard objects above.
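
A rough sketch of that edit (untested here, so keep a copy of the original
JSON before writing anything back):

  # dump the current bucket instance metadata
  radosgw-admin metadata get bucket.instance:tcga:default.712449.19 > tcga.json
  cp tcga.json tcga.json.orig

  # in tcga.json, under data.bucket_info, change
  #     "num_shards": 0,
  # to
  #     "num_shards": 32,

  # write the modified metadata back
  radosgw-admin metadata put bucket.instance:tcga:default.712449.19 < tcga.json

  # then retry the failing operations
  radosgw-admin bucket stats --bucket=tcga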

Yehuda
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


