Re: RGW bucket reshard fails with ERROR: bi_list(): (4) Interrupted system call

As a matter of fact, yes! The data sync was falling way behind on one
shard, and I had to shut down the RGWs in the second site to keep the
RGWs in the master from allocating all available memory and getting
killed by the oom_killer. I'm not sure why that shard won't sync, but my
guess is that it's related to a delete operation in the oversized bucket:
a long-running rm command against it was aborted after running for over
7 days.
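
For reference, the commands I believe are relevant for checking this are
# radosgw-admin sync status
for an overall picture of which metadata/data shards are behind, and
# radosgw-admin data sync status --source-zone=master-zone
run on the site that is behind, for per-shard data sync detail. Note that
master-zone is just a placeholder for the actual zone name and I haven't
double-checked the exact behaviour on 10.2.7.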

Is there a way to manually resync the failing site? It seems I cannot
reshard the bucket until the sites are properly in sync again.
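
From what I've read, and I have not verified this on 10.2.7, a full data
resync from the master can be forced on the secondary site with
# radosgw-admin data sync init --source-zone=master-zone
(master-zone again being a placeholder) followed by a restart of the
secondary RGWs, but I'd appreciate confirmation before running that
against production.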

Thanks!
/andreas

On 29 May 2017 at 07:55, Василий Ангапов <angapov@xxxxxxxxx> wrote:
> I have almost the same problem, except that "bucket reshard" gives me
> "(5) Input/output error" (Red Hat Ceph Storage 2.2, i.e. Ceph 10.2.5).
> I had a discussion with Red Hat Support and they told me that it is
> related to malfunctioning RGW multisite replication. Do you have a
> multisite configuration?
>
> Regards, Vasily
>
> 2017-05-26 23:28 GMT+03:00 Andreas Calminder <andreas.calminder@xxxxxxxxxx>:
>> Hi,
>> I posted this in ceph-users earlier; thought I'd try here as well. I'm
>> running Jewel (10.2.7). To get rid of an oversized bucket (14M+
>> objects), I tried to reshard the bucket index so that I could remove
>> the bucket without having the RGW run out of memory.
>>
>> As per the Red Hat documentation, I ran:
>> # radosgw-admin bucket reshard --bucket=oversized_bucket --num-shards=300
>> I noted the old instance id and waited while the command counted
>> through all the items; at the very end it spat out "ERROR: bi_list():
>> (4) Interrupted system call".
>>
>> Now I have the new bucket instance with a sharded index (300 shards),
>> which appears to be unused, and the old, unsharded instance of the
>> bucket, which still seems to be the active one.
>>
>> # radosgw-admin --cluster drceph-tcs-prod metadata get bucket:oversized_bucket
>> returns the old instance id in bucket_id.
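>>
>> For completeness, I believe both the old and the new instance should
>> show up in
>> # radosgw-admin metadata list bucket.instance
>> (grepping for oversized_bucket), though I haven't verified the exact
>> output format on 10.2.7.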
>>
>> Two questions:
>>
>> * How do I remove the new bucket id left over from the failed reshard
>> command? Since it's not used, it's confusing to have it floating around
>> (see the sketch after this list for what I'm considering).
>> * How do I actually reshard the oversized_bucket? Actually, I really
>> don't care about the bucket: if there's a way to remove the bucket and
>> its objects without the index operations causing the radosgw to
>> allocate all available memory and crash, I'd rather do that.
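>>
>> For the first point, my understanding, which I have not verified on
>> 10.2.7, is that the unused instance from the failed reshard could be
>> cleaned up with something like
>> # radosgw-admin bi purge --bucket=oversized_bucket --bucket-id=NEW_INSTANCE_ID
>> to drop its index objects, followed by
>> # radosgw-admin metadata rm bucket.instance:oversized_bucket:NEW_INSTANCE_ID
>> to remove the metadata entry, where NEW_INSTANCE_ID is a placeholder
>> for the instance id the reshard command printed. For the second point,
>> I've seen a --bypass-gc option mentioned for exactly this kind of
>> oversized bucket, i.e. something like
>> # radosgw-admin bucket rm --bucket=oversized_bucket --purge-objects --bypass-gc
>> but I'm not sure it's available in 10.2.7, so corrections are welcome.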
>>
>> Regards,
>> Andreas



-- 
Andreas Calminder
System Administrator
IT Operations Core Services

Klarna AB (publ)
Sveavägen 46, 111 34 Stockholm
Tel: +46 8 120 120 00
Reg no: 556737-0431
klarna.com