On 11/04/2015 09:49 AM, Daniel Schneller wrote:
> We had a similar issue in Firefly, where we had a very large number
> (about 1,500,000) of buckets for a single RGW user. We observed a number
> of slow requests in day-to-day use, but did not think much of it at the
> time.
>
> At one point the primary OSD managing the list of buckets for that user
> crashed and could not restart, because processing the tremendous amount
> of buckets on startup - which also seemed to be single-threaded, judging
> by the 100% CPU usage we could see - took longer than the suicide
> timeout. That led to this OSD crashing again and again. Eventually, it
> would be marked out and the secondary tried to process the list with the
> same result, leading to a cascading failure.
>
> While I am quite certain it is a different code path in your case (you
> speak about a handful of buckets), it certainly sounds like a very
> similar issue. Do you have lots of objects in those few buckets, or are
> they few, but large in size to reach the 30 TB? Worst case you might be
> in for a similar procedure to the one we had to take: take load off the
> cluster, increase the timeouts to ridiculous levels and copy the data
> over into a more evenly distributed set of buckets (users in our case).
> Fortunately, as long as we did not try to write to the problematic
> buckets, we could still read from them.
>

If you have a large number of objects in a bucket you might want to give
the new bucket index sharding feature a try. The bucket index objects can
become very large otherwise. Sharding has been available since Hammer.
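For example - untested here, and as far as I know it only affects buckets
created after the change, so existing buckets would still need their data
copied over - something along these lines in ceph.conf on the radosgw node
should give every new bucket a sharded index (the section name
"client.radosgw.gateway" and the shard count of 8 are just placeholders
for whatever matches your setup):

    [client.radosgw.gateway]
    # split each new bucket index across 8 RADOS objects
    # (default is 0, i.e. one unsharded index object per bucket)
    rgw override bucket index max shards = 8

Restart the radosgw after setting it, and check the documentation for your
exact version before relying on it.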
Wido

> Please note that this is only a guess, I could be completely wrong.
>
> Daniel
>
> On 2015-11-03 13:33:19 +0000, Gerd Jakobovitsch said:
>
>> Dear all,
>>
>> I have a cluster running Hammer (0.94.5), with 5 nodes. The main usage
>> is S3-compatible object storage.
>>
>> I am running into a very troublesome problem on this cluster. A single
>> object in .rgw.buckets.index is not responding to requests and takes a
>> very long time to recover after an OSD restart. During this time, the
>> OSDs this object is mapped to get heavily loaded, with high CPU as well
>> as memory usage. At the same time, the directory
>> /var/lib/ceph/osd/ceph-XX/current/omap accumulates a large number of
>> entries (> 10,000) that won't decrease.
>>
>> Very frequently I get more than 100 blocked requests for this object,
>> and the main OSD that stores it ends up accepting no other requests.
>> Very frequently the OSD ends up crashing due to the filestore timeout,
>> and getting it up again is very troublesome - it usually has to run
>> alone in the node for a long time, until the object somehow gets
>> recovered.
>>
>> In the OSD logs there are several entries like these:
>>
>> -7051> 2015-11-03 10:46:08.339283 7f776974f700 10 log_client logged
>> 2015-11-03 10:46:02.942023 osd.63 10.17.0.9:6857/2002 41 : cluster [WRN]
>> slow request 120.003081 seconds old, received at 2015-11-03 10:43:56.472825:
>> osd_repop(osd.53.236531:7 34.7 8a7482ff/.dir.default.198764998.1/head//34
>> v 236984'22) currently commit_sent
>>
>> 2015-11-03 10:28:32.405265 7f0035982700  0 log_channel(cluster) log [WRN] :
>> 97 slow requests, 1 included below; oldest blocked for > 2046.502848 secs
>> 2015-11-03 10:28:32.405269 7f0035982700  0 log_channel(cluster) log [WRN] :
>> slow request 1920.676998 seconds old, received at 2015-11-03 09:56:31.728224:
>> osd_op(client.210508702.0:14696798 .dir.default.198764998.1
>> [call rgw.bucket_prepare_op] 15.8a7482ff ondisk+write+known_if_redirected
>> e236956) currently waiting for blocked object
>>
>> Is there any way to dig deeper into this problem, or to rebuild the
>> .rgw index without losing data? I currently have 30 TB of data in the
>> cluster - most of it concentrated in a handful of buckets - that I
>> can't lose.
>>
>> Regards.

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com