Hi,
Thanks for your response.

> That is definitely the obvious next step, but it's a non-trivial
> amount of work and hasn't yet been started on by anybody. This is
> probably a good subject for a CDS blueprint!

We want to split our big bucket into smaller ones, sharding it in front
of radosgw. Do you think this is a good workaround for the problems
caused by a large index?

Regards
Dominik

2013/10/18 Gregory Farnum <greg@xxxxxxxxxxx>:
> On Fri, Oct 18, 2013 at 4:01 AM, Dominik Mostowiec
> <dominikmostowiec@xxxxxxxxx> wrote:
>> Hi,
>> I plan to shard my largest bucket because of deep-scrubbing issues:
>> when the PG that holds this bucket's index is deep-scrubbed, many
>> slow requests appear and the OSD grows in memory (after the latest
>> scrub it grew to 9 GB).
>>
>> I am trying to find out why a large bucket index causes problems when
>> it is scrubbed. On a test cluster:
>>
>> radosgw-admin bucket stats --bucket=test1-XX
>> { "bucket": "test1-XX",
>>   "pool": ".rgw.buckets",
>>   "index_pool": ".rgw.buckets",
>>   "id": "default.4211.2",
>>   ...
>>
>> I guess the index is in the object .dir.default.4211.2 (pool: .rgw.buckets).
>>
>> rados -p .rgw.buckets get .dir.default.4211.2 -
>> <empty>
>>
>> But:
>> rados -p .rgw.buckets listomapkeys .dir.default.4211.2
>> test_file_2.txt
>> test_file_2_11.txt
>> test_file_3.txt
>> test_file_4.txt
>> test_file_5.txt
>>
>> So I guess the list of files is stored in leveldb, not in one large
>> file. The 'omap' files live in {osd_dir}/current/omap/; the largest
>> file I found in this directory (on production) is 8.8 MB.
>>
>> I'm a little confused.
>>
>> How is the list of files for a bucket stored?
>
> The index is stored as a bunch of omap entries in a single object.
>
>> If the list of objects in a bucket is split across many small files
>> in leveldb, then a large bucket (with many files) should not cause
>> higher latency when PUTting a new object.
>
> That's not quite how it works.
> Leveldb has a custom storage format in which it stores sets of keys
> based on both time of update and the value of the key, so the size of
> the individual files in its directory has no correlation to the
> number or size of any given set of entries.
>
>> Scrubbing also should not be a problem, I think...
>
> The problem you're running into is that scrubbing is done on an
> object-by-object basis, and so the OSD is reading all of the keys
> associated with that object out of leveldb, and processing them, at
> once. This can be very much larger than the 8 MB file you've found in
> the leveldb directory, as discussed above.
>
>> What do you think about using sharding to split big buckets into
>> smaller ones to avoid the problems with big indexes?
>
> That is definitely the obvious next step, but it's a non-trivial
> amount of work and hasn't yet been started on by anybody. This is
> probably a good subject for a CDS blueprint!
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com

--
Regards
Dominik
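[Editor's note: since RGW had no built-in bucket index sharding at the time, the "shard before radosgw" idea discussed above amounts to hashing each object key to one of N fixed buckets on the client side. A minimal sketch follows; the shard count, the `{base}-{NN}` bucket naming scheme, and the `shard_bucket` helper are illustrative assumptions, not anything specified in the thread.]

```python
import hashlib

NUM_SHARDS = 64  # assumption: pick a count that keeps each index comfortably small


def shard_bucket(base_bucket: str, object_key: str,
                 num_shards: int = NUM_SHARDS) -> str:
    """Map an object key to one of num_shards buckets.

    Uses a stable hash of the key, so both writers and readers can
    recompute the same bucket name without any lookup table.
    """
    digest = hashlib.md5(object_key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_shards
    return f"{base_bucket}-{shard:02d}"


# Every PUT and GET for a key goes to the same shard bucket,
# e.g. somewhere in test1-00 .. test1-63.
print(shard_bucket("test1", "test_file_2.txt"))
```

With a uniform hash, each shard's index holds roughly 1/N of the keys, so the per-object omap set that deep scrub must read at once shrinks proportionally. Listing the whole logical bucket, however, then requires merging listings from all N shard buckets.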