Re: Dealing with radosgw and large OSD LevelDBs: compact, start over, something else?

Florian Haas <florian@xxxxxxxxxxx> · Tue, 22 Dec 2015 08:27:40 +0100

On Tue, Dec 22, 2015 at 3:10 AM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>> >> >> Hey everyone,
>> >> >>
>> >> >> I recently got my hands on a cluster that has been underperforming in
>> >> >> terms of radosgw throughput, averaging about 60 PUTs/s with 70K
>> >> >> objects where a freshly-installed cluster with near-identical
>> >> >> configuration would do about 250 PUTs/s. (Neither of these values is
>> >> >> what I'd consider high throughput, but this is just to give you a
>> >> >> feel for the relative performance hit.)
>> >> >>
>> >> >> Some digging turned up that of the fewer than 200 buckets in the
>> >> >> cluster, about 40 held in excess of a million objects (1-4M), with
>> >> >> one bucket being an outlier with 45M objects. All buckets were created
>> >> >> post-Hammer, and use 64 index shards. The total number of objects in
>> >> >> radosgw is approx. 160M.
>> >> >>
>> >> >> Now this isn't a large cluster in terms of OSD distribution; there are
>> >> >> only 12 OSDs (after all, we're only talking double-digit terabytes
>> >> >> here). In almost all of these OSDs, the LevelDB omap directory has
>> >> >> grown to a size of 10-20 GB.
>> >> >>
>> >> >> So I have several questions on this:
>> >> >>
>> >> >> - Is it correct to assume that such a large LevelDB would be quite
>> >> >> detrimental to radosgw performance overall?
>> >> >>
>> >> >> - If so, would clearing that one large bucket and distributing the
>> >> >> data over several new buckets reduce the LevelDB size at all?
>> >> >>
>> >> >> - Is there even something akin to "ceph mon compact" for OSDs?
>> >> >>
>> >> >> - Are these large LevelDB databases a simple consequence of having a
>> >> >> combination of many radosgw objects and few OSDs, with the
>> >> >> distribution per-bucket being comparatively irrelevant?
>> >> >>
>> >> >> I do understand that the 45M object bucket itself would have been a
>> >> >> problem pre-Hammer, with no index sharding available. But with what
>> >> >> others have shared here, a rule of thumb of one index shard per
>> >> >> million objects should be a good one to follow, so 64 shards for 45M
>> >> >> objects doesn't strike me as totally off the mark. That's why I think
>> >> >> LevelDB I/O is actually the issue here. But I might be totally wrong;
>> >> >> all insights appreciated. :)
>> >> >
>> >> >
>> >> > Do you enable bucket index sharding?
>> >>
>> >> As stated above, yes. 64 shards.
>> >>
>> >> > I'm not sure what the bottleneck in your cluster is, but I guess you
>> >> > could disable leveldb compression to test whether that reduces the
>> >> > impact of compaction.
>> >>
>> >> Hmmm, you mean with "leveldb_compression = false"? Could you explain
>> >> why exactly *disabling* compression would help with large omaps?
>> >>
>> >> Also, would "osd_compact_leveldb_on_mount" (undocumented) help here?
>> >> It looks to me like that is an option with no actual implementing
>> >> code, but I may be missing something.
>> >>
>> >> The similarly named leveldb_compact_on_mount seems to only compact
>> >> LevelDB data in LevelDBStore. But I may be mistaken there too, as that
>> >> option also seems to be undocumented. Would configuring an osd with
>> >> leveldb_compact_on_mount=true do omap compaction on OSD daemon
>> >> startup, in a FileStore OSD?
>> >
>> >
>> > I don't have exact information to be sure this is the problem in your
>> > case, but I ran into this problem before: leveldb has only a single
>> > compaction thread, which spends a lot of time compressing and
>> > decompressing data during compaction.
>> >
>> > What's your version? I guess "leveldb_compression" or
>> > "osd_leveldb_compression" can help.
>>
>> This is on Hammer.
>>
>> Could you please clarify the semantics of leveldb_compact_on_mount and
>> leveldb_compression for OSDs though? Like I said, it looks like
>> neither of those options is documented anywhere.
>
>
> "leveldb_compact_on_mount": when the OSD boots, it will try to manually
> call compact; this procedure may consume a lot of time while booting.
> "leveldb_compression": it's an option passed through to leveldb internals;
> leveldb will compress each frozen L1+ block, so when iterating over leveldb
> or during compaction, lots of blocks need to be compressed and decompressed.

Okay, thank you. So to summarize,

- leveldb_compact_on_mount does compaction on boot, which may consume
a lot of time for a 20GB omap, and is off by default.

- leveldb_compression does compression on every write and
decompression on every read (which may be slow if the omap is large
and needs to be iterated), and is on by default.

Now that raises a few more questions (sorry for persisting here — I
really want to get to the bottom of this):

- If an omap is already 20G in size, how much larger will it get with
compression disabled?

- How exactly would slow omap *iteration* also significantly slow down
radosgw object *creation* (as evident from rest-bench), where there
really shouldn't be any iteration involved? Or does radosgw have to
enumerate *all* LevelDB entries associated with the bucket index
object(s) for some reason, before it can update the index?

- Is your suggestion, given the scenario I've described here, to
enable leveldb_compact_on_mount and disable leveldb_compression? (I
believe it is, just making sure.)

- If large, compressed, uncompacted omap directories cause radosgw to
slow down significantly, wouldn't it be a better idea to reverse the
defaults (meaning to enable compaction and disable compression)?
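
One way to approach the size question empirically would be to record the
per-OSD omap directory size before and after changing either option. The
following is only a minimal sketch: it assumes the usual FileStore layout of
/var/lib/ceph/osd/ceph-<id>/current/omap (an assumption, adjust the glob
pattern for your deployment), and the helper names dir_size_bytes and
omap_sizes are made up for illustration.

```python
import glob
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if os.path.isfile(full):
                    total += os.path.getsize(full)
            except OSError:
                # LevelDB may delete a file mid-walk while compacting; skip it.
                pass
    return total


def omap_sizes(pattern="/var/lib/ceph/osd/ceph-*/current/omap"):
    """Map each omap directory matching pattern to its size in bytes.

    The default pattern is an assumed FileStore layout, not something
    confirmed in this thread.
    """
    return {path: dir_size_bytes(path) for path in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    for path, size in omap_sizes().items():
        print("%s: %.1f GiB" % (path, size / 2.0 ** 30))
```

Running this on each OSD host before and after a configuration change (and
an OSD restart) would give a rough before/after comparison. It measures
on-disk size only and says nothing about read or compaction latency.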

Cheers,
Florian