Re: Dealing with radosgw and large OSD LevelDBs: compact, start over, something else?

Haomai Wang <haomaiwang@xxxxxxxxx> · Tue, 22 Dec 2015 10:10:25 +0800

On Tue, Dec 22, 2015 at 3:33 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
On Mon, Dec 21, 2015 at 4:15 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:

>

>

> On Mon, Dec 21, 2015 at 10:55 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:

>>

>> On Mon, Dec 21, 2015 at 3:35 PM, Haomai Wang <haomai@xxxxxxxx> wrote:

>> >

>> >

>> > On Fri, Dec 18, 2015 at 1:16 AM, Florian Haas <florian@xxxxxxxxxxx>

>> > wrote:

>> >>

>> >> Hey everyone,

>> >>

>> >> I recently got my hands on a cluster that has been underperforming in

>> >> terms of radosgw throughput, averaging about 60 PUTs/s with 70K

>> >> objects where a freshly-installed cluster with near-identical

>> >> configuration would do about 250 PUTs/s. (Neither of these values is

>> >> what I'd consider high throughput, but this is just to give you a feel

>> >> for the relative performance hit.)

>> >>

>> >> Some digging turned up that of the less than 200 buckets in the

>> >> cluster, about 40 held in excess of a million objects (1-4M), with

>> >> one bucket being an outlier with 45M objects. All buckets were created

>> >> post-Hammer, and use 64 index shards. The total number of objects in

>> >> radosgw is approx. 160M.

>> >>

>> >> Now this isn't a large cluster in terms of OSD distribution; there are

>> >> only 12 OSDs (after all, we're only talking double-digit terabytes

>> >> here). In almost all of these OSDs, the LevelDB omap directory has

>> >> grown to a size of 10-20 GB.
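To put a number on the omap growth described above, the directory sizes can be summed per OSD. The following is a minimal sketch, not something from the thread: it assumes FileStore OSDs mounted under /var/lib/ceph/osd with the LevelDB data in current/omap (adjust the path if your layout differs), and the function names are made up for illustration.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # LevelDB removes table files during compaction, so a file
            # may vanish between listing and stat; skip those.
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass
    return total


def omap_sizes(osd_root="/var/lib/ceph/osd"):
    """Map each OSD directory name to the size of its current/omap dir.

    The default osd_root and the current/omap layout are assumptions
    about a FileStore deployment, not values taken from the thread.
    """
    sizes = {}
    if not os.path.isdir(osd_root):
        return sizes
    for entry in sorted(os.listdir(osd_root)):
        omap = os.path.join(osd_root, entry, "current", "omap")
        if os.path.isdir(omap):
            sizes[entry] = dir_size_bytes(omap)
    return sizes


if __name__ == "__main__":
    for osd, size in omap_sizes().items():
        print(f"{osd}: {size / 2**30:.1f} GiB")
```

Running this on each OSD host before and after any compaction experiment gives a simple way to see whether the omap directories actually shrink.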

>> >>

>> >> So I have several questions on this:

>> >>

>> >> - Is it correct to assume that such a large LevelDB would be quite

>> >> detrimental to radosgw performance overall?

>> >>

>> >> - If so, would clearing that one large bucket and distributing the

>> >> data over several new buckets reduce the LevelDB size at all?

>> >>

>> >> - Is there even something akin to "ceph mon compact" for OSDs?

>> >>

>> >> - Are these large LevelDB databases a simple consequence of having a

>> >> combination of many radosgw objects and few OSDs, with the

>> >> distribution per-bucket being comparatively irrelevant?

>> >>

>> >> I do understand that the 45M object bucket itself would have been a

>> >> problem pre-Hammer, with no index sharding available. But with what

>> >> others have shared here, a rule of thumb of one index shard per

>> >> million objects should be a good one to follow, so 64 shards for 45M

>> >> objects doesn't strike me as totally off the mark. That's why I think

>> >> LevelDB I/O is actually the issue here. But I might be totally wrong;

>> >> all insights appreciated. :)
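The shard arithmetic in the paragraph above can be checked quickly. This is only a sketch: it reads the rule of thumb as "at most about one million objects per index shard", which is an interpretation, and the helper names are made up for illustration.

```python
import math


def objects_per_shard(num_objects, num_shards):
    """Average number of index entries per shard, assuming an even spread."""
    return num_objects / num_shards


def shards_needed(num_objects, per_shard_target=1_000_000):
    """Smallest shard count that keeps each shard at or below the target."""
    return max(1, math.ceil(num_objects / per_shard_target))


largest_bucket = 45_000_000  # the outlier bucket from the message above

print(objects_per_shard(largest_bucket, 64))  # 703125.0
print(shards_needed(largest_bucket))          # 45
```

With 64 shards the outlier bucket averages roughly 700K entries per shard, which is under the one-million target, so the shard count by itself does not look like the problem.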

>> >

>> >

>> > Do you enable bucket index sharding?

>>

>> As stated above, yes. 64 shards.

>>

>> > I'm not sure where the bottleneck in your cluster is. I guess you could

>> > disable leveldb compression to test whether that reduces the impact of compaction.

>>

>> Hmmm, you mean with "leveldb_compression = false"? Could you explain

>> why exactly *disabling* compression would help with large omaps?

>>

>> Also, would "osd_compact_leveldb_on_mount" (undocumented) help here?

>> It looks to me like that is an option with no actual implementing

>> code, but I may be missing something.

>>

>> The similarly named leveldb_compact_on_mount seems to only compact

>> LevelDB data in LevelDBStore. But I may be mistaken there too, as that

>> option also seems to be undocumented. Would configuring an osd with

>> leveldb_compact_on_mount=true do omap compaction on OSD daemon

>> startup, in a FileStore OSD?

>

>

> I don't have enough information to be sure this is the problem in your case, but I

> ran into this problem before: leveldb has only a single compaction thread, and it

> spends a lot of time compressing and uncompressing data during compaction.

>

> What's your version? I guess "leveldb_compression" or

> "osd_leveldb_compression" can help.

This is on Hammer.

Could you please clarify the semantics of leveldb_compact_on_mount and

leveldb_compression for OSDs though? Like I said, it looks like

neither of those options is documented anywhere.

"leveldb_compact_on_mount": when the OSD boots, it tries to call compact manually; this procedure may take a long time during boot.
"leveldb_compression": an option passed through to leveldb internally. leveldb compresses each frozen L1+ block, so iterating over leveldb or running a compaction means lots of blocks have to be compressed and uncompressed.

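For testing the two options described above, an [osd] section could be generated along these lines. This is a hedged sketch: the option names are the ones mentioned in this thread, but whether they take effect for the FileStore omap on Hammer is exactly what is being asked here, so treat that as an assumption. The helper name render_osd_section is made up for illustration.

```python
import configparser
import io


def render_osd_section(compression=False, compact_on_mount=False):
    """Return ceph.conf-style text for an [osd] section with both options.

    Option names come from the thread; their effect on a given Ceph
    release is an assumption to verify on a test OSD first.
    """
    cfg = configparser.ConfigParser()
    cfg["osd"] = {
        "leveldb_compression": str(compression).lower(),
        "leveldb_compact_on_mount": str(compact_on_mount).lower(),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Compression off, one-time compaction at OSD start.
    print(render_osd_section(compression=False, compact_on_mount=True))
```

Since compaction at mount may take a long time, it is probably safer to restart one OSD at a time and wait for the cluster to settle before moving on to the next.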
Cheers,

Florian

-- 

Best Regards,
Wheat
