Re: Large OSD omap directories (LevelDBs)

Hi Wido,

I see your point. I would expect omaps to grow with the number of objects, but multiple OSDs reaching tens of GBs for their omaps seems excessive. I find it hard to believe that leaving the index unsharded for an RGW bucket of 500k objects would cause the 10 largest OSD omaps to grow to a total of 512GB, which is about 2000 times the size of 10 average omaps. Given the relative usage of our pools, and the much greater prominence of our non-RGW pools on the OSDs with huge omaps, I'm not inclined to think this is caused by some RGW configuration (or the lack of one).
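
For context, something like the following is enough to survey the per-OSD omap sizes (assuming the default filestore layout, where the LevelDB lives under current/omap in each OSD's data directory):

    # per-OSD omap (LevelDB) sizes on a filestore OSD host
    du -sh /var/lib/ceph/osd/ceph-*/current/omap | sort -h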

It's also worth pointing out that we've seen files that are slow to retrieve (I'm talking about rados get doing 120MB/sec on one file and 2MB/sec on another), and subsequently the omap of the OSD hosting the first stripe of those files growing from 30MB to 5GB in the span of an hour, during which the logs are flooded with LevelDB compaction activity.
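
In case anyone wants to watch the same thing happen, a rough sketch, assuming libradosstriper's default naming where the first stripe is the object name with a .0000000000000000 suffix (pool, object and OSD id are placeholders):

    # find the acting set (and primary) for the first stripe of a slow object
    ceph osd map <pool> <object>.0000000000000000
    # then watch that OSD's omap directory grow on its host
    watch -n 60 'du -sh /var/lib/ceph/osd/ceph-<id>/current/omap'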

Best regards,

George
________________________________________
From: Wido den Hollander [wido@xxxxxxxx]
Sent: 23 May 2017 14:00
To: Vasilakakos, George (STFC,RAL,SC); ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Large OSD omap directories (LevelDBs)

> On 23 May 2017 at 13:01, george.vasilakakos@xxxxxxxxxx wrote:
>
>
> > Your RGW buckets, how many objects in them, and do they have the index
> > sharded?
>
> > I know we have some very large & old buckets (10M+ RGW objects in a
> > single bucket), with correspondingly large OMAPs wherever that bucket
> > index is living (sufficiently large that trying to list the entire thing
> > online is fruitless). ceph's pgmap status says we have 2G RADOS objects
> > however, and you're only at 61M RADOS objects.
>
>
> According to radosgw-admin bucket stats the most populous bucket contains 568101 objects. There is no index sharding. The default.rgw.buckets.data pool contains 4162566 objects; I think striping is done by default with 4MB stripes.
>

Without index sharding 500k objects in a bucket can already cause larger OMAP directories. I'd recommend that you at least start to shard them.

Wido
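
[For anyone following along, a sketch of what that would look like; rgw_override_bucket_index_max_shards only applies to newly created buckets, and the offline radosgw-admin bucket reshard command may not be available on every release, so treat this as an outline rather than a recipe:]

    # ceph.conf on the RGW nodes: shard the index of newly created buckets
    [client.rgw.<name>]
        rgw_override_bucket_index_max_shards = 16

    # existing buckets need an offline reshard (stop writes to the bucket first)
    radosgw-admin bucket reshard --bucket=<bucket> --num-shards=16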

> Bear in mind RGW is a small use case for us currently.
> Most of the data lives in a pool that is accessed by specialized servers that have plugins based on libradosstriper. That pool stores around 1.8 PB in 32920055 objects.
>
> One thing of note is that we have this:
> filestore_xattr_use_omap=1
> in our ceph.conf and libradosstriper makes use of xattrs for striping metadata and locking mechanisms.
>
> This seems to have been removed some time ago, but the question is: could it have any effect? This cluster was built in January and ran Jewel initially.
>
> I do see the xattrs in XFS, but a sampling of an omap dir from an OSD looked like there might be some xattrs in there too.
>
> I'm going to try restarting an OSD with a big omap and also extracting a copy of one for further inspection.
> It seems to me like they might not be cleaning up old data. I'm fairly certain an active cluster would've compacted enough for 3-month-old SSTs to go away.
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


