Re: RadosGW performance degradation at 18 million objects stored.

On 09/13/2016 08:17 AM, Stas Starikevich wrote:
Hi All,

Asking for your assistance with RadosGW performance degradation once 18M
objects are stored (http://pasteboard.co/g781YI3J.png).
Throughput drops from 620 uploads/s to 180-190 uploads/s.

I ran a series of tests and see that upload performance degrades 3-4x
once the number of objects reaches 18M.
The number of OSDs doesn't matter; the problem reproduces with 6/18/56 OSDs.
Increasing the number of index shards doesn't help either. I originally hit the
problem with 8 shards per bucket; now it's 256, but the picture is the same.
The number of PGs on default.rgw.buckets.data also makes no difference,
although the latest test with 2048 PGs (+nobarrier, +leveldb_compression =
false) shows a slightly higher upload rate.

Please do not use nobarrier! In almost all situations you should absolutely not use it!
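For reference, a mount line that keeps the rest of the quoted tunings but leaves barriers enabled might look like this (a sketch only; adjust for how your OSDs are actually mounted, e.g. via ceph-disk/filestore, and note that delaylog is already the default on any recent kernel):

    # ceph.conf on the OSD nodes -- same XFS tunings as quoted above, minus nobarrier
    [osd]
    osd mount options xfs = rw,noatime,inode64,logbsize=256k,allocsize=4M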

The problem reproduces even with an erasure-coded pool (tested 4+2). Erasure
coding results in much higher inode usage (my first suspicion was a lack of RAM
to cache inodes), but that doesn't matter - it drops at 18M too.

I moved the meta/index pools to SSD only and increased the number of RGW threads
to 8192. That raised uploads/s from 250 to 600 (and eliminated the bad gateway
errors), but didn't help with the drop at the 18M-object threshold.

I applied extra tunings
(logbsize=256k, delaylog, allocsize=4M, nobarrier, leveldb_cache_size, leveldb_write_buffer_size, osd_pg_epoch_persisted_max_stale, osd_map_cache_size)
on the last few tests. They didn't help much, but the upload rate became
more stable, with no dips.

From the HDD stats I see that at the 18M threshold the number of 'read'
requests increases from 2-3 to

Any ideas?

With RGW writes, you are ultimately fighting seek behavior, and it's going to get worse the more objects you've written. There are a variety of reasons for this.

1) If you are not using blind buckets, every write is going to result in multiple round trips to update the bucket indices (i.e. more seeks and more latency).
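If you can live without bucket listing, blind (indexless) buckets avoid those index round trips entirely. Very roughly, and hedged - the exact knobs vary by version, so check the docs for your release, and it only affects buckets created after the change - it comes down to flipping the index_type of the placement target in the zone:

    radosgw-admin zone get > zone.json
    # edit zone.json: under placement_pools -> default-placement, set "index_type": 1 (indexless)
    radosgw-admin zone set < zone.json
    # then restart the radosgw instances so they pick up the new zone config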

2) The more objects you have, the lower the chance that any given object's inode/dentry will be cached. This was even worse in hammer as we didn't chunk xattrs at 255 bytes, so RGW metadata would push xattrs out of the inode causing yet another seek. That was fixed for jewel, but old objects will still slow things down.

2b) You can increase the dentry/inode cache in the kernel, but this comes with a cost. The more things you have cached, the longer it takes syncfs to complete, as it has to iterate through all of that cached metadata. This isn't so much a problem at small scale, but it has proven to be a problem on large clusters when there is a ton of memory in the node and lots of cached metadata.
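One way to watch how much of this metadata is actually cached on an OSD node (standard Linux tools; the exact slab names can differ depending on the filesystem):

    # dentry and XFS inode slab usage, snapshotted once
    slabtop -o | egrep 'dentry|xfs_inode'
    # total reclaimable slab memory (mostly these caches)
    grep SReclaimable /proc/meminfo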

3) filestore stores objects for a given PG in a nested directory hierarchy that becomes deeper as the number of objects grows. The number of objects you can store before hitting these thresholds depends on the number of PGs, the distribution of objects to PGs, and the filestore split/merge thresholds. A deeper directory hierarchy means that there are more dentries to keep in cache and a greater likelihood that memory pressure may push one of them out and an extra seek will need to be performed.
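As a rough worked example with the jewel-era defaults (filestore_merge_threshold = 10, filestore_split_multiple = 2), a PG subdirectory splits once it holds more than about

    filestore_split_multiple * abs(filestore_merge_threshold) * 16 = 2 * 10 * 16 = 320

objects. With 2048 PGs in the data pool, that puts the first wave of splits somewhere around 2048 * 320 ~= 650K objects, with each further level kicking in at roughly 16x the previous one. Ballpark numbers only - the real timing depends on how evenly objects hash across PGs.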

3a) when splitting is happening, ceph will definitely be looking up directory xattrs and also will perform a very large number of link/unlink operations. Usually this shouldn't require xattr lookups on the objects, but it appears that if selinux is enabled, it may (depending on the setup) read security metadata in the xattrs for an object to determine if the link/unlink can proceed. This is extremely slow and seek intensive. Even when selinux is not enabled, splitting a PG's directory hierarchy is going to involve a certain amount of overhead.

3b) On XFS, files in a directory are all created in the same AG with the idea that they will be physically close to each other on the disk. The idea is to hopefully reduce the number of seeks should they all be accessed at roughly the same point in time. When a split happens, new subdirectories are created and a portion of the objects in the parent directory are moved to the new subdirectories. The problem is that those subdirectories will not necessarily be in the same AG as the parent. As the directory hierarchy grows deeper, the leaf directories will become more fragmented until they have objects spread across every AG on the disk.

3c) Setting extremely high split/merge thresholds will likely mitigate a lot of what is happening here in point 3, but at the cost of making readdir potentially very expensive when the number of objects per PG grows high (say 100K objects/PG or more). This is primarily a problem during operations that need to list the files in the PG, such as during recovery.

So what can be done?

1) Make sure the bucket index pool has enough PGs for reasonably good distribution. If you have SSDs for journals, you may want to consider co-locating a set of OSDs for the bucket index pool on the SSDs as well. SSDs for journals may also help simply by reducing the write traffic to the disks.
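For example (a sketch with made-up numbers and the default jewel pool names; an SSD-only crush ruleset is assumed to already exist in your crush map):

    # check / bump the PG count on the bucket index pool
    ceph osd pool get default.rgw.buckets.index pg_num
    ceph osd pool set default.rgw.buckets.index pg_num 128
    ceph osd pool set default.rgw.buckets.index pgp_num 128
    # pin the index pool to the SSD-only crush ruleset (ruleset id 1 is hypothetical)
    ceph osd pool set default.rgw.buckets.index crush_ruleset 1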

2) dentry/inode lookups are tough. Arguably you could try tweaking the kernel dentry/inode cache until you find the right balance for your setup. It may also be possible to use the realtime allocator in XFS to put all of the inodes/dentries at the beginning of the volume and use LVM to map that portion of the volume to an SSD. Alternately, dm-cache or similar solutions may be able to help keep dentries/inodes on SSDs, though with other tradeoffs. At the very least, a fresh Jewel cluster should behave better than one still carrying objects written under hammer, given the xattr chunking fix mentioned above.
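The usual knob for the first suggestion is vm.vfs_cache_pressure (the value below is only illustrative; the right number depends on your workload and how much RAM the node has):

    # bias the kernel toward keeping dentries/inodes cached (default is 100)
    sysctl vm.vfs_cache_pressure=50
    # persist it across reboots
    echo 'vm.vfs_cache_pressure = 50' > /etc/sysctl.d/90-vfs-cache.conf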

3) filestore splitting/merging is also tough. This problem was one of the motivations for writing bluestore. You can try to increase the thresholds, but there's a real risk, especially if selinux is enabled, that it could make the split/merge behavior *very* bad when it does eventually happen. You could set the values so large that filestore never splits/merges, but we don't really know what effect a slow readdir might have once you hit 100K+ objects per PG. It's probably worth testing though. Disabling selinux (or finding a way to make selinux not read security xattrs) is worth doing if you don't absolutely need it.
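If you do raise the thresholds, it's just a couple of ceph.conf settings on the OSDs (the numbers below are commonly used examples, not a recommendation, and they only influence future split decisions). getenforce will tell you whether selinux is currently enforcing.

    [osd]
    # subdirectories then split at roughly 8 * 40 * 16 = 5120 objects instead of 320
    filestore merge threshold = 40
    filestore split multiple = 8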

Ultimately the hope on our part is that bluestore is going to make a lot of this much faster. Bluestore stores data directly on a partition without a filesystem in the way. Metadata is currently stored in rocksdb, but that can be swapped out for other key/value stores as time goes on (even potentially in-memory using nvdimms). With bluestore it's very easy to put rocksdb on SSDs, or to apply even more fine-grained control if things like 3D XPoint or nvdimms prove to be valuable for commonly-accessed metadata. Bluestore doesn't do journal writes when objects are large, which helps in purely throughput-limited scenarios. And importantly, it doesn't have the same kind of nested directory hierarchy problem that filestore does. This isn't to say that it won't slow down with more objects. It's inevitable that there will be some performance loss as the disks fill up and there are more objects to keep track of. The hope, though, is that it will be much faster in general, easier to optimize with a small amount of fast storage (SSD, nvdimm, etc.), and more resilient to performance degradation as the number of objects increases.

Mark


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


