Re: RadosGW performance degradation at 18 million objects stored.

On 09/13/2016 08:17 AM, Stas Starikevich wrote:
Hi All,

Asking for your assistance with RadosGW performance degradation once 18M
objects are stored (http://pasteboard.co/g781YI3J.png).
Throughput drops from 620 uploads/s to 180-190 uploads/s.

I ran a series of tests and see that upload performance degrades 3-4x
once the number of objects reaches 18M.
The number of OSDs doesn't matter; the problem reproduces with 6/18/56 OSDs.
Increasing the number of index shards doesn't help either. I originally hit the
problem with 8 shards per bucket; now it's 256, but the picture is the same.
The number of PGs on default.rgw.buckets.data also makes no difference,
although the latest test with 2048 PGs (+nobarrier, +leveldb_compression =
false) shows a slightly higher upload rate.

Please do not use nobarrier! In almost all situations you should absolutely not use it!
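For reference, a mount line that keeps the rest of the quoted tunings but leaves barriers enabled might look like this (a sketch only; adjust for how your OSDs are actually mounted, e.g. via ceph-disk/filestore, and note that delaylog is already the default on any recent kernel):

    # ceph.conf on the OSD nodes -- same XFS tunings as quoted above, minus nobarrier
    [osd]
    osd mount options xfs = rw,noatime,inode64,logbsize=256k,allocsize=4M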

The problem reproduces even with an erasure-coded pool (tested 4+2). Erasure
coding results in much higher inode usage (my first suspicion was a lack of RAM
to cache inodes), but that doesn't matter - it drops at 18M too.

I moved the meta/index pools to SSD only and increased the number of RGW threads
to 8192. That raised uploads/s from 250 to 600 (and eliminated the bad gateway
errors), but didn't help with the drop at the 18M-object threshold.

I applied extra tunings
(logbsize=256k, delaylog, allocsize=4M, nobarrier, leveldb_cache_size, leveldb_write_buffer_size, osd_pg_epoch_persisted_max_stale, osd_map_cache_size)
on the last few tests. They didn't help much, but the upload rate became
more stable, with no dips.

From the HDD stats I see that at the 18M threshold the number of 'read'
requests increases from 2-3 to

Any ideas?

With RGW writes, you are ultimately fighting seek behavior, and it's going to get worse the more objects you've written. There are a variety of reasons for this.

1) If you are not using blind buckets, every write is going to result in multiple round trips to update the bucket indices (i.e. more seeks and more latency).
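If you can live without bucket listing, blind (indexless) buckets avoid those index round trips entirely. Very roughly, and hedged - the exact knobs vary by version, so check the docs for your release, and it only affects buckets created after the change - it comes down to flipping the index_type of the placement target in the zone:

    radosgw-admin zone get > zone.json
    # edit zone.json: under placement_pools -> default-placement, set "index_type": 1 (indexless)
    radosgw-admin zone set < zone.json
    # then restart the radosgw instances so they pick up the new zone config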

2) The more objects you have, the lower the chance that any given object's inode/dentry will be cached. This was even worse in hammer as we didn't chunk xattrs at 255 bytes, so RGW metadata would push xattrs out of the inode causing yet another seek. That was fixed for jewel, but old objects will still slow things down.

2b) You can increase the dentry/inode cache in the kernel, but this comes with a cost. The more things you have cached, the longer it takes syncfs to complete, as it has to iterate through all of that cached metadata. This isn't so much a problem at small scale, but it has proven to be a problem on large clusters when there is a ton of memory in the node and lots of cached metadata.
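One way to watch how much of this metadata is actually cached on an OSD node (standard Linux tools; the exact slab names can differ depending on the filesystem):

    # dentry and XFS inode slab usage, snapshotted once
    slabtop -o | egrep 'dentry|xfs_inode'
    # total reclaimable slab memory (mostly these caches)
    grep SReclaimable /proc/meminfo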

3) filestore stores objects for a given PG in a nested directory hierarchy that becomes deeper as the number of objects grows. The number of objects you can store before hitting these thresholds depends on the number of PGs, the distribution of objects to PGs, and the filestore split/merge thresholds. A deeper directory hierarchy means that there are more dentries to keep in cache and a greater likelihood that memory pressure may push one of them out and an extra seek will need to be performed.
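As a rough worked example with the jewel-era defaults (filestore_merge_threshold = 10, filestore_split_multiple = 2), a PG subdirectory splits once it holds more than about

    filestore_split_multiple * abs(filestore_merge_threshold) * 16 = 2 * 10 * 16 = 320

objects. With 2048 PGs in the data pool, that puts the first wave of splits somewhere around 2048 * 320 ~= 650K objects, with each further level kicking in at roughly 16x the previous one. Ballpark numbers only - the real timing depends on how evenly objects hash across PGs.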

3a) when splitting is happening, ceph will definitely be looking up directory xattrs and also will perform a very large number of link/unlink operations. Usually this shouldn't require xattr lookups on the objects, but it appears that if selinux is enabled, it may (depending on the setup) read security metadata in the xattrs for an object to determine if the link/unlink can proceed. This is extremely slow and seek intensive. Even when selinux is not enabled, splitting a PG's directory hierarchy is going to involve a certain amount of overhead.

3b) On XFS, files in a directory are all created in the same AG with the idea that they will be physically close to each other on the disk. The idea is to hopefully reduce the number of seeks should they all be accessed at roughly the same point in time. When a split happens, new subdirectories are created and a portion of the objects in the parent directory are moved to the new subdirectories. The problem is that those subdirectories will not necessarily be in the same AG as the parent. As the directory hierarchy grows deeper, the leaf directories will become more fragmented until they have objects spread across every AG on the disk.

3c) Setting extremely high split/merge thresholds will likely mitigate a lot of what is happening here in point 3, but at the cost of making readdir potentially very expensive when the number of objects per PG grows high (say 100K objects/PG or more). This is primarily a problem during operations that need to list the files in the PG, such as during recovery.

So what can be done?

1) Make sure the bucket index pool has enough PGs for reasonably good distribution. If you have SSDs for journals, you may want to consider co-locating a set of OSDs for the bucket index pool on the SSDs as well. SSDs for journals may also help simply by reducing the write traffic to the disks.
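For example (a sketch with made-up numbers and the default jewel pool names; an SSD-only crush ruleset is assumed to already exist in your crush map):

    # check / bump the PG count on the bucket index pool
    ceph osd pool get default.rgw.buckets.index pg_num
    ceph osd pool set default.rgw.buckets.index pg_num 128
    ceph osd pool set default.rgw.buckets.index pgp_num 128
    # pin the index pool to the SSD-only crush ruleset (ruleset id 1 is hypothetical)
    ceph osd pool set default.rgw.buckets.index crush_ruleset 1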

2) dentry/inode lookups are tough. Arguably you could try tweaking the kernel dentry/inode cache until you find the right balance for your setup. It may also be possible to use the realtime allocator in XFS to put all of the inodes/dentries at the beginning of the volume and use LVM to map that portion of the volume to an SSD. Alternately, dm-cache or similar solutions may be able to help keep dentries/inodes on SSDs, though with other tradeoffs. At the very least, a fresh Jewel cluster should behave better than one still carrying objects written under hammer, given the xattr chunking fix mentioned above.
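The usual knob for the first suggestion is vm.vfs_cache_pressure (the value below is only illustrative; the right number depends on your workload and how much RAM the node has):

    # bias the kernel toward keeping dentries/inodes cached (default is 100)
    sysctl vm.vfs_cache_pressure=50
    # persist it across reboots
    echo 'vm.vfs_cache_pressure = 50' > /etc/sysctl.d/90-vfs-cache.conf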

3) filestore splitting/merging is also tough. This problem was one of the motivations for writing bluestore. You can try to increase the thresholds, but there's a real risk, especially if selinux is enabled, that it could make the split/merge behavior *very* bad when it does eventually happen. You could set the values so large that filestore never splits/merges, but we don't really know what effect a slow readdir might have once you hit 100K+ objects per PG. It's probably worth testing though. Disabling selinux (or finding a way to make selinux not read security xattrs) is worth doing if you don't absolutely need it.
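If you do raise the thresholds, it's just a couple of ceph.conf settings on the OSDs (the numbers below are commonly used examples, not a recommendation, and they only influence future split decisions). getenforce will tell you whether selinux is currently enforcing.

    [osd]
    # subdirectories then split at roughly 8 * 40 * 16 = 5120 objects instead of 320
    filestore merge threshold = 40
    filestore split multiple = 8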

Ultimately the hope on our part is that bluestore is going to make a lot of this much faster. Bluestore stores data directly on a partition without a filesystem in the way. Metadata is currently stored in rocksdb, but that can be swapped out for other key/value stores as time goes on (even potentially in-memory using nvdimms). With bluestore it's very easy to put rocksdb on SSDs, or to apply even more fine-grained control if things like 3D XPoint or nvdimms prove to be valuable for commonly-accessed metadata. Bluestore doesn't do journal writes when objects are large, which helps in purely throughput-limited scenarios. And importantly, it doesn't have the same kind of nested directory hierarchy problem that filestore does. This isn't to say that it won't slow down with more objects. It's inevitable that there will be some performance loss as the disks fill up and there are more objects to keep track of. The hope, though, is that it will be much faster in general, easier to optimize with a small amount of fast storage (SSD, nvdimm, etc.), and more resilient to performance degradation as the number of objects increases.

Mark


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


