Hi All,

Asking for your assistance with a RadosGW performance degradation that starts once about 18M objects are stored (http://pasteboard.co/g781YI3J.png). The upload rate drops from 620 uploads/s to 180-190 uploads/s.

I ran a series of tests and see that upload performance degrades by a factor of 3-4 once the number of objects reaches ~18M:

- The number of OSDs doesn't matter - the problem reproduces with 6/18/56 OSDs.
- Increasing the number of index shards doesn't help. Originally I hit the problem with 8 shards per bucket; now it's 256, but the picture is the same.
- The number of PGs on default.rgw.buckets.data also makes no difference, although the latest test with 2048 PGs (+nobarrier, +leveldb_compression = false) shows a slightly higher upload rate.
- The problem reproduces even with an erasure-coded pool (tested 4+2). Erasure coding gives much higher inode usage (my first suspicion was a lack of RAM for the inode cache), but that doesn't matter - it drops at 18M too.
- I moved the meta/index pools to SSD-only OSDs.
- I increased the number of RGW threads to 8192. That raised the rate from 250 to 600 uploads/s (and no more bad gateway errors), but didn't help with the drop at the 18M-object threshold.
- Extra tunings (logbsize=256k, delaylog, allocsize=4M, nobarrier, leveldb_cache_size, leveldb_write_buffer_size, osd_pg_epoch_persisted_max_stale, osd_map_cache_size) were applied in the few latest tests. They didn't help much, but the upload rate became more stable, with no drops.

Rough sketches of how the changes above are configured are appended at the end of this mail; the exact values are in the ceph.conf pastebin below.

From the HDD stats I see that at the 18M threshold the number of 'read' requests increases from 2-3 to ...

Any ideas?

The Ceph cluster has 9 nodes:
- ceph-mon0{1..3} - 12G RAM, SSD
- ceph-node0{1..6} - 24G RAM, 9 OSDs with SSD journals

The mon servers run the ceph-mon daemons, haproxy (https-ecc) and the RGW (civetweb) services.

# ceph -v
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

# ceph -s
    cluster 9cb0840a-bd73-499a-ae09-eaa75a80bddb
     health HEALTH_OK
     monmap e1: 3 mons at {ceph-mon01=10.10.10.21:6789/0,ceph-mon02=10.10.10.22:6789/0,ceph-mon03=10.10.10.23:6789/0}
            election epoch 8, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
     osdmap e1476: 62 osds: 62 up, 62 in
            flags sortbitwise
      pgmap v68348: 2752 pgs, 12 pools, 2437 GB data, 31208 kobjects
            7713 GB used, 146 TB / 153 TB avail
                2752 active+clean
  client io 1043 kB/s rd, 48307 kB/s wr, 1043 op/s rd, 8153 op/s wr

ceph osd tree: http://pastebin.com/scNuW0LN
ceph df: http://pastebin.com/ZyQByHG4
ceph.conf: http://pastebin.com/9AxVr1gm
ceph osd dump: http://pastebin.com/4mesKGD0

Screenshots from the Grafana page:
- Number of objects at the degradation moment: http://pasteboard.co/2B5OZ03d0.png
- IOPS drop: http://pasteboard.co/2B6vDyKEn.png
- Disk util raised to 80%: http://pasteboard.co/2B6YREzoC.png
- Disk operations: http://pasteboard.co/2B7uI5PWB.png
- Disk operations - reads: http://pasteboard.co/2B8U8E33d.png
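As referenced above, here are the sketches. First, the RGW side (256 index shards for newly created buckets and the 8192-thread change). This is only an approximation - the instance/section name, the port, and whether the thread count is applied via civetweb's num_threads or the RGW thread pool may differ from my actual ceph.conf, which is in the pastebin above:

  [client.rgw.ceph-mon01]    # illustrative instance name
      # civetweb frontend with a larger worker thread count
      rgw frontends = civetweb port=7480 num_threads=8192
      # shard the index of newly created buckets across 256 RADOS objects
      rgw override bucket index max shards = 256
      # RGW worker thread pool
      rgw thread pool size = 8192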
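Second, the OSD-side and XFS tunings from the latest tests. The numeric values below are placeholders rather than my exact settings (and nobarrier is only in place because the write caches are protected):

  [osd]
      # XFS mount options for the filestore data partitions
      osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M,nobarrier
      # leveldb (omap) tuning
      leveldb cache size = 536870912          # 512 MB, placeholder value
      leveldb write buffer size = 33554432    # 32 MB, placeholder value
      leveldb compression = false
      # osdmap caching / trimming
      osd map cache size = 200                # placeholder value
      osd pg epoch persisted max stale = 150  # placeholder value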
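Third, moving the index/meta pools to SSD-only OSDs was done with a separate CRUSH root and a simple rule, roughly like this (the root/rule names and ruleset id are illustrative; on Jewel the pool property is crush_ruleset):

  # separate CRUSH root containing only the SSD OSDs (hosts/OSDs moved under it afterwards)
  ceph osd crush add-bucket ssd root
  ceph osd crush rule create-simple ssd_rule ssd host firstn
  # point the RGW index pool (and likewise the other small rgw metadata pools) at that rule
  ceph osd pool set default.rgw.buckets.index crush_ruleset <ssd_ruleset_id>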
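Finally, the erasure-coded data pool test (4+2) and the 2048-PG test were along these lines (Jewel syntax; the profile and pool names are illustrative):

  # 4+2 erasure coding profile and a data pool created from it
  ceph osd erasure-code-profile set ec42 k=4 m=2 ruleset-failure-domain=host
  ceph osd pool create default.rgw.buckets.data.ec 2048 2048 erasure ec42
  # pg_num/pgp_num bump on the replicated data pool
  ceph osd pool set default.rgw.buckets.data pg_num 2048
  ceph osd pool set default.rgw.buckets.data pgp_num 2048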
Thanks.

--
Kind regards,
Stas Starikevich, CISSP