This has come up quite a few times before, but since I was only working with RBD until now I didn't pay too close attention to the conversation. I'm looking for the best way to handle existing clusters that have buckets with a large number of objects (>20 million) in them. The cluster I'm testing on is currently running hammer (0.94.10), so if things got better in jewel I would love to hear about it!

One idea I've played with is to create a new SSD pool by adding an OSD to every journal SSD. My thinking was that since our data is mostly small objects (~100KB), the journal drives were unlikely to be getting close to any throughput limitations, and they should have plenty of IOPS left over to handle the .rgw.buckets.index pool.

So on our test cluster I created a separate root that I called rgw-buckets-index, added all the OSDs I created on the journal SSDs to it, and created a new crush rule to place data on it:

  ceph osd crush rule create-simple rgw-buckets-index_ruleset rgw-buckets-index chassis

Once everything was set up correctly, I tried switching the .rgw.buckets.index pool over to it by doing:

  ceph osd set norebalance
  ceph osd pool set .rgw.buckets.index crush_ruleset 1
  # Wait for peering to complete
  ceph osd unset norebalance

Things started off well, but once it got to backfilling the PGs which hold the large buckets, I started seeing a large number of slow requests like these:

  ack+ondisk+write+known_if_redirected e68708) currently waiting for degraded object
  ondisk+write+known_if_redirected e68708) currently waiting for degraded object
  ack+ondisk+write+known_if_redirected e68708) currently waiting for rw locks

Digging in on the OSDs, it seems they would either restart or die after seeing a lot of these messages:

  heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f8f5d604700' had timed out after 30

or:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f99ec2e4700' had timed out after 15

The ones that died saw messages like this:

  heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd59e7c700' had timed out after 60

followed by:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd48c1d700' had suicide timed out after 150

The backfill would appear to hang on some of the PGs, but I figured out that they were recovering omap data, and was able to keep an eye on the progress by running:

  watch 'ceph pg 272.22 query | grep omap_recovered_to'

A lot of the timeouts happened after a PG finished its omap recovery, which took over an hour on one of the PGs.

Has anyone found a good solution for this for existing large buckets? I know sharding is the solution going forward, but AFAIK it can't be done on existing buckets yet (although the dynamic resharding work mentioned on today's performance call sounds promising).

Thanks,
Bryan
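P.S. Two stopgaps I'm considering, in case they're useful to anyone else (both untested here, so treat them as rough sketches). For new buckets we can at least shard the index up front in ceph.conf on the RGW hosts; the section name and shard count below are just placeholders for our setup:

  [client.rgw.gateway1]
      # Only applies to buckets created after this is set; existing
      # buckets keep their single index object.
      rgw override bucket index max shards = 16

And for the backfill itself, temporarily bumping the OSD thread timeouts might keep the index OSDs from hitting the suicide timeout in the middle of a long omap recovery (the numbers are guesses, and I'd revert them once the backfill finishes):

  ceph tell osd.* injectargs '--osd-op-thread-suicide-timeout 600 --osd-recovery-thread-timeout 300 --filestore-op-thread-suicide-timeout 1800'

If injectargs reports that a change isn't observed at runtime, the same options can go in ceph.conf followed by an OSD restart.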