This has come up quite a few times before, but since I was only working with RBD until now I didn't pay too close attention to the conversation. I'm looking for the best way to handle existing clusters that have buckets with a large number of objects (>20 million) in them. The cluster I'm testing on is currently running hammer (0.94.10), so if things got better in jewel I would love to hear about it!

One idea I've played with is to create a new SSD pool by adding an OSD to every journal SSD. My thinking was that since our data is mostly small objects (~100KB), the journal drives were unlikely to be getting close to any throughput limitations, and they should have plenty of IOPS left over to handle the .rgw.buckets.index pool.

So on our test cluster I created a separate root that I called rgw-buckets-index, added all the OSDs I created on the journal SSDs to it, and created a new crush rule to place data on it:

  ceph osd crush rule create-simple rgw-buckets-index_ruleset rgw-buckets-index chassis

Once everything was set up correctly, I tried switching the .rgw.buckets.index pool over to it by doing:

  ceph osd set norebalance
  ceph osd pool set .rgw.buckets.index crush_ruleset 1
  # Wait for peering to complete
  ceph osd unset norebalance

Things started off well, but once it got to backfilling the PGs which hold the large buckets, I started seeing a large number of slow requests like these:

  ack+ondisk+write+known_if_redirected e68708) currently waiting for degraded object
  ondisk+write+known_if_redirected e68708) currently waiting for degraded object
  ack+ondisk+write+known_if_redirected e68708) currently waiting for rw locks

Digging in on the OSDs, it seems they would either restart or die after seeing a lot of these messages:

  heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f8f5d604700' had timed out after 30

or:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f99ec2e4700' had timed out after 15

The ones that died saw messages like this:

  heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd59e7c700' had timed out after 60

followed by:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd48c1d700' had suicide timed out after 150

The backfill would appear to hang on some of the PGs, but I figured out that they were recovering omap data, and was able to keep an eye on the progress by running:

  watch 'ceph pg 272.22 query | grep omap_recovered_to'

A lot of the timeouts happened after a PG finished its omap recovery, which took over an hour on one of the PGs.

Has anyone found a good solution for this for existing large buckets? I know sharding is the solution going forward, but AFAIK it can't be done on existing buckets yet (although the dynamic resharding work mentioned on today's performance call sounds promising).

Thanks,
Bryan
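P.S. Two stopgaps I'm considering, in case they're useful to anyone else (both untested here, so treat them as rough sketches). For new buckets we can at least shard the index up front in ceph.conf on the RGW hosts; the section name and shard count below are just placeholders for our setup:

  [client.rgw.gateway1]
      # Only applies to buckets created after this is set; existing
      # buckets keep their single index object.
      rgw override bucket index max shards = 16

And for the backfill itself, temporarily bumping the OSD thread timeouts might keep the index OSDs from hitting the suicide timeout in the middle of a long omap recovery (the numbers are guesses, and I'd revert them once the backfill finishes):

  ceph tell osd.* injectargs '--osd-op-thread-suicide-timeout 600 --osd-recovery-thread-timeout 300 --filestore-op-thread-suicide-timeout 1800'

If injectargs reports that a change isn't observed at runtime, the same options can go in ceph.conf followed by an OSD restart.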