Living with huge bucket sizes

This has come up quite a few times before, but since I was only working with
RBD before I didn't pay too close attention to the conversation.  I'm looking
for the best way to handle existing clusters that have buckets with a large
number of objects (>20 million) in them.  The cluster I'm doing tests on is
currently running hammer (0.94.10), so if things got better in jewel I would
love to hear about it!

 

One idea I've played with is to create a new SSD pool by adding an OSD to
every journal SSD.  My thinking was that our data is mostly small objects
(~100KB), so the journal drives were unlikely to be getting close to any
throughput limitations.  They should also have plenty of IOPS left to handle
the .rgw.buckets.index pool.
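
Per SSD that was roughly the usual hammer ceph-disk dance against whatever
space was left on the drive (the device name below is just a placeholder):

ceph-disk prepare /dev/sdf4     # spare partition on the journal SSD
ceph-disk activate /dev/sdf4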

 

So on our test cluster I created a separate root that I called
rgw-buckets-index, added all of the OSDs I had created on the journal SSDs to
it, and created a new CRUSH rule to place data on it:

ceph osd crush rule create-simple rgw-buckets-index_ruleset rgw-buckets-index chassis
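
For completeness, getting the journal-SSD OSDs under that root was basically
the following, repeated per chassis and per OSD (bucket names, OSD ids, and
weights below are just placeholders):

ceph osd crush add-bucket rgw-buckets-index root
ceph osd crush add-bucket chassis01-index chassis
ceph osd crush move chassis01-index root=rgw-buckets-index
ceph osd crush create-or-move osd.120 0.2 root=rgw-buckets-index chassis=chassis01-index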

 

Once everything was set up correctly I tried switching the .rgw.buckets.index
pool over to it by doing:

 

ceph osd set norebalance

ceph osd pool set .rgw.buckets.index crush_ruleset 1

# Wait for peering to complete

ceph osd unset norebalance
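
To tell when peering had settled before unsetting the flag I just watched the
PG states, e.g.:

watch 'ceph pg stat'    # wait for the peering/activating counts to drop to zero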

 

Things started off well, but once it got to backfilling the PGs which have
the large buckets on them, I started seeing a large number of slow requests
like these:

 

  ack+ondisk+write+known_if_redirected e68708) currently waiting for degraded object

  ondisk+write+known_if_redirected e68708) currently waiting for degraded object

  ack+ondisk+write+known_if_redirected e68708) currently waiting for rw locks
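
Those are trimmed from the slow request warnings in the cluster log; I was
pulling them out with something along the lines of:

grep 'slow request' /var/log/ceph/ceph.log | tail -n 50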

 

Digging in on the OSDs, it seemed they would either restart or die after
seeing a lot of these messages:

 

  heartbeat_map is_healthy 'OSD::recovery_tp thread 0x7f8f5d604700' had timed out after 30

 

or:

 

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f99ec2e4700' had timed out after 15

 

The ones that died saw messages like these:

 

  heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fcd59e7c700' had timed out after 60

 

Followed by:

 

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fcd48c1d700' had suicide timed out after 150
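
For reference, those numbers look like the defaults for the corresponding
thread timeouts, which as far as I can tell are (in ceph.conf terms):

[osd]
  osd recovery thread timeout = 30      # recovery_tp heartbeat
  osd op thread timeout = 15            # osd_op_tp heartbeat
  filestore op thread timeout = 60      # FileStore::op_tp heartbeat
  osd op thread suicide timeout = 150   # osd_op_tp suicide

Presumably bumping those (at least on the index OSDs while the move runs)
would keep the OSDs from getting shot, but that only papers over how long the
omap recovery takes.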

 

 

The backfilling process appeared to hang on some of the PGs, but I figured
out that they were recovering omap data and was able to keep an eye on the
progress by running:

 

watch 'ceph pg 272.22 query | grep omap_recovered_to'

 

A lot of the timeouts happened after the PGs finished the omap recovery,
which took over an hour on one of the PGs.
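
To find the PGs worth watching I just grabbed the index pool's PGs that were
backfilling, along the lines of (272 being our .rgw.buckets.index pool id):

ceph pg dump pgs_brief | awk '$1 ~ /^272\./ && $2 ~ /backfill/'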

 

Has anyone found a good solution for this on existing large buckets?  I know
sharding is the solution going forward, but AFAIK it can't be done on existing
buckets yet (although the dynamic resharding work mentioned on today's
performance call sounds promising).
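
In the meantime, my understanding is that hammer can at least shard the index
for newly created buckets via rgw_override_bucket_index_max_shards in
ceph.conf (section name and shard count below are just examples):

[client.radosgw.gateway]
  rgw override bucket index max shards = 64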

 

Thanks,
Bryan

