Large rbd

Hi,

What limits are there on the "reasonable size" of an rbd?

E.g. when I try to create a 1 PiB rbd with default 4 MiB objects on my Octopus cluster:

$ rbd create --size 1P --data-pool rbd.ec rbd.meta/fs
2021-01-20T18:19:35.799+1100 7f47a99253c0 -1 librbd::image::CreateRequest: validate_layout: image size not compatible with object map

...which comes from:

== src/librbd/image/CreateRequest.cc
bool validate_layout(CephContext *cct, uint64_t size, file_layout_t &layout) {
  if (!librbd::ObjectMap<>::is_compatible(layout, size)) {
    lderr(cct) << "image size not compatible with object map" << dendl;
    return false;
  }

== src/librbd/ObjectMap.cc
template <typename I>
  bool ObjectMap<I>::is_compatible(const file_layout_t& layout, uint64_t size) {
    uint64_t object_count = Striper::get_num_objects(layout, size);
    return (object_count <= cls::rbd::MAX_OBJECT_MAP_OBJECT_COUNT);
  }

== src/cls/rbd/cls_rbd_types.h
static const uint32_t MAX_OBJECT_MAP_OBJECT_COUNT = 256000000;

For 4 MiB objects that object count equates to just over 976 TiB, i.e. a little under the 1 PiB I asked for, hence the error.
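
For reference, a quick back-of-the-envelope check (plain bc, nothing Ceph-specific):

$ echo "scale=4; 256000000 * 4 / 1024 / 1024" | bc   # max objects * 4 MiB, in TiB
976.5625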

Is there any particular reason for that MAX_OBJECT_MAP_OBJECT_COUNT, or is it just "this is crazy large, if you're trying to go over this you're doing something wrong, rethink your life..."?

Yes, I realise I can increase the size of the objects to get a larger rbd, or drop the object-map support (and the fast-diff that goes along with it).
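
E.g. something like one of these should get past that check (same pool/image names as above, untested):

# 8 MiB objects roughly double the maximum object-map-compatible size (~1953 TiB)
$ rbd create --size 1P --object-size 8M --data-pool rbd.ec rbd.meta/fs

# or keep 4 MiB objects but create with only layering, i.e. no object-map/fast-diff
$ rbd create --size 1P --image-feature layering --data-pool rbd.ec rbd.meta/fs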

I'm SO glad I found this limit now rather than starting on a smaller rbd and finding the limit when I tried to grow the rbd underneath a rapidly filling filesystem.

What else should I know?

Background: I currently have nearly 0.5 PB on XFS (on lvm / raid6) and ZFS that I'm looking to move over to ceph. XFS is a requirement for the reflinking (sadly not yet available in CephFS: https://tracker.ceph.com/issues/1680). The recommendation for XFS is to start large on a thin-provisioned store (hello rbd!) rather than start small and grow as needed - e.g. see the thread surrounding:

https://www.spinics.net/lists/linux-xfs/msg20099.html

Rather than a single large rbd, should I be looking at multiple smaller rbds linked together using lvm or somesuch? What are the tradeoffs?
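
To be concrete, what I have in mind is something like this (hypothetical sizes, names and device paths, untested):

$ rbd create --size 500T --data-pool rbd.ec rbd.meta/fs0
$ rbd create --size 500T --data-pool rbd.ec rbd.meta/fs1
$ rbd map rbd.meta/fs0    # -> /dev/rbd0 (device names assumed)
$ rbd map rbd.meta/fs1    # -> /dev/rbd1
$ pvcreate /dev/rbd0 /dev/rbd1
$ vgcreate vg_fs /dev/rbd0 /dev/rbd1
$ lvcreate -l 100%FREE -n lv_fs vg_fs   # linear concatenation across both rbds
$ mkfs.xfs /dev/vg_fs/lv_fs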

And whilst we're here... for an rbd with the data on an erasure-coded pool, how do you calculate the amount of rbd metadata required if/when the rbd data is fully allocated?


Cheers,

Chris


