Re: Large rbd

On Thu, Jan 21, 2021 at 6:18 PM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>
> On Thu, Jan 21, 2021 at 10:57:49AM +0100, Robert Sander wrote:
> > Hi,
> >
> > Am 21.01.21 um 05:42 schrieb Chris Dunlop:
> >
> >> Is there any particular reason for that MAX_OBJECT_MAP_OBJECT_COUNT, or
> >> it just "this is crazy large, if you're trying to go over this you're
> >> doing something wrong, rethink your life..."?
> >
> > IMHO the limit is there because of the way deletion of RBDs work. "rbd
> > rm" has to look for every object, not only the ones that were really
> > created. This would make deleting a very very large RBD take a very very
> > long time.
>
> I wouldn't have thought the ceph designers would have put in a hard limit like
> that just to protect people from long delete times.

You are free to disable the object-map when creating large images by
specifying the image features explicitly -- or you can increase the
object size from its default 4MiB allocation size (which is honestly no
different from QCOW2 increasing its cluster size as the image grows
larger).
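
For example (hypothetical pool/image name "rbd/hugefs", size just for
illustration), either approach works at creation time:

# no object-map (and hence no fast-diff), so the limit doesn't apply:
$ rbd create --size 1200T --image-feature layering,exclusive-lock,deep-flatten rbd/hugefs
# or keep the default features but use larger, and therefore fewer, objects:
$ rbd create --size 1200T --object-size 16M rbd/hugefs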

The issue is that a 1PiB image w/ 4MiB objects works out to
268,435,456 backing objects, and the object-map for that many objects
requires 64MiB of memory to store. It also just so happens that Ceph
has a hard limit on the maximum object size of around 90MiB, if I
recall correctly.
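
Spelling that out (the object-map stores 2 bits per backing object), a
quick sanity check of those numbers:

$ echo $(( 2**50 / (4 * 2**20) ))        # objects in a 1PiB image w/ 4MiB objects
268435456
$ echo $(( 268435456 * 2 / 8 / 2**20 ))  # object-map size in MiB at 2 bits per object
64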

> The removal time may well be a consideration for some but it's not a
> significant issue in this case as the filesystem is intended to last for years
> (the XFS and ZFS it's meant to replace have been around for maybe a decade).
>
> That said, it does take a while. For a 976T rbd (the largest possible w/
> default 4M objects) with a small amount written to it (maybe 4T):
>
> $ rbd info rbd.meta/fs
> rbd image 'fs':
>          size 976 TiB in 255852544 objects
>          order 22 (4 MiB objects)
>          snapshot_count: 0
>          id: 8126791dce2ad3
>          data_pool: rbd.ec.data
>          block_name_prefix: rbd_data.22.8126791dce2ad3
>          format: 2
>          features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
>          op_features:
>          flags:
>          create_timestamp: Thu Jan 21 14:03:38 2021
>          access_timestamp: Thu Jan 21 14:03:38 2021
>          modify_timestamp: Thu Jan 21 14:03:38 2021
>
> $ time rbd remove rbd.meta/fs
> real    117m31.183s
> user    116m56.895s
> sys     0m2.101s
>
> The issue is the number of objects. For instance, the same size rbd (976T) but
> created with "--object-size 16M":
>
> $ rbd info rbd.meta/fs
> rbd image 'fs':
>          size 976 TiB in 63963136 objects
>          order 24 (16 MiB objects)
>          ...
> $ time rbd remove rbd.meta/fs
> real    7m23.326s
> user    6m45.201s
> sys     0m1.272s
>
> I don't know if the amount written affects the rbd removal time.

When the object-map is enabled, only written data extents need to be
deleted. W/o the object-map, it would need to issue deletes against
all possible objects.
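
(This is also why "rbd du" is quick on an image like this when
fast-diff is enabled -- it can answer from the object-map rather than
statting every possible backing object. Shown here purely as an
illustration against the image above:)

$ rbd du rbd.meta/fs   # consults the object-map/fast-diff instead of scanning all 255852544 possible objects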

> >> Rather than a single large rbd, should I be looking at multiple smaller
> >> rbds linked together using lvm or somesuch? What are the tradeoffs?
> >
> > IMHO there are no tradeoffs; there could even be benefits to creating a
> > volume group with multiple physical volumes on RBD, as the requests can
> > be better parallelized (e.g. with a virtio-scsi-single controller for qemu).
>
> That's a good point, I hadn't considered potential i/o bandwidth benefits.
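
A minimal sketch of that kind of layout (hypothetical image names,
assuming the images are mapped via krbd and show up as
/dev/rbd0../dev/rbd3):

$ for i in 0 1 2 3; do rbd create --size 250T rbd.meta/fs-pv$i && rbd map rbd.meta/fs-pv$i; done
$ pvcreate /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
$ vgcreate fsvg /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
$ lvcreate -n fslv -l 100%FREE -i 4 -I 4M fsvg

Whether you stripe (-i/-I) or just linearly concatenate is up to you;
either way each mapped rbd device gets its own request queue.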
>
> Thanks,
>
> Chris
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


-- 
Jason
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


