On Thu, Jan 21, 2021 at 10:57:49AM +0100, Robert Sander wrote:
> Hi,
> On 21.01.21 at 05:42, Chris Dunlop wrote:
>> Is there any particular reason for that MAX_OBJECT_MAP_OBJECT_COUNT, or is
>> it just "this is crazy large, if you're trying to go over this you're
>> doing something wrong, rethink your life..."?
> IMHO the limit is there because of the way deletion of RBDs works. "rbd
> rm" has to look for every object, not only the ones that were really
> created. This would make deleting a very very large RBD take a very very
> long time.
I wouldn't have thought the ceph designers would put in a hard limit like
that just to protect people from a long delete time.
The removal time may well be a consideration for some, but it's not a
significant issue in this case as the filesystem is intended to last for years
(the XFS and ZFS it's meant to replace have been around for maybe a decade).
That said, it does take a while. For a 976T rbd (the largest possible w/
default 4M objects) with a small amount written to it (maybe 4T):
$ rbd info rbd.meta/fs
rbd image 'fs':
        size 976 TiB in 255852544 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 8126791dce2ad3
        data_pool: rbd.ec.data
        block_name_prefix: rbd_data.22.8126791dce2ad3
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
        op_features:
        flags:
        create_timestamp: Thu Jan 21 14:03:38 2021
        access_timestamp: Thu Jan 21 14:03:38 2021
        modify_timestamp: Thu Jan 21 14:03:38 2021
$ time rbd remove rbd.meta/fs
real 117m31.183s
user 116m56.895s
sys 0m2.101s
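(For reference, the object count is just size / object size, and with 4M
objects a 976T image sits right at what I believe is the 256000000-object
MAX_OBJECT_MAP_OBJECT_COUNT limit; quick arithmetic using the numbers from the
info output above:)

$ echo $(( 976 * 1024 * 1024 / 4 ))    # 976 TiB expressed in 4 MiB objects
255852544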
The issue is the number of objects. For instance, the same size rbd (976T) but
created with "--object-size 16M":
$ rbd info rbd.meta/fs
rbd image 'fs':
        size 976 TiB in 63963136 objects
        order 24 (16 MiB objects)
        ...
$ time rbd remove rbd.meta/fs
real 7m23.326s
user 6m45.201s
sys 0m1.272s
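(In case anyone wants to reproduce: the 16M-object image was created with
something along these lines; the flags are reconstructed from the info output
above rather than copy-pasted, so treat it as a sketch:)

$ rbd create --size 976T --object-size 16M --data-pool rbd.ec.data rbd.meta/fs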
I don't know if the amount written affects the rbd removal time.
>> Rather than a single large rbd, should I be looking at multiple smaller
>> rbds linked together using lvm or somesuch? What are the tradeoffs?
> IMHO there are no tradeoffs; there could even be benefits in creating a
> volume group with multiple physical volumes on RBD, as the requests can
> be better parallelized (e.g. the virtio-scsi single controller for qemu).
That's a good point; I hadn't considered the potential i/o bandwidth benefits.
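If I go that way, I imagine it would look something like the below: four
quarter-sized images striped into one LV. Untested sketch; the vg/lv names are
made up and the device paths assume the usual /dev/rbd/<pool>/<image> udev
symlinks:

$ rbd create --size 244T --object-size 16M --data-pool rbd.ec.data rbd.meta/fs1
$ rbd map rbd.meta/fs1
  (likewise for fs2, fs3, fs4)
$ pvcreate /dev/rbd/rbd.meta/fs{1,2,3,4}
$ vgcreate vg_fs /dev/rbd/rbd.meta/fs{1,2,3,4}
$ lvcreate -i 4 -I 4M -l 100%FREE -n fs vg_fs   # stripe size is a guess, tune to taste
$ mkfs.xfs /dev/vg_fs/fs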
Thanks,
Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx