On 01/06/2015 04:45 PM, Robert LeBlanc wrote:
Seems like a message bus would be nice. Each opener of an RBD could subscribe for messages on the bus for that RBD. Anytime the map is modified a message could be put on the bus to update the others. That opens up a whole other can of worms though.
RADOS watch/notify functions are used as a limited form of this. That's how rbd notices that e.g. snapshots are created or disks are resized online. With the object map code the idea is to funnel all management operations like that through a single client that has locked the image for write access (all handled automatically by librbd). Using watch/notify to coordinate multi-client access would get complex and inefficient pretty fast, and in general is best left to cephfs rather than rbd.

Josh
On Jan 6, 2015 5:35 PM, "Josh Durgin" <josh.durgin@xxxxxxxxxxx> wrote:

On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of the code that Xiaoxi mentions. Is the idea that the client caches the bitmap for the RBD so it knows which OSDs to contact (thus saving a round trip to the OSD), or only for the OSD to know which objects exist on its disk?

It's purely at the rbd level, so librbd caches it and maintains its consistency. The idea is that since it's kept consistent, librbd can do things like delete exactly the objects that exist without any extra communication with the osds. Many things that were O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires a single writer, so this does not work for the rare case of e.g. ocfs2 on top of rbd, where there are multiple clients writing to the same rbd image at once.

Josh

On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:

Can't this be done in parallel? If the OSD doesn't have an object then it is a no-op and should be pretty quick. The number of outstanding operations could be limited to 100 or 1000, which would provide a balance between speed and performance impact if there is data to be trimmed. I'm not a big fan of a "--skip-trimming" option, as there is the potential to leave orphan objects that may not be cleaned up correctly.

Yeah, a --skip-trimming option seems a bit dangerous. This trimming actually is parallelized (10 ops at once by default, changeable via --rbd-concurrent-management-ops) since dumpling.

What will really help without being dangerous is keeping a map of object existence [1]. This will avoid any unnecessary trimming automatically, and it should be possible to add to existing images. It should be in hammer.
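To make the O(size of image) vs. O(written objects in image) point concrete, here is a toy sketch of the object-map idea (purely illustrative; this is not librbd's actual data structure, just a set standing in for the existence bitmap):

```python
# Toy sketch of the object-map idea: track which objects were ever written,
# so a shrink deletes only existing objects instead of probing every
# possible object in the trimmed range.
class ToyImage:
    def __init__(self, size_objects):
        self.size = size_objects
        self.exists = set()          # "object map": indices of written objects

    def write(self, obj_index):
        self.exists.add(obj_index)   # the single writer keeps the map consistent

    def shrink(self, new_size_objects):
        # O(written objects), not O(size of image):
        doomed = [i for i in self.exists if i >= new_size_objects]
        for i in doomed:
            self.exists.discard(i)   # stands in for one delete op to an OSD
        self.size = new_size_objects
        return len(doomed)           # delete requests actually sent

img = ToyImage(size_objects=170_000_000)   # hugely over-grown image
for i in (0, 1, 2, 99):                    # only four objects ever written
    img.write(i)
print(img.shrink(10))                      # -> 1 (only object 99 is deleted)
```

Without the map, the same shrink would have to issue ~170 million deletes, almost all of them no-ops; the single-writer restriction Josh mentions is what keeps `exists` trustworthy.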
Josh

[1] https://github.com/ceph/ceph/pull/2700

On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3kaj@xxxxxxxxx> wrote:

On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:

When shrinking the RBD, most of the time is spent in librbd/internal.cc::trim_image(). In this function the client iterates over all unneeded objects (whether or not they exist) and deletes them. So in this case, when Edwin shrank his RBD from 650PB to 650GB, there were [ (650PB * 1024 * 1024 GB/PB - 650GB) * 1024 MB/GB ] / 4 MB/object ≈ 174,482,880,000 objects to be deleted. That will definitely take a long time, since the rbd client needs to send a delete request for each one, and the OSD needs to look up the object context and delete it (or find it doesn't exist at all). The time needed to trim an image is proportional to the size being trimmed.

Making another image of the correct size and copying your VM's file system to the new image, then deleting the old one, will NOT help in general, because deleting the old volume will take exactly the same time as shrinking: both call trim_image().

The solution in my mind is that we could provide a "--skip-trimming" flag to skip the trimming. When the administrator is absolutely sure no writes have taken place in the shrunk area (that is, no objects were created there), they can use this flag to skip the time-consuming trimming. What do you think?

That sounds like a good solution.
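Xiaoxi's back-of-envelope figure can be checked directly (a minimal sketch, assuming 4 MB objects and binary units; note the PB→GB conversion is 1024², which puts the count near 174 billion):

```python
# Sanity-check the trim cost for Edwin's 650PB -> 650GB shrink.
MB_PER_GB = 1024
GB_PER_PB = 1024 * 1024          # PB -> GB is 1024^2, not 1024

old_size_mb = 650 * GB_PER_PB * MB_PER_GB   # 650 PB expressed in MB
new_size_mb = 650 * MB_PER_GB               # 650 GB expressed in MB
objects_to_trim = (old_size_mb - new_size_mb) // 4   # 4 MB per object

print(objects_to_trim)   # 174482880000, i.e. ~174.5 billion delete requests
```

At 10 concurrent management ops (the dumpling default mentioned above), a count that size explains why the resize sat at 1% for days.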
Like doing "undo grow image".

From: Jake Young [mailto:jak3kaj@xxxxxxxxx]
Sent: Monday, January 5, 2015 9:45 PM
To: Chen, Xiaoxi
Cc: Edwin Peer; ceph-users@xxxxxxxxxxxxxx
Subject: Re: rbd resize (shrink) taking forever and a day

On Sunday, January 4, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:

You can use rbd info <volume_name> to see the block_name_prefix. Object names look like <block_name_prefix>.<sequence_number>, so for example rb.0.ff53.3d1b58ba.00000000e6ad is the 0xe6ad'th object of the volume with block_name_prefix rb.0.ff53.3d1b58ba.

$ rbd info huge
rbd image 'huge':
        size 1024 TB in 268435456 objects
        order 22 (4096 kB objects)
        block_name_prefix: rb.0.8a14.2ae8944a
        format: 1

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Edwin Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?

<snip>
ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636
</snip>

And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.00000000e6ad
rb.0.6d386.1d545c4d.000000011461
rb.0.50703.3804823e.000000001c28
rb.0.1073e.3d1b58ba.00000000b715
rb.0.1d76.2ae8944a.00000000022d

which seem to bear no resemblance to the actual image names that the rbd command line tools understand?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:

On Sunday, January 4, 2015, Dyweni - Ceph-Users <6EXbab4FYk8H@xxxxxxxxxx> wrote:

Hi,

If it's the only thing in your pool, you could try deleting the pool instead. I found that to be faster in my testing; I had created 500TB when I meant to create 500GB.
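The naming scheme Xiaoxi describes can be computed directly. A small sketch (assuming 4 MB objects, i.e. order 22, and the format-1 style twelve-hex-digit suffix seen in Edwin's listing):

```python
# Map a byte offset in an rbd image to its backing object name
# (assumes order 22, i.e. 4 MB objects, and a format-1 style
# 12-hex-digit sequence-number suffix).
def object_name(prefix, offset, order=22):
    seq = offset >> order            # which 2^order-byte object holds offset
    return f"{prefix}.{seq:012x}"

# Object 0xe6ad of the volume from the example above:
print(object_name("rb.0.ff53.3d1b58ba", 0xE6AD << 22))
# rb.0.ff53.3d1b58ba.00000000e6ad
```

This is why the names in `rados ls` bear no resemblance to image names: only the prefix ties them back to an image, via `rbd info`.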
Note for the devs: it would be nice if rbd create/resize accepted sizes with units (i.e. MB, GB, TB, PB, etc).

On 2015-01-04 08:45, Edwin Peer wrote:

Hi there,

I did something stupid while growing an rbd image. I accidentally mistook the units of the resize command for bytes instead of megabytes and grew an rbd image to 650PB instead of 650GB. This all happened instantaneously enough, but trying to rectify the mistake is not going nearly as well.

<snip>
ganymede ~ # rbd resize --size 665600 --allow-shrink client-disk-img0/vol-x318644f-0
Resizing image: 1% complete...
</snip>

It took a couple of days before it started showing 1% complete and has been stuck on 1% for a couple more. At this rate, I should be able to shrink the image back to the intended size in about 2016.

Any ideas?

Regards,
Edwin Peer

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

You can just delete the rbd header.
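The unit-suffix feature Dyweni asks for is straightforward; a hypothetical sketch of what such parsing could look like (the function name and behaviour here are invented for illustration; rbd's --size took plain megabytes at the time):

```python
# Hypothetical unit-aware size parser for an rbd-like CLI, returning
# megabytes (the unit rbd's --size expected). Names here are invented.
UNITS = {"MB": 1, "GB": 1024, "TB": 1024**2, "PB": 1024**3}  # factor -> MB

def parse_size_mb(text):
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)   # bare number: keep the old "megabytes" behaviour

print(parse_size_mb("650GB"))   # 665600 -- what Edwin meant
print(parse_size_mb("650PB"))   # 697932185600 -- what he got
```

An explicit suffix would have made the 650GB-vs-650PB mix-up impossible to type by accident.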
See Sebastien's excellent blog: http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/

Jake

Sorry, I misunderstood. The simplest approach to me is to make another image of the correct size, copy your VM's file system to the new image, then delete the old one. The safest thing to do would be to mount the new file system from the VM and do all the formatting / copying from there (the same way you'd move a physical server's root disk to a new physical disk).

I would not attempt to hack the rbd header. You open yourself up to unforeseen problems, unless one of the ceph developers can confirm there is a safe way to shrink an image, assuming we know that the file system has not grown since growing the disk.

Jake