The bitmap certainly sounds like it would help shortcut a lot of the code that Xiaoxi mentions. Is the idea that the client caches the bitmap for the RBD so it knows which OSDs to contact (thus saving a round trip to the OSD), or only for the OSD to know which objects exist on its disk?

On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
> On 01/06/2015 10:24 AM, Robert LeBlanc wrote:
>>
>> Can't this be done in parallel? If the OSD doesn't have an object then it is a noop and should be pretty quick. The number of outstanding operations could be limited to 100 or 1000, which would provide a balance between speed and performance impact if there is data to be trimmed. I'm not a big fan of a "--skip-trimming" option, as there is the potential to leave some orphan objects that may not be cleaned up correctly.
>
> Yeah, a --skip-trimming option seems a bit dangerous. This trimming actually has been parallelized (10 ops at once by default, changeable via --rbd-concurrent-management-ops) since dumpling.
>
> What will really help without being dangerous is keeping a map of object existence [1]. This will avoid any unnecessary trimming automatically, and it should be possible to add to existing images. It should be in hammer.
>
> Josh
>
> [1] https://github.com/ceph/ceph/pull/2700
>
>> On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3kaj@xxxxxxxxx> wrote:
>>>
>>> On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:
>>>>
>>>> When you shrink an RBD, most of the time is spent in librbd/internal.cc::trim_image(). In this function, the client iterates over all unnecessary objects (no matter whether they exist) and deletes them.
>>>>
>>>> So in this case, when Edwin shrinks his RBD from 650PB to 650GB, there are [ (650PB * 1024 * 1024 GB/PB - 650GB) * 1024MB/GB ] / 4MB/object, roughly 174,482,880,000, objects that need to be deleted. That will definitely take a long time, since the rbd client needs to send a delete request for every object and the OSD needs to look up the object context and delete it (or find that it doesn't exist at all). The time needed to trim an image is proportional to the size being trimmed.
>>>>
>>>> Making another image of the correct size, copying your VM's file system to the new image, and then deleting the old one will NOT help in general, because deleting the old volume takes exactly the same time as shrinking: both need to call trim_image().
>>>>
>>>> The solution in my mind is that we could provide a "--skip-trimming" flag to skip the trimming. When the administrator is absolutely sure that no writes have taken place in the area being shrunk away (that is, no objects were created there), they can use this flag to skip the time-consuming trimming.
>>>>
>>>> What do you think?
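(As a sanity check of Xiaoxi's arithmetic, assuming the default 4 MB object size and binary unit conversions, with all sizes expressed in MB first; this is only back-of-the-envelope shell arithmetic, not output from any cluster:

$ echo $(( (650 * 1024 * 1024 * 1024 - 650 * 1024) / 4 ))
174482880000

so the shrink from 650PB back to 650GB has on the order of 10^11 object deletions to issue.)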
Like doing "undo grow image" >>> >>> >>>> >>>> >>>> From: Jake Young [mailto:jak3kaj@xxxxxxxxx] >>>> Sent: Monday, January 5, 2015 9:45 PM >>>> To: Chen, Xiaoxi >>>> Cc: Edwin Peer; ceph-users@xxxxxxxxxxxxxx >>>> Subject: Re: rbd resize (shrink) taking forever and a day >>>> >>>> >>>> >>>> >>>> >>>> On Sunday, January 4, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote: >>>> >>>> You could use rbd info <volume_name> to see the block_name_prefix, the >>>> object name consist like <block_name_prefix>.<sequence_number>, so for >>>> example, rb.0.ff53.3d1b58ba.00000000e6ad should be the <e6ad>th object >>>> of >>>> the volume with block_name_prefix rb.0.ff53.3d1b58ba. >>>> >>>> $ rbd info huge >>>> rbd image 'huge': >>>> size 1024 TB in 268435456 objects >>>> order 22 (4096 kB objects) >>>> block_name_prefix: rb.0.8a14.2ae8944a >>>> format: 1 >>>> >>>> -----Original Message----- >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of >>>> Edwin Peer >>>> Sent: Monday, January 5, 2015 3:55 AM >>>> To: ceph-users@xxxxxxxxxxxxxx >>>> Subject: Re: rbd resize (shrink) taking forever and a day >>>> >>>> Also, which rbd objects are of interest? >>>> >>>> <snip> >>>> ganymede ~ # rados -p client-disk-img0 ls | wc -l >>>> 1672636 >>>> </snip> >>>> >>>> And, all of them have cryptic names like: >>>> >>>> rb.0.ff53.3d1b58ba.00000000e6ad >>>> rb.0.6d386.1d545c4d.000000011461 >>>> rb.0.50703.3804823e.000000001c28 >>>> rb.0.1073e.3d1b58ba.00000000b715 >>>> rb.0.1d76.2ae8944a.00000000022d >>>> >>>> which seem to bear no resemblance to the actual image names that the rbd >>>> command line tools understands? >>>> >>>> Regards, >>>> Edwin Peer >>>> >>>> On 01/04/2015 08:48 PM, Jake Young wrote: >>>>> >>>>> >>>>> >>>>> On Sunday, January 4, 2015, Dyweni - Ceph-Users >>>>> <6EXbab4FYk8H@xxxxxxxxxx <mailto:6EXbab4FYk8H@xxxxxxxxxx>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> If its the only think in your pool, you could try deleting the >>>>> pool instead. >>>>> >>>>> I found that to be faster in my testing; I had created 500TB when >>>>> I meant to create 500GB. >>>>> >>>>> Note for the Devs: I would be nice if rbd create/resize would >>>>> accept sizes with units (i.e. MB GB TB PB, etc). >>>>> >>>>> >>>>> >>>>> >>>>> On 2015-01-04 08:45, Edwin Peer wrote: >>>>> >>>>> Hi there, >>>>> >>>>> I did something stupid while growing an rbd image. I >>>>> accidentally >>>>> mistook the units of the resize command for bytes instead of >>>>> megabytes >>>>> and grew an rbd image to 650PB instead of 650GB. This all >>>>> happened >>>>> instantaneously enough, but trying to rectify the mistake is >>>>> not going >>>>> nearly as well. >>>>> >>>>> <snip> >>>>> ganymede ~ # rbd resize --size 665600 --allow-shrink >>>>> client-disk-img0/vol-x318644f-0 >>>>> Resizing image: 1% complete... >>>>> </snip> >>>>> >>>>> It took a couple days before it started showing 1% complete >>>>> and has >>>>> been stuck on 1% for a couple more. At this rate, I should be >>>>> able to >>>>> shrink the image back to the intended size in about 2016. >>>>> >>>>> Any ideas? >>>>> >>>>> Regards, >>>>> Edwin Peer >>>>> _______________________________________________ >>>>> ceph-users mailing list >>>>> ceph-users@xxxxxxxxxxxxxx >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>>> _______________________________________________ >>>>> ceph-users mailing list >>>>> ceph-users@xxxxxxxxxxxxxx >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>>> >>>>> You can just delete the rbd header. 
>>>>> You can just delete the rbd header. See Sebastien's excellent blog:
>>>>>
>>>>> http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/
>>>>>
>>>>> Jake
>>>>
>>>> Sorry, I misunderstood.
>>>>
>>>> The simplest approach to me is to make another image of the correct size and copy your VM's file system to the new image, then delete the old one.
>>>>
>>>> The safest thing to do would be to mount the new file system from the VM and do all the formatting / copying from there (the same way you'd move a physical server's root disk to a new physical disk).
>>>>
>>>> I would not attempt to hack the rbd header. You open yourself up to some unforeseen problems.
>>>>
>>>> Unless one of the ceph developers can confirm that there is a safe way to shrink an image, assuming we know that the file system has not grown since the disk was grown.
>>>>
>>>> Jake

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
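(For completeness, a rough sketch of the copy-to-a-new-image approach Jake describes, using the pool from the thread and a hypothetical new image name; how the new disk gets attached to the VM depends on your hypervisor, and deleting the old 650PB image will itself have to trim, so expect that last step to take just as long as the shrink:

$ rbd create --size 665600 client-disk-img0/vol-x318644f-0-new
# attach the new disk to the VM; inside the VM, partition and format it, mount it,
# and copy the root file system across (e.g. with rsync), then repoint the VM at
# the new disk and confirm it boots before removing the old image:
$ rbd rm client-disk-img0/vol-x318644f-0

The object existence map Josh mentions [1] is aimed at avoiding exactly this kind of unnecessary trimming automatically.)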