Re: rbd resize (shrink) taking forever and a day

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Tue, 6 Jan 2015 17:45:39 -0700

Seems like a message bus would be nice. Each opener of an RBD could subscribe for messages on the bus for that RBD. Anytime the map is modified a message could be put on the bus to update the others. That opens up a whole other can of worms though. 
Robert LeBlanc
Sent from a mobile device please excuse any typos.
On Jan 6, 2015 5:35 PM, "Josh Durgin" <josh.durgin@xxxxxxxxxxx> wrote:
On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of the code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it knows which OSDs to contact (thus saving a round trip
to the OSD), or only for the OSD to know which objects exist on its
disk?

It's purely at the rbd level, so librbd caches it and maintains its

consistency. The idea is that since it's kept consistent, librbd can do

things like delete exactly the objects that exist without any

extra communication with the osds. Many things that were

O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires

a single writer, so this does not work for the rare case of e.g. ocfs2

on top of rbd, where there are multiple clients writing to the same

rbd image at once.
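
To illustrate the complexity difference described above, here is a
minimal Python sketch. It is not librbd's implementation: the class
ObjectMap and its methods are hypothetical names, and a plain Python set
stands in for the real bitmap. It only shows why a shrink becomes
proportional to the number of written objects once existence is tracked.

```python
class ObjectMap:
    """Toy object-existence map for one image (hypothetical, not librbd)."""

    def __init__(self, num_objects):
        self.num_objects = num_objects
        self.existing = set()  # indices of objects that have been written

    def mark_written(self, index):
        if not 0 <= index < self.num_objects:
            raise IndexError("object index outside image")
        self.existing.add(index)

    def shrink(self, new_num_objects):
        """Drop only the objects known to exist past the new end."""
        doomed = sorted(i for i in self.existing if i >= new_num_objects)
        self.existing.difference_update(doomed)
        self.num_objects = new_num_objects
        return doomed  # these are the only deletes that need to be sent


# A 1024 TB image in 4 MB objects, with only three objects ever written.
omap = ObjectMap(268435456)
for idx in (0, 7, 200000000):
    omap.mark_written(idx)

# Shrinking to 650 GB (166400 objects) costs one delete, not ~268 million.
deleted = omap.shrink(166400)
print(deleted)
```

The single-writer restriction shows up here too: if a second client
wrote an object without updating this map, shrink() would silently leave
that object behind.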

Josh

On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:

Can't this be done in parallel? If the OSD doesn't have an object then
it is a noop and should be pretty quick. The number of outstanding
operations can be limited to 100 or 1000, which would provide a
balance between speed and performance impact if there is data to be
trimmed. I'm not a big fan of a "--skip-trimming" option as there is
the potential to leave some orphan objects that may not be cleaned up
correctly.

Yeah, a --skip-trimming option seems a bit dangerous. This trimming

actually is parallelized (10 ops at once by default, changeable via

--rbd-concurrent-management-ops) since dumpling.
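
As a rough illustration of trimming with a bounded number of operations
in flight, here is a small Python sketch. It does not reflect how librbd
issues its requests; delete_object and the in-memory store are
hypothetical stand-ins, and only the default of 10 concurrent ops comes
from this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_object(name, store):
    """Hypothetical stand-in for one delete request to an OSD.

    Deleting a name that does not exist is a cheap no-op.
    """
    return store.pop(name, None) is not None


def trim(names, store, concurrent_ops=10):
    """Delete every candidate name, running at most concurrent_ops at once."""
    with ThreadPoolExecutor(max_workers=concurrent_ops) as pool:
        results = list(pool.map(lambda n: delete_object(n, store), names))
    return sum(results)  # how many objects actually existed


# Only 3 of the 1000 candidate objects were ever written.
store = {"img.%012x" % i: b"data" for i in (1, 5, 9)}
candidates = ["img.%012x" % i for i in range(1000)]
removed = trim(candidates, store)
print(removed)
```

Raising concurrent_ops shortens the wall-clock time but still sends one
request per candidate object, which is why parallelism alone does not
rescue a shrink across billions of objects.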

What will really help without being dangerous is keeping a map of

object existence [1]. This will avoid any unnecessary trimming

automatically, and it should be possible to add to existing images.

It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700

On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3kaj@xxxxxxxxx> wrote:

On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:

When you shrink an RBD, most of the time is spent in
librbd/internal.cc::trim_image(). In this function the client iterates
over all of the now-unneeded objects (whether or not they exist) and
deletes them.

So in this case, when Edwin shrinks his RBD from 650PB to 650GB, there
are [ (650PB * 1024 * 1024 GB/PB - 650GB) * 1024MB/GB ] / 4MB/Object =
174,482,880,000 objects that need to be deleted. That will definitely
take a long time, since the rbd client needs to send a delete request to
the OSD for each one, and the OSD needs to look up the object context
and delete the object (or find that it doesn't exist at all). The time
needed to trim an image is proportional to the size being trimmed.
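
The object count can be checked with a few lines of Python, taking
1 PB = 1024 * 1024 GB and the default 4 MB object size. The delete rate
used for the time estimate is purely an assumption for illustration, not
a measurement from this cluster.

```python
GB_PER_PB = 1024 * 1024
MB_PER_GB = 1024
OBJECT_MB = 4  # order 22, i.e. 4 MB objects

old_gb = 650 * GB_PER_PB  # the accidental 650 PB
new_gb = 650              # the intended 650 GB

# Number of object slots between the new end and the old end of the image.
objects = (old_gb - new_gb) * MB_PER_GB // OBJECT_MB
print(objects)

# Assumed (hypothetical) sustained rate of delete requests per second.
rate = 10_000
days = objects / rate / 86400
print(round(days))
```

Even at that optimistic assumed rate the trim runs for months, which
matches the "stuck on 1%" behaviour reported at the start of the thread.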

Making another image of the correct size, copying your VM's file system
to the new image, and then deleting the old one will NOT help in
general, because deleting the old volume takes exactly as long as
shrinking it: both need to call trim_image().

One solution I have in mind is a "--skip-trimming" flag that skips the
trimming. When the administrator is absolutely sure that no writes have
taken place in the area being shrunk away (meaning no objects were
created there), they can use this flag to skip the time-consuming
trimming.

What do you think?

That sounds like a good solution. Like doing "undo grow image"

From: Jake Young [mailto:jak3kaj@xxxxxxxxx]

Sent: Monday, January 5, 2015 9:45 PM

To: Chen, Xiaoxi

Cc: Edwin Peer; ceph-users@xxxxxxxxxxxxxx

Subject: Re:  rbd resize (shrink) taking forever and a day

On Sunday, January 4, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:

You can use rbd info <volume_name> to see the block_name_prefix. Object
names have the form <block_name_prefix>.<sequence_number>, so, for
example, rb.0.ff53.3d1b58ba.00000000e6ad should be object number 0xe6ad
of the volume with block_name_prefix rb.0.ff53.3d1b58ba.

       $ rbd info huge

          rbd image 'huge':

           size 1024 TB in 268435456 objects

           order 22 (4096 kB objects)

           block_name_prefix: rb.0.8a14.2ae8944a

           format: 1
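
As a small sketch of the naming scheme, the following Python splits an
object name into its block_name_prefix and its hexadecimal sequence
number, then groups a listing by prefix. The function name
parse_object_name is made up for this example; the sample names are the
ones quoted elsewhere in this thread.

```python
from collections import Counter


def parse_object_name(name):
    """Split '<block_name_prefix>.<hex sequence>' into (prefix, index)."""
    prefix, _, seq = name.rpartition(".")
    return prefix, int(seq, 16)


prefix, index = parse_object_name("rb.0.ff53.3d1b58ba.00000000e6ad")
print(prefix, index)

# Group a 'rados ls' style listing by image prefix.
listing = [
    "rb.0.ff53.3d1b58ba.00000000e6ad",
    "rb.0.6d386.1d545c4d.000000011461",
    "rb.0.50703.3804823e.000000001c28",
    "rb.0.1073e.3d1b58ba.00000000b715",
    "rb.0.1d76.2ae8944a.00000000022d",
]
per_image = Counter(parse_object_name(n)[0] for n in listing)
print(per_image)
```

Matching each prefix against the block_name_prefix reported by rbd info
for every image in the pool tells you which image an object belongs to.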

-----Original Message-----

From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of

Edwin Peer

Sent: Monday, January 5, 2015 3:55 AM

To: ceph-users@xxxxxxxxxxxxxx

Subject: Re:  rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?

<snip>

ganymede ~ # rados -p client-disk-img0 ls | wc -l

1672636

</snip>

And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.00000000e6ad

rb.0.6d386.1d545c4d.000000011461

rb.0.50703.3804823e.000000001c28

rb.0.1073e.3d1b58ba.00000000b715

rb.0.1d76.2ae8944a.00000000022d

which seem to bear no resemblance to the actual image names that the rbd
command line tools understand?

Regards,

Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:

On Sunday, January 4, 2015, Dyweni - Ceph-Users <6EXbab4FYk8H@xxxxxxxxxx> wrote:

      Hi,

      If it's the only thing in your pool, you could try deleting the
      pool instead.

      I found that to be faster in my testing; I had created 500TB when
      I meant to create 500GB.

      Note for the devs: it would be nice if rbd create/resize would
      accept sizes with units (e.g. MB, GB, TB, PB).
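
To show what such unit handling might look like, here is a minimal
Python sketch that converts a size with an explicit unit into the
megabyte count that rbd resize --size takes in this thread. The function
size_to_mb is hypothetical and is not part of the rbd tool; it refuses
bare numbers on purpose, since an implied unit is what caused the
mistake.

```python
import re

# Multipliers relative to one megabyte (binary units).
_UNITS_MB = {"M": 1, "G": 1024, "T": 1024**2, "P": 1024**3}


def size_to_mb(text):
    """Convert a string such as '650GB' or '500 tb' to megabytes."""
    m = re.fullmatch(r"\s*(\d+)\s*([MGTP])B?\s*", text, re.IGNORECASE)
    if m is None:
        raise ValueError(f"size needs an explicit unit: {text!r}")
    return int(m.group(1)) * _UNITS_MB[m.group(2).upper()]


print(size_to_mb("650GB"))  # the value passed to --size for 650 GB
print(size_to_mb("650PB"))  # what the image was accidentally grown to
```

The first value, 665600, is the same number that appears in the
rbd resize --size 665600 --allow-shrink command quoted in this thread.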

      On 2015-01-04 08:45, Edwin Peer wrote:

          Hi there,

          I did something stupid while growing an rbd image. I
          accidentally mistook the units of the resize command for bytes
          instead of megabytes and grew an rbd image to 650PB instead of
          650GB. This all happened instantaneously enough, but trying to
          rectify the mistake is not going nearly as well.

          <snip>

          ganymede ~ # rbd resize --size 665600 --allow-shrink client-disk-img0/vol-x318644f-0

          Resizing image: 1% complete...

          </snip>

          It took a couple of days before it started showing 1% complete
          and has been stuck on 1% for a couple more. At this rate, I
          should be able to shrink the image back to the intended size
          in about 2016.

          Any ideas?

          Regards,

          Edwin Peer

          _______________________________________________

          ceph-users mailing list

          ceph-users@xxxxxxxxxxxxxx

          http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

You can just delete the rbd header. See Sebastien's excellent blog:

http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/

Jake

Sorry, I misunderstood.

The simplest approach to me is to make another image of the correct size
and copy your VM's file system to the new image, then delete the old
one.

The safest thing to do would be to mount the new file system from the VM
and do all the formatting / copying from there (the same way you'd move
a physical server's root disk to a new physical disk).

I would not attempt to hack the rbd header. You open yourself up to some

unforeseen problems.

That is, unless one of the Ceph developers can confirm that there is a
safe way to shrink an image, assuming we know that the file system has
not grown since the disk was grown.

Jake
