Re: Reply: Re: rbd resize (shrink) taking forever and a day

Sage Weil <sage@xxxxxxxxxxxx> · Wed, 7 Jan 2015 08:33:37 -0800 (PST)

On Tue, 6 Jan 2015, Chen, Xiaoxi wrote:
> It is already done in parallel; the outstanding ops are limited to ~10 per
> client (tunable), and raising this limit may help.
>
> But please note that there is no noop here: the OSD has no idea whether it has
> an object until it fails to find it on disk, which means the op has already
> traveled almost the whole code path.
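
A minimal sketch of the idea discussed above: issuing deletes with a bounded number of outstanding ops. This is not librbd code. `trim_objects`, `delete_fn`, and `max_in_flight` are hypothetical names, and `delete_fn` stands in for whatever sends one delete request and reports whether the object existed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def trim_objects(delete_fn, first_obj, last_obj, max_in_flight=10):
    """Call delete_fn(obj_no) for obj_no in [first_obj, last_obj).

    At most max_in_flight calls are outstanding at any time.
    Returns how many calls reported that the object existed.
    delete_fn is a hypothetical callable, not a real librbd API.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    deleted = 0

    def on_done(future):
        nonlocal deleted
        try:
            if future.result():
                with lock:
                    deleted += 1
        finally:
            # Free the slot so the submitting loop can issue the next op.
            slots.release()

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for obj_no in range(first_obj, last_obj):
            slots.acquire()  # blocks while max_in_flight ops are pending
            pool.submit(delete_fn, obj_no).add_done_callback(on_done)
    return deleted
```

Raising `max_in_flight` shortens the wall-clock time, but every op still costs a full round trip whether or not the object exists.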

Also keep in mind that the new object map stuff we're about to merge for 
hammer makes this problem go away.  From hammer onwards we'll know which 
objects exist and will only try to delete (or export, or clone, or 
read) ones that exist.
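
As a rough illustration only (the real object map format is not described in this thread), the effect can be sketched with a hypothetical in-memory list of booleans, one per object number. `objects_to_trim` and `object_map` are made-up names for this sketch.

```python
def objects_to_trim(object_map, new_object_count):
    """Yield the object numbers at or past new_object_count that exist.

    object_map is a hypothetical sequence of booleans indexed by object
    number; True means the object is known to exist.
    """
    for obj_no in range(new_object_count, len(object_map)):
        if object_map[obj_no]:
            yield obj_no


# Six-object image shrunk to two objects: only objects 2 and 5 need a delete.
print(list(objects_to_trim([True, False, True, False, False, True], 2)))
```

The client still walks the map, but that is a local lookup; delete requests go out only for objects that exist.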

sage

> 
> ---- Robert LeBlanc wrote ----
> 
> > Can't this be done in parallel? If the OSD doesn't have an object then
> > it is a noop and should be pretty quick. The number of outstanding
> > operations can be limited to 100 or 1000, which would provide a
> > balance between speed and performance impact if there is data to be
> > trimmed. I'm not a big fan of a "--skip-trimming" option as there is
> > the potential to leave some orphan objects that may not be cleaned up
> > correctly.
> > 
> > On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3kaj@xxxxxxxxx> wrote:
> > >
> > >
> > > On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:
> > >>
> > >> When you shrink an RBD image, most of the time is spent in
> > >> librbd/internal.cc::trim_image(). In this function, the client iterates over
> > >> all the objects that are no longer needed (whether or not they exist) and
> > >> deletes them.
> > >>
> > >>
> > >>
> > >> So in this case, when Edwin shrinks his RBD image from 650PB to 650GB,
> > >> there are [(650PB * 1024 * 1024 GB/PB - 650GB) * 1024MB/GB] / 4MB/object =
> > >> 174,482,880,000 objects to delete. That will definitely take a long time,
> > >> since the rbd client needs to send a delete request to the OSD for each one,
> > >> and the OSD needs to look up the object context and delete the object (or
> > >> find that it doesn't exist at all). The time needed to trim an image is
> > >> proportional to the size being trimmed.
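
The object count can be worked out with a short sketch, assuming binary units (1 PB = 1024 * 1024 GB) and the 4 MB object size (order 22) shown elsewhere in this thread. `objects_to_delete` is a hypothetical helper, not part of any Ceph tool.

```python
GIB = 2**30
PIB = 2**50
OBJECT_SIZE = 4 * 2**20  # 4 MB objects (order 22)


def objects_to_delete(old_size, new_size, object_size=OBJECT_SIZE):
    """Number of objects between the new and old end of the image.

    Sizes are in bytes; partial objects are rounded up.
    """
    old_objects = -(-old_size // object_size)  # ceiling division
    new_objects = -(-new_size // object_size)
    return max(old_objects - new_objects, 0)


print(objects_to_delete(650 * PIB, 650 * GIB))  # 174482880000
```

This is consistent with the `rbd info` output quoted in this thread, where 1024 TB corresponds to 268435456 objects.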
> > >>
> > >>
> > >>
> > >> Making another image of the correct size, copying your VM's file system to
> > >> the new image, and then deleting the old one will NOT help in general,
> > >> because deleting the old volume takes exactly as long as shrinking it:
> > >> both need to call trim_image().
> > >>
> > >>
> > >>
> > >> One solution I have in mind is to provide a "--skip-trimming" flag to
> > >> skip the trimming. When the administrator is absolutely sure that no
> > >> writes have taken place in the area being removed (that is, no objects were
> > >> created in that area), they can use this flag to skip the time-consuming
> > >> trimming.
> > >>
> > >>
> > >>
> > >> What do you think?
> > >
> > >
> > > That sounds like a good solution. Like doing "undo grow image"
> > >
> > >
> > >>
> > >>
> > >> From: Jake Young [mailto:jak3kaj@xxxxxxxxx]
> > >> Sent: Monday, January 5, 2015 9:45 PM
> > >> To: Chen, Xiaoxi
> > >> Cc: Edwin Peer; ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Re:  rbd resize (shrink) taking forever and a day
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Sunday, January 4, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:
> > >>
> > >> You can use rbd info <volume_name> to see the block_name_prefix. Object
> > >> names have the form <block_name_prefix>.<sequence_number>, so for
> > >> example, rb.0.ff53.3d1b58ba.00000000e6ad is object number 0xe6ad of
> > >> the volume with block_name_prefix rb.0.ff53.3d1b58ba.
> > >>
> > >>      $ rbd info huge
> > >>         rbd image 'huge':
> > >>          size 1024 TB in 268435456 objects
> > >>          order 22 (4096 kB objects)
> > >>          block_name_prefix: rb.0.8a14.2ae8944a
> > >>          format: 1
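
A small sketch of the naming scheme described above, assuming the sequence number is hexadecimal (as in the example name) and that objects are 4 MB (order 22). The helper names are hypothetical.

```python
def parse_rbd_object_name(name):
    """Split '<block_name_prefix>.<sequence_number>' into (prefix, number).

    Assumes the sequence number is the hexadecimal field after the last dot.
    """
    prefix, _, seq_hex = name.rpartition(".")
    return prefix, int(seq_hex, 16)


def object_byte_offset(name, order=22):
    """Byte offset in the image where this object starts (2**order bytes each)."""
    _, seq = parse_rbd_object_name(name)
    return seq << order


prefix, seq = parse_rbd_object_name("rb.0.ff53.3d1b58ba.00000000e6ad")
print(prefix, seq)  # rb.0.ff53.3d1b58ba 59053
```

Matching the prefix against the block_name_prefix reported by rbd info tells you which image an object in the pool belongs to.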
> > >>
> > >> -----Original Message-----
> > >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> > >> Edwin Peer
> > >> Sent: Monday, January 5, 2015 3:55 AM
> > >> To: ceph-users@xxxxxxxxxxxxxx
> > >> Subject: Re:  rbd resize (shrink) taking forever and a day
> > >>
> > >> Also, which rbd objects are of interest?
> > >>
> > >> <snip>
> > >> ganymede ~ # rados -p client-disk-img0 ls | wc -l
> > >> 1672636
> > >> </snip>
> > >>
> > >> And, all of them have cryptic names like:
> > >>
> > >> rb.0.ff53.3d1b58ba.00000000e6ad
> > >> rb.0.6d386.1d545c4d.000000011461
> > >> rb.0.50703.3804823e.000000001c28
> > >> rb.0.1073e.3d1b58ba.00000000b715
> > >> rb.0.1d76.2ae8944a.00000000022d
> > >>
> > >> which seem to bear no resemblance to the actual image names that the rbd
> > >> command line tools understand?
> > >>
> > >> Regards,
> > >> Edwin Peer
> > >>
> > >> On 01/04/2015 08:48 PM, Jake Young wrote:
> > >> >
> > >> >
> > >> > On Sunday, January 4, 2015, Dyweni - Ceph-Users
> > >> > <6EXbab4FYk8H@xxxxxxxxxx> wrote:
> > >> >
> > >> >     Hi,
> > >> >
> > >> >     If it's the only thing in your pool, you could try deleting the
> > >> >     pool instead.
> > >> >
> > >> >     I found that to be faster in my testing; I had created 500TB when
> > >> >     I meant to create 500GB.
> > >> >
> > >> >     Note for the devs: it would be nice if rbd create/resize would
> > >> >     accept sizes with units (e.g. MB, GB, TB, PB).
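
Until the tool accepts units, a wrapper can do the conversion before calling rbd. The sketch below assumes binary multiples and that the size argument is in megabytes, as described in this thread. `size_to_mb` is a hypothetical helper.

```python
import re

# Multipliers to megabytes, assuming binary units (1 GB = 1024 MB).
_UNITS_IN_MB = {"MB": 1, "GB": 1024, "TB": 1024**2, "PB": 1024**3}


def size_to_mb(text):
    """Convert a string such as '650GB' to a whole number of megabytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(MB|GB|TB|PB)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS_IN_MB[match.group(2).upper()]


print(size_to_mb("650GB"))  # 665600
```

The result for 650GB matches the 665600 passed to --size in the resize command quoted in this thread.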
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >     On 2015-01-04 08:45, Edwin Peer wrote:
> > >> >
> > >> >         Hi there,
> > >> >
> > >> >         I did something stupid while growing an rbd image. I accidentally
> > >> >         mistook the units of the resize command for bytes instead of
> > >> >         megabytes and grew an rbd image to 650PB instead of 650GB. This all
> > >> >         happened instantaneously enough, but trying to rectify the mistake is
> > >> >         not going nearly as well.
> > >> >
> > >> >         <snip>
> > >> >         ganymede ~ # rbd resize --size 665600 --allow-shrink
> > >> >         client-disk-img0/vol-x318644f-0
> > >> >         Resizing image: 1% complete...
> > >> >         </snip>
> > >> >
> > >> >         It took a couple days before it started showing 1% complete
> > >> >         and has been stuck on 1% for a couple more. At this rate, I should be
> > >> >         able to shrink the image back to the intended size in about 2016.
> > >> >
> > >> >         Any ideas?
> > >> >
> > >> >         Regards,
> > >> >         Edwin Peer
> > >> >         _______________________________________________
> > >> >         ceph-users mailing list
> > >> >         ceph-users@xxxxxxxxxxxxxx
> > >> >         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >> >
> > >> >
> > >> >
> > >> > You can just delete the rbd header. See Sebastien's excellent blog:
> > >> >
> > >> > http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/
> > >> >
> > >> > Jake
> > >> >
> > >> >
> > >>
> > >>
> > >>
> > >>
> > >> Sorry, I misunderstood.
> > >>
> > >>
> > >>
> > >> The simplest approach to me is to make another image of the correct size
> > >> and copy your VM's file system to the new image, then delete the old one.
> > >>
> > >>
> > >>
> > >> The safest thing to do would be to mount the new file system from the VM
> > >> and do all the formatting / copying from there (the same way you'd move a
> > >> physical server's root disk to a new physical disk)
> > >>
> > >>
> > >>
> > >> I would not attempt to hack the rbd header. You open yourself up to some
> > >> unforeseen problems.
> > >>
> > >>
> > >>
> > >> That is, unless one of the Ceph developers can confirm there is a safe way
> > >> to shrink an image, assuming we know that the file system has not grown
> > >> since the disk was grown.
> > >>
> > >>
> > >>
> > >> Jake
> > >
> > >
> > >
> 