The bitmap certainly sounds like it would help shortcut a lot of the code that Xiaoxi mentions. Is the idea that the client caches the bitmap for the RBD so it knows which OSDs to contact (thus saving a round trip to the OSD), or only for the OSD to know which objects exist on its disk?

On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
> On 01/06/2015 10:24 AM, Robert LeBlanc wrote:
>>
>> Can't this be done in parallel? If the OSD doesn't have an object then it is a noop and should be pretty quick. The number of outstanding operations could be limited to 100 or 1000, which would provide a balance between speed and performance impact if there is data to be trimmed. I'm not a big fan of a "--skip-trimming" option, as there is the potential to leave some orphan objects that may not be cleaned up correctly.
>
> Yeah, a --skip-trimming option seems a bit dangerous. This trimming actually has been parallelized (10 ops at once by default, changeable via --rbd-concurrent-management-ops) since dumpling.
>
> What will really help without being dangerous is keeping a map of object existence [1]. This will avoid any unnecessary trimming automatically, and it should be possible to add to existing images. It should be in hammer.
>
> Josh
>
> [1] https://github.com/ceph/ceph/pull/2700
>
>> On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3kaj@xxxxxxxxx> wrote:
>>>
>>> On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:
>>>>
>>>> When you shrink an RBD, most of the time is spent in librbd/internal.cc::trim_image(). In this function, the client iterates over all unnecessary objects (no matter whether they exist) and deletes them.
>>>>
>>>> So in this case, when Edwin shrinks his RBD from 650PB to 650GB, there are [ (650PB * 1024 * 1024 GB/PB - 650GB) * 1024MB/GB ] / 4MB/object, roughly 174,482,880,000, objects that need to be deleted. That will definitely take a long time, since the rbd client needs to send a delete request for every object and the OSD needs to look up the object context and delete it (or find that it doesn't exist at all). The time needed to trim an image is proportional to the size being trimmed.
>>>>
>>>> Making another image of the correct size, copying your VM's file system to the new image, and then deleting the old one will NOT help in general, because deleting the old volume takes exactly the same time as shrinking: both need to call trim_image().
>>>>
>>>> The solution in my mind is that we could provide a "--skip-trimming" flag to skip the trimming. When the administrator is absolutely sure that no writes have taken place in the area being shrunk away (that is, no objects were created there), they can use this flag to skip the time-consuming trimming.
>>>>
>>>> What do you think?
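(As a sanity check of Xiaoxi's arithmetic, assuming the default 4 MB object size and binary unit conversions, with all sizes expressed in MB first; this is only back-of-the-envelope shell arithmetic, not output from any cluster:

$ echo $(( (650 * 1024 * 1024 * 1024 - 650 * 1024) / 4 ))
174482880000

so the shrink from 650PB back to 650GB has on the order of 10^11 object deletions to issue.)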
Like doing "undo grow image" >>> >>> >>>> >>>> >>>> From: Jake Young [mailto:jak3kaj@xxxxxxxxx] >>>> Sent: Monday, January 5, 2015 9:45 PM >>>> To: Chen, Xiaoxi >>>> Cc: Edwin Peer; ceph-users@xxxxxxxxxxxxxx >>>> Subject: Re: rbd resize (shrink) taking forever and a day >>>> >>>> >>>> >>>> >>>> >>>> On Sunday, January 4, 2015, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote: >>>> >>>> You could use rbd info <volume_name> to see the block_name_prefix, the >>>> object name consist like <block_name_prefix>.<sequence_number>, so for >>>> example, rb.0.ff53.3d1b58ba.00000000e6ad should be the <e6ad>th object >>>> of >>>> the volume with block_name_prefix rb.0.ff53.3d1b58ba. >>>> >>>> $ rbd info huge >>>> rbd image 'huge': >>>> size 1024 TB in 268435456 objects >>>> order 22 (4096 kB objects) >>>> block_name_prefix: rb.0.8a14.2ae8944a >>>> format: 1 >>>> >>>> -----Original Message----- >>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of >>>> Edwin Peer >>>> Sent: Monday, January 5, 2015 3:55 AM >>>> To: ceph-users@xxxxxxxxxxxxxx >>>> Subject: Re: rbd resize (shrink) taking forever and a day >>>> >>>> Also, which rbd objects are of interest? >>>> >>>> <snip> >>>> ganymede ~ # rados -p client-disk-img0 ls | wc -l >>>> 1672636 >>>> </snip> >>>> >>>> And, all of them have cryptic names like: >>>> >>>> rb.0.ff53.3d1b58ba.00000000e6ad >>>> rb.0.6d386.1d545c4d.000000011461 >>>> rb.0.50703.3804823e.000000001c28 >>>> rb.0.1073e.3d1b58ba.00000000b715 >>>> rb.0.1d76.2ae8944a.00000000022d >>>> >>>> which seem to bear no resemblance to the actual image names that the rbd >>>> command line tools understands? >>>> >>>> Regards, >>>> Edwin Peer >>>> >>>> On 01/04/2015 08:48 PM, Jake Young wrote: >>>>> >>>>> >>>>> >>>>> On Sunday, January 4, 2015, Dyweni - Ceph-Users >>>>> <6EXbab4FYk8H@xxxxxxxxxx <mailto:6EXbab4FYk8H@xxxxxxxxxx>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> If its the only think in your pool, you could try deleting the >>>>> pool instead. >>>>> >>>>> I found that to be faster in my testing; I had created 500TB when >>>>> I meant to create 500GB. >>>>> >>>>> Note for the Devs: I would be nice if rbd create/resize would >>>>> accept sizes with units (i.e. MB GB TB PB, etc). >>>>> >>>>> >>>>> >>>>> >>>>> On 2015-01-04 08:45, Edwin Peer wrote: >>>>> >>>>> Hi there, >>>>> >>>>> I did something stupid while growing an rbd image. I >>>>> accidentally >>>>> mistook the units of the resize command for bytes instead of >>>>> megabytes >>>>> and grew an rbd image to 650PB instead of 650GB. This all >>>>> happened >>>>> instantaneously enough, but trying to rectify the mistake is >>>>> not going >>>>> nearly as well. >>>>> >>>>> <snip> >>>>> ganymede ~ # rbd resize --size 665600 --allow-shrink >>>>> client-disk-img0/vol-x318644f-0 >>>>> Resizing image: 1% complete... >>>>> </snip> >>>>> >>>>> It took a couple days before it started showing 1% complete >>>>> and has >>>>> been stuck on 1% for a couple more. At this rate, I should be >>>>> able to >>>>> shrink the image back to the intended size in about 2016. >>>>> >>>>> Any ideas? >>>>> >>>>> Regards, >>>>> Edwin Peer >>>>> _______________________________________________ >>>>> ceph-users mailing list >>>>> ceph-users@xxxxxxxxxxxxxx >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>>> _______________________________________________ >>>>> ceph-users mailing list >>>>> ceph-users@xxxxxxxxxxxxxx >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>> >>>>> >>>>> You can just delete the rbd header. 
>>>>> You can just delete the rbd header. See Sebastien's excellent blog:
>>>>>
>>>>> http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/
>>>>>
>>>>> Jake
>>>>
>>>> Sorry, I misunderstood.
>>>>
>>>> The simplest approach to me is to make another image of the correct size and copy your VM's file system to the new image, then delete the old one.
>>>>
>>>> The safest thing to do would be to mount the new file system from the VM and do all the formatting / copying from there (the same way you'd move a physical server's root disk to a new physical disk).
>>>>
>>>> I would not attempt to hack the rbd header. You open yourself up to some unforeseen problems.
>>>>
>>>> Unless one of the ceph developers can confirm that there is a safe way to shrink an image, assuming we know that the file system has not grown since the disk was grown.
>>>>
>>>> Jake

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
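(For completeness, a rough sketch of the copy-to-a-new-image approach Jake describes, using the pool from the thread and a hypothetical new image name; how the new disk gets attached to the VM depends on your hypervisor, and deleting the old 650PB image will itself have to trim, so expect that last step to take just as long as the shrink:

$ rbd create --size 665600 client-disk-img0/vol-x318644f-0-new
# attach the new disk to the VM; inside the VM, partition and format it, mount it,
# and copy the root file system across (e.g. with rsync), then repoint the VM at
# the new disk and confirm it boots before removing the old image:
$ rbd rm client-disk-img0/vol-x318644f-0

The object existence map Josh mentions [1] is aimed at avoiding exactly this kind of unnecessary trimming automatically.)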