I'll concede that I cannot duplicate this in Jewel. When I was seeing this, it was in Hammer, and I could duplicate it every time with empty RBDs, RBDs filled from /dev/zero, and RBDs filled from /dev/random. I could duplicate an n^2 time difference in `time rbd rm test`. We mapped it all the way from 1GB RBDs to 1TB RBDs with 1MB objects, and we never found an anomaly. We even mapped it with same-size RBDs using a different order to change the number of objects each RBD had, and it followed the exact same scale.
I'll stop saying it now as it seems to have been addressed.
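For anyone who wants to check this kind of claim against their own timings, here is a minimal sketch of one way to do it: fit a line to log(time) against log(size) and read off the slope. The function name `scaling_exponent` is made up for illustration, and the only data used is the pair of deletes quoted below (1 TB in about 1 hour, 4 TB in about 16 hours). Two points always fit a line exactly, so a real test would need several sizes and repeated runs.

```python
import math

def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    A slope near 1 suggests linear scaling; a slope near 2 suggests quadratic.
    """
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least squares for a single predictor.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    if den == 0:
        raise ValueError("sizes must not all be equal")
    return num / den

# Sizes in TB, times in hours, taken from the message quoted below.
print(round(scaling_exponent([1, 4], [1, 16]), 2))  # 2.0
```

A slope of 2 from wall-clock timings only describes what was observed on that cluster. It does not by itself show that the delete path is quadratic, since contention from other load could produce the same curve.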
On Fri, Jun 30, 2017 at 5:14 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Fri, Jun 30, 2017 at 2:07 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
> That comes from using Ceph. I've just done lots of deleting of large
> amounts of data and paid attention to how long things took to delete. If
> you don't believe me, the message you responded to has steps to duplicate it. I
> haven't asked a Ceph dev or bothered to look through the code, but every
> time I delete something it seems to match the pattern. A very prominent
> time comes to mind where a 1TB volume took about 1 hour to delete and a 4 TB
> volume took about 16 hours to delete. It was incredibly obnoxious as I had
> almost 1PB of RBDs to clean up and every single one matched this pattern of
> n^2.
>
> My particular experience with the snap_trim_q comes from optimizing a
> cluster that deletes over 6k snapshots every day. I watched how long it
> took for the snap_trim_q to drop after we worked with the Ceph devs to fix a
> problem in our cluster where a snap trim op got into a bad state and wasn't
> able to clean itself up. This has been fixed in the code and backported
> now. But every time one of the PGs with the problem got to that operation,
> it segfaulted the OSD. We had to run with snap trimming disabled on half a
> dozen OSDs for over a month, and then I graphed the progress of the OSD
> cleaning up its snap_trim_q. It got exponentially faster as it cleared
> its queue.
>
> I did bring it up with the devs and other sysadmins at my company and none
> of us could even think of how to write an n^2 delete function, let alone why
> it would be necessary, but who were we to judge... We assumed that Ceph
> deleted things in a very specific way because it needed to, so we just
> moved on.
I'm sure you've seen slow deletes, but there's nothing inherent to the
system that makes it an n-squared operation. Far more likely you're
simply running into PG contention and the long tails that tend to crop
up in unexpected places in distributed systems. Stuff getting faster
as you run repeated snapshot trimming isn't surprising either, as it
can pull more of the snapshot metadata/data into cache and keep it
around. (Also, there's less extraneous metadata it needs to scan
through in leveldb to get to the real work — that's the only thing
that can even approximate a non-linear operation, and even it would
depend a great deal on exactly what other kinds of load you're
applying and the exact state of the OSD's local leveldb instance.)
-Greg
>
> On Fri, Jun 30, 2017 at 4:48 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>
>> On Fri, Jun 30, 2017 at 1:24 PM, David Turner <drakonstein@xxxxxxxxx>
>> wrote:
>> > When you delete a snapshot, Ceph places the removed snapshot into a list
>> > in
>> > the OSD map and places the objects in the snapshot into a snap_trim_q.
>> > Once
>> > those 2 things are done, the RBD command returns and you are moving onto
>> > the
>> > next snapshot. The snap_trim_q is an n^2 operation (like all deletes in
>> > Ceph), which means that if the queue has 100 objects on it and takes 5
>> > minutes to complete, then having 200 objects in the queue will take 20
>> > minutes.
>>
>> You keep saying deletes are an n-squared operation but I don't really
>> have any idea where that's coming from. Could you please elaborate? :)
>>
>> > (exaggerated time frames to show math) This same behavior can be
>> > seen when deleting an RBD that has 100,000 objects vs 200,000 objects,
>> > it
>> > takes twice as long (note that object map mitigates this greatly by
>> > ignoring
>> > any object that hasn't been created, so the previous test would be
>> > easiest
>> > to duplicate by disabling the object map on the test RBDs).
>> >
>> > So paying attention to snapshot sizes as you clean them up is more
>> > important
>> > than how many snapshots you clean up. Being on Jewel, you don't really
>> > want
>> > to use osd_snap_trim_sleep as it literally puts a sleep onto the main op
>> > threads for the OSD. In Hammer this setting was much more useful (if
>> > not
>> > super hacky) and in Luminous the entire process was revamped and
>> > (hopefully)
>> > fixed. Jewel is pretty much not viable for large quantities of
>> > snapshots,
>> > but there are ways to get through them.
>> >
>> > The following thread on the ML is one of the most informative on this
>> > problem in Jewel. The second link is the resuming of the thread months
>> > later after the fix was scheduled for backporting into 10.2.8.
>> >
>> >
>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015675.html
>> >
>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017697.html
>> >
>> > On Fri, Jun 30, 2017 at 4:02 PM Kenneth Van Alstyne
>> > <kvanalstyne@xxxxxxxxxxxxxxx> wrote:
>> >>
>> >> Hey folks:
>> >> I was wondering if the community can provide any advice — over
>> >> time and due to some external issues, we have managed to accumulate
>> >> thousands of snapshots of RBD images, which are now in need of cleaning
>> >> up.
>> >> I have recently attempted to roll through a “for” loop to perform a
>> >> “rbd
>> >> snap rm” on each snapshot, sequentially, waiting until the rbd command
>> >> finishes before moving onto the next one, of course. I noticed that
>> >> shortly
>> >> after starting this, I started seeing thousands of slow ops and a few
>> >> of our
>> >> guest VMs became unresponsive, naturally.
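The sequential loop described above could be sketched roughly as follows. This is an illustration under assumptions, not a recommendation from the thread: the function name `remove_snapshots` and the `pause_seconds` knob are hypothetical, and the only CLI call used is `rbd snap rm` with a `pool/image@snap` spec. Because trimming continues in the background after each command returns, an idle gap between removals is one thing an operator might experiment with; whether it helps will depend on the cluster.

```python
import subprocess
import time

def remove_snapshots(snap_specs, pause_seconds=0.0,
                     run=subprocess.check_call, sleep=time.sleep):
    """Run `rbd snap rm` on each spec in order, waiting for each to return.

    snap_specs: strings of the form "pool/image@snap".
    pause_seconds: hypothetical knob; idle time between removals.
    run, sleep: injectable so the loop can be exercised without a cluster.
    Returns the list of specs that were removed.
    """
    specs = list(snap_specs)
    removed = []
    for i, spec in enumerate(specs):
        # check_call raises if rbd exits non-zero, which stops the loop.
        run(["rbd", "snap", "rm", spec])
        removed.append(spec)
        # No pause after the final snapshot.
        if pause_seconds > 0 and i < len(specs) - 1:
            sleep(pause_seconds)
    return removed
```

Passing a recording function as `run` gives a dry run that lists the commands without touching the cluster.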
>>
>> In addition to the thread David linked to, I gave a talk about
>> snapshot trimming and capacity planning which may be helpful:
>> https://www.youtube.com/watch?v=rY0OWtllkn8
>> If you read the whole thread I'm not sure there's any new data in that
>> talk, but it is hopefully a little more organized/understandable. :)
>> -Greg
>>
>> >>
>> >> My questions are:
>> >> - Is this expected behavior?
>> >> - Is the background cleanup asynchronous from the “rbd snap rm”
>> >> command?
>> >> - If so, are there any OSD parameters I can set to
>> >> reduce
>> >> the impact on production?
>> >> - Would “rbd snap purge” be any different? I expect not, since
>> >> fundamentally, rbd is performing the same action that I do via the
>> >> loop.
>> >>
>> >> Relevant details are as follows, though I’m not sure cluster size
>> >> *really*
>> >> has any effect here:
>> >> - Ceph: version 10.2.5
>> >> (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> >> - 5 storage nodes, each with:
>> >> - 10x 2TB 7200 RPM SATA Spindles (for a total of 50
>> >> OSDs)
>> >> - 2x Samsung MZ7LM240 SSDs (used as journal for the
>> >> OSDs)
>> >> - 64GB RAM
>> >> - 2x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
>> >> - 20GBit LACP Port Channel via Intel X520 Dual Port
>> >> 10GbE
>> >> NIC
>> >>
>> >> Let me know if I’ve missed something fundamental.
>> >>
>> >> Thanks,
>> >>
>> >> --
>> >> Kenneth Van Alstyne
>> >> Systems Architect
>> >> Knight Point Systems, LLC
>> >> Service-Disabled Veteran-Owned Business
>> >> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
>> >> c: 228-547-8045 f: 571-266-3106
>> >> www.knightpoint.com
>> >> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
>> >> GSA Schedule 70 SDVOSB: GS-35F-0646S
>> >> GSA MOBIS Schedule: GS-10F-0404Y
>> >> ISO 20000 / ISO 27001 / CMMI Level 3
>> >>
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com