Re: efficient removal of old objects

On Wed, Feb 1, 2012 at 10:53 AM, Yehuda Sadeh Weinraub
<yehudasa@xxxxxxxxx> wrote:
> On Wed, Feb 1, 2012 at 9:39 AM, Gregory Farnum
> <gregory.farnum@xxxxxxxxxxxxx> wrote:
>> You are dramatically overstating the impact of latency on an
>> inherently parallelizable and non-interactive operation. A couple disk
>> seeks *do not matter.*
>
> Do not matter to whom? It affects the overall osd performance, and
> given enough threads going on in parallel doing the cleanup, it
> *really* matters, and this is the basic issue.
You can bound the random lookups through the design of the intent log
itself, to the point that those lookups should not impact the OSDs
noticeably (a few hundred requests per day for once-a-day cleanups).
Once you have the list of objects to delete, you are going to run the
same sequence of operations either way; the only question is whether
they originate on the client or on the OSD. Assuming a large load (as
you are) and a non-trivial PG, the deletes are all going to have to go
to disk to find the inodes anyway. So the vast majority of the load
required for a client-side solution is identical to the load required
for a scrub-based solution.
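The once-a-day, client-side cleanup sketched above might look roughly like this toy model. Everything here is illustrative: the intent-log schema, the grace period, and the in-memory "store" standing in for RADOS delete ops are all assumptions, not radosgw's actual format.

```python
import time

# Hypothetical intent-log entry: (object_name, created_at).
# The dict "store" below stands in for the object store; a real client
# would issue RADOS delete ops instead.
GRACE = 24 * 3600  # expire intents older than one day (assumed policy)

def expired_intents(intent_log, now):
    """Return the names whose intent entries have outlived the grace period."""
    return [name for name, created in intent_log if now - created > GRACE]

def cleanup(store, intent_log, now=None):
    """One client-side cleanup pass: read the log once, then issue deletes.

    The same delete operations run whether they originate here or inside
    the OSD; only the trigger differs."""
    now = time.time() if now is None else now
    doomed = expired_intents(intent_log, now)
    for name in doomed:
        store.pop(name, None)  # stands in for a RADOS delete op
    # Trim the processed entries from the log.
    keep = set(n for n, _ in intent_log) - set(doomed)
    intent_log[:] = [(n, t) for n, t in intent_log if n in keep]
    return doomed
```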

>>> Note that setting expiration on an object is a more lightweight
>>> operation than appending the intent log, as it would be done as a sub
>>> op in the compound operation that created the object.
>>
>> ...you're going to set expirations on the objects when you write them?
>> What if the user's upload takes longer than you expect?
>
> You're a few months too late. Go back to the atomic get/put discussion.
I remember this discussion, but I thought we'd ended up setting
intent-to-delete when we did the final clone into place?

>>>> I think there are a couple questions:
>>>>
>>>> Should this be generalized to saying "do these osd ops at time X" instead
>>>> of "delete at time X".  Then it could setxattr, remove, call into a class,
>>>> whatever.
>>>
>>> While I think it'd make a nice feature, I also think that the problem
>>> space of a garbage collection is a bit different, and given the time
>>> constraints it wouldn't make sense implementing this right now anyway.
>>
>> This is client-side garbage collection, not RADOS garbage collection.
>> Don't confuse those issues, either — the second is appropriate to put
>
> No. This is a garbage collection utility that RADOS can provide.

Yes, it *can*, but that doesn't mean it *should*. Core OSD
functionality should be stuff that's widely-used by many clients, and
I can't think of any other client that's going to want time-based
garbage collection of this sort. Every other scenario I can think of
will just delete when they are done with the object, or else will want
more sophisticated checks than the amount of elapsed time. Which is
why I support the eventual addition of time-based class triggers, but
not an interface tailored exclusively for radosgw.
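For concreteness, a generalized "do these osd ops at time X" trigger (as opposed to a delete-only expiration hook) could be modeled as a priority queue of scheduled operations. This is a toy sketch, not a proposed OSD interface; all names are hypothetical.

```python
import heapq

class OpScheduler:
    """Toy model of a generalized time-based trigger: schedule arbitrary
    ops (delete, setxattr, a class call...) to fire at a given time,
    rather than baking in deletion as the only action."""

    def __init__(self):
        self._heap = []  # (fire_time, seq, op); seq breaks ties
        self._seq = 0

    def schedule(self, fire_time, op):
        """Register a zero-argument callable to run at fire_time."""
        heapq.heappush(self._heap, (fire_time, self._seq, op))
        self._seq += 1

    def run_due(self, now):
        """Fire every op whose time has arrived; return their results."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, _, op = heapq.heappop(self._heap)
            fired.append(op())
        return fired
```

The point of the shape is that deletion becomes just one kind of scheduled op among many, so the interface isn't tailored to a single client.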

>>>> How would the OSD implement this?  A kludgey way would be to do it during
>>>> scrub.  The current scrub implementation may make that problematic because
>>>> it does a whole PG at time, and we probably don't want to issue a whole
>>>> PG's worth of deletes at a time.  Is there a way to make that less
>>>> painful?
>>>
>>> If we need to lock the entire pg while removing the objects it wouldn't work.
>>
>> That's how scrub works right now...
>>
>>> I'm not too familiar with the scrub code, and I don't want to dive
>>> here into possible implementation details, but getting the scrub to
>>> generate a list of objects for removal may be possible.
>>
>> Sam and I tossed around a few ideas for how to do this, and it's not
>> impossible, but it was significantly more complicated than everybody
>> thinks it is at first glance. (You need to make sure that it doesn't
>> interact with recovery at all, which means it needs to go through the
>> normal request mechanism, which means you need to build up a queue of
>> deletes while scrubbing and then dispatch it properly without
>> disrupting client requests or running out of memory; you need to make
>> sure that scrubbing runs more reliably than it does right now...etc
>> etc)
>
> I think you're overstating the complexity. We're already disrupting
> client requests by running the cleanup externally. Leveraging scrub's
> throttling under system load is a strength, not a weakness. If
> anything, running the intent log cleanup blindly is the real issue.
> Add to that the fact that scrub walks over the objects anyway and
> warms up the caches, and the performance gain we'd get is much
> bigger.
I thought the whole reason this had suddenly become such an issue is
because not cleaning up the intent log stuff has a performance impact
on the cluster. Scrub *doesn't run* when the load is too high...which
means that by leveraging scrub you will get into a circle of death
where cleanup never occurs because the load is too high, which causes
the load to continue increasing...
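The queue-and-dispatch approach described in the quoted paragraph above (scrub discovers candidates, deletes drain through the normal request path without starving clients or growing unboundedly) can be sketched as follows. This is a toy model of the concern, not OSD code; the class, bounds, and `submit` callback are all hypothetical.

```python
from collections import deque

class ThrottledDeleteQueue:
    """Sketch: scrub enqueues candidate deletes; they drain a few per
    tick through the normal op path so client requests aren't disrupted.
    The hard bound models the "don't run out of memory" problem: once
    full, further candidates are deferred to a later scrub pass."""

    def __init__(self, max_pending, per_tick):
        self.pending = deque()
        self.max_pending = max_pending  # memory bound on queued deletes
        self.per_tick = per_tick        # throttle on dispatch rate
        self.deferred = 0               # scrub must re-discover these later

    def enqueue(self, obj):
        """Called from scrub; refuses work rather than growing unboundedly."""
        if len(self.pending) >= self.max_pending:
            self.deferred += 1
            return False
        self.pending.append(obj)
        return True

    def tick(self, submit):
        """Drain up to per_tick deletes through the normal request path."""
        done = 0
        while self.pending and done < self.per_tick:
            submit(self.pending.popleft())
            done += 1
        return done
```

Note what the sketch makes visible: any bounded queue forces a choice between re-running discovery (the `deferred` path) and unbounded memory, which is part of the complexity being argued about.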

>> The problem is that it changes the nature of scrub. Right now, scrub
>> doesn't change anything at all; scrub repair sets the replicas to have
>
> It doesn't change anything with scrub's nature if it's only used to
> generate the list of objects to remove (per pg).

If your contention is that doing delayed work immediately following a
scrub is a huge performance win, then we ought to be able to hang more
than deletes off of scrub. Adding deletes now with the intention of
expanding it later creates an interface and code maintenance nightmare
— we either maintain two parallel code tracks or else we have to
convert old-style delete requests to new-style interface requests.
Either way, eww! This is directly contrary to the work we're doing
with message encoding et al to work towards more stable interfaces.

>> the same state as the primary. You want to add a client-controlled
>> state mutation that is triggered as part of scrub, which *really*
>> makes it different (and complicated)...or else it's a hack to have
>> scrub trigger some weird sequences of requests (as I outlined above).
>> Either way, it's a big change to scrub that smells hacky.
>>
>> The basic issue here is that the RGW stuff can all be done as
>> client-side operations, and all you've demonstrated is that doing it
>> serially with a single client is slow (but not slower than the
>> generation of the objects, which means that it does actually work).
>> The correct response to that is not to add half-baked features to the
>> OSD; the correct response is to make your client behave well.
>
> The problem is not the client behaving well, but the impact that it
> has on the overall system performance due to random seeks.

Maybe you've presented data on this to somebody, but the group hasn't
seen it. Please do show and tell! And demonstrate that the performance
impact is inherent in a client-based solution, rather than in the way
it's currently implemented. :)

>> If we *do* want to add time-based triggers that clients can set up,
>> that ought to be a well-thought-out interface that isn't limited to a
>> single use-case. I'm totally fine with the idea, as long as it comes
>> at some point in the future when we aren't all working hard to
>> stabilize the core system.
>
> We may create that sometime in the future, and implement garbage
> collection using that. But you're failing to understand the point that
> using the scrub is just an implementation detail.
Hanging it off of scrub is not just an implementation detail — if you
do it without scrub, then the work becomes dramatically more complex
and the fact that it does deletes instead of arbitrary code execution
is just an implementation detail. My opposition is to both the
implementation and the interface (that we have to carry forever).

> I do think that we
> need object expiration in rados. This is not a single use case.
> I also think that using external client for that is a mistake
> (performance on one hand, but also adding administrative pain).

I can't find object expiration anywhere except in S3, and that
interface is very clearly about end-user ease of use rather than
pushing the expiration into the object store. The only use case they
can come up with is logs stored in objects, and expiration generates
explicit bucket access logs which makes it look to me like it's run as
a separate process using bucket scanning. *shrug*
-Greg
--

