Re: Better repodata performance

Jeff Pitman <symbiont@xxxxxxxxxx> · Tue, 1 Feb 2005 15:50:43 +0800

On Monday 31 January 2005 12:27, Alexandre Oliva wrote:
> On Jan 31, 2005, Jeff Pitman <symbiont@xxxxxxxxxx> wrote:
> > This could be driven by an optional parameter to createrepo, which
> > provides a list of packages to create a delta with.
>
> Err...  Why?  We already have repodata/, and we're creating the new
> version in .repodata.  We can use repodata/ however we like, I think.

Because the way I'd implement it would not use a binary diff such as 
xdelta.  See, you're thinking at the level of createrepo crunching on 
the entire thing over again.  I'm not.  I'm thinking about it in terms 
of a certain subset of packages, driven by a parameter.

From a gcc/make analogy viewpoint, you can view this as updating one or 
two specs and running make on the whole source tree.  Since you already 
have objects built, you rebuild the few you don't have, and relink at 
the very end. Here, the maintainer of the repo wins. Now, if we 
deferred the re-link to the user-end, then the download for the user 
would win, too.
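
To make the make analogy concrete, here is a rough sketch in Python of 
what I mean. None of this is createrepo code: the cache layout, the 
"stamp" (think mtime plus size) and the extract() callback are all 
assumptions made up for illustration.

```python
def incremental_update(cache, current, extract):
    """Rebuild metadata only for packages that are new or changed.

    cache:   {filename: (stamp, metadata)} left over from the previous run
    current: {filename: stamp} for the packages in the repo right now
    extract: hypothetical callback that reads one package and returns
             its metadata snippet (the expensive step)
    """
    new_cache = {}
    rebuilt = []
    for name, stamp in sorted(current.items()):
        cached = cache.get(name)
        if cached is not None and cached[0] == stamp:
            # Unchanged "object file": reuse it as-is.
            new_cache[name] = cached
        else:
            # New or modified package: "recompile" just this one.
            new_cache[name] = (stamp, extract(name))
            rebuilt.append(name)
    # Packages missing from `current` are dropped simply by not copying them.
    return new_cache, rebuilt


def relink(cache):
    """The final "link" step: join the per-package snippets in name order."""
    return "\n".join(cache[name][1] for name in sorted(cache))
```

The rebuilt list is exactly the part that could be shipped to the user 
if the relink were deferred to the client side.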

> > I would rather not utilize xdelta, because you're still
> > regenerating the entire thing.  Having xmlets that virtually
> > add/subtract as a delta against primary.xml.gz would be optimal
> > for both sides of the equation.
>
> But then Seth rejects the idea because it makes for unmaintainable
> code.  And I sort of agree with him now that I see a simpler way to
> accomplish the same bandwidth savings.

You got me.  Not sure how the level of difficulty has changed at all.  
But, a couple of implementations wouldn't hurt.  Shoot, it all might 
not save *anything*. Doing it first, then throwing it out is what we 
need now.  If it works, great.  If not, *shrug*, we live and learn.

> > Another advantage of the delta method, is that the on-disk pickled
> > objects (or whatever back-end store is used) could be updated
> > incrementally based on xml snippets coming in. Instead of
> > regenerating the whole thing over again.
>
> This is certainly a good point, but it is also trickier to get right.
> And it might also turn out to be bigger: if you have to list what
> went away, you're probably emitting more information than xdelta's
> `skip these many bytes'.  

I would never go to this level of madness.  My proposal is about 
generating the XML necessary for the job, not a low-level binary diff 
between two runs of createrepo.  Before doing it like that, I'd explore 
the librsync option: just run createrepo once as usual and let rsync 
transfer the diff across the line.  Keeping track of two runs, although 
a bit novel, is a little too much.
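
For what it's worth, applying an xmlet on the receiving end could look 
roughly like the sketch below. The <delta>, <remove> and <add> 
elements are a format I'm inventing here for illustration, and the 
package entries are simplified to bare attributes; the real 
primary.xml is namespaced and far more detailed.

```python
import xml.etree.ElementTree as ET


def apply_delta(primary_xml, delta_xml):
    """Apply a hypothetical add/remove xmlet to a simplified package list.

    Removals are matched on the package name and applied first, so an
    upgrade is expressed as "remove the old entry, add the new one".
    """
    root = ET.fromstring(primary_xml)
    delta = ET.fromstring(delta_xml)

    # Names of packages the delta says went away.
    gone = {r.get("name") for r in delta.findall("remove")}
    for pkg in list(root.findall("package")):
        if pkg.get("name") in gone:
            root.remove(pkg)

    # Append the new or updated package entries.
    for pkg in delta.findall("add/package"):
        root.append(pkg)
    return root


primary = ('<metadata><package name="foo" ver="1.0"/>'
           '<package name="bar" ver="2.0"/></metadata>')
delta = ('<delta><remove name="foo"/>'
         '<add><package name="foo" ver="1.1"/></add></delta>')
updated = apply_delta(primary, delta)
```

The same walk over the snippet could just as well update pickled 
objects on disk instead of an in-memory tree.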

thanks,

-- 
-jeff