On Tue, 24 Feb 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> >
> > I (and Nicolas) by 'sorting order' mean here ordering of objects and
> > deltas in the pack file, i.e. whether we get _exactly_ the same (byte
> > for byte) packfile for the same want/have exchange (your proposal), or
> > even for the same arguments to git-pack-objects (which is a necessary,
> > although I think not sufficient condition).
>
> I know.
>
> My proposal though didn't require the same byte-for-byte pack file.
> Only that the objects were in a predictable order.  It didn't permit
> resuming in the middle of an object.  If the last object in the pack
> was truncated the client would resume by getting that object again,
> and may get a different byte sequence for that object representation.

Ah, so you meant skipping the first N _objects_, and not the first N
_bytes_ of a re-generated pack.  That's better.

Although in the case where packfiles are cached, I think you could
support resuming at a byte offset.  But I guess only in that case,
where exactly the same byte-for-byte packfile is re-sent / reused.

> Its a b**ch to know where you stopped though, as you could be in
> a long string of deltas whose base is in the portion you didn't
> yet receive.  Which means you can't identify that string that you
> already have, and pack-objects on resume can't assume you have
> those objects, because you only have the deltas for them and are
> lacking a way to restore them.

Moreover, from what I understand, the want/have exchange is about
_commits_, and it assumes that if you 'have' a commit, you have all
its ancestors, all trees (including those of the ancestors), and all
blobs (including those of the ancestors), in full: not just a delta
without its base.  Besides, if I remember correctly, we always write
a base before its delta; or am I mistaken here?

But one could take a look at the patches (present in the git mailing
list archive) which tried to add 'lazy clone' / 'remote alternates'
support.
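The "have implies ancestors" assumption above is just a reachability
closure over the commit graph.  A toy sketch (hypothetical commit
names, not real git code or data):

```python
# Toy illustration of the want/have assumption: a client that "has"
# a commit is assumed to have everything reachable from it, in full
# (commits plus their trees and blobs), never a bare delta.
# The parent mapping below is hypothetical.
parents = {"C": ["B"], "B": ["A"], "A": []}

def reachable(commit):
    """Return the set of commits reachable from `commit`, inclusive."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

# A single "have C" line lets the server assume the client
# already has A and B as well, and everything they reference.
print(sorted(reachable("C")))
```

This is exactly why a partial string of deltas is useless to the
exchange: "have" speaks in whole reachable closures, not in bytes.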
IIRC there was a 'haveonly' extension to the exchange protocol, which
was meant to say that you have (in full) only the given object, but
not necessarily its prerequisites.  Then you could filter those
'haveonly' objects out of the list of objects fed to git-pack-objects,
couldn't you?

> > Can we assume that packfiles are named correctly (i.e. the name of
> > a packfile matches its SHA-1 footer)?
>
> Wrong.
>
> The hash in "pack-$hash.pack"/"pack-$hash.idx" is *not* the 20 byte
> SHA-1 footer.  Its the 20 byte SHA-1 of the sorted object names who
> are in that pack.
>
> We should try not to assume that the pack's file name matches the
> sorted object names, but we can assume that the pack file name is
> "pack-$hash.pack" where $hash is a 40 character hexadecimal string.
> The dumb commit walkers already have this restriction built into
> them, and have for quite some time.
>
> Any pack writers, including fast-import, honor this naming standard
> in order to ensure they are compatible with the existing dumb
> commit walkers.

Ah.  So it is a _bit_ harder (for "dumb" protocols) than I thought.
Still much easier than resumable clone for the smart (pack
generating) protocols.

> > Therefore I think that restartable clone for "dumb" (commit walker)
> > protocols is an easy GSoC project, while restartable clone for
> > "smart" (generate packfile) protocols is at least of medium
> > difficulty, and might be harder.
>
> Probably quite right.  Unfortunately the majority of the git
> repositories out there are served with the smart protocol, because
> it is more efficient.  :)

A long, long time ago the rsync:// protocol was recommended for the
initial clone.  It has the serious disadvantage of possibly returning
a silently corrupted repository, as it didn't ensure that references
and objects were fetched in the correct sequence; it is thus
deprecated, and support for it has bit-rotted ;) in places...  I
wonder if it is possible to make rsync:// more robust...

[...]
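As an aside, the naming rule Shawn describes can be sketched in a few
lines.  This is only my reading of what he wrote, so treat it as an
assumption: $hash is the SHA-1 over the pack's object names, sorted,
fed to the hash in binary (20-byte) form:

```python
import hashlib

def pack_file_name(object_ids):
    """Sketch: derive "pack-$hash.pack" from the pack's object names.

    object_ids: 40-char hex SHA-1 strings of every object in the pack.
    Assumption (not checked against git source): $hash is the SHA-1 of
    the sorted object names, hashed in binary form.
    """
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(bytes.fromhex(oid))
    return "pack-%s.pack" % h.hexdigest()

# The two hex IDs below are just sample SHA-1 values for illustration.
ids = ["da39a3ee5e6b4b0d3255bfef95601890afd80709",
       "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"]
print(pack_file_name(ids))
```

If the rule really is this, the name depends only on *which* objects
are in the pack, not on entry order, delta choices, or compression,
which is why the name cannot be assumed to match the pack's trailing
checksum.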
> > I'll try to add a 'pack file cache for git-daemon' proposal to the
> > GSoC2009Ideas page... but I cannot be a mentor (or even co-mentor)
> > for this idea.
>
> The pack file cache project is likely easier than restarting a
> pack file.  Especially in the face of the threaded delta code.
>
> There are difficult details about making the cache secure so we can't
> overwrite repository data due to a buffer overflow.  Or making
> the cache prune itself so it doesn't run out of disk.  Etc.
> We've talked about a cache before on list.

Well, this is a _cache_.  On the one hand, having a pack cache would
make it easy to have resumable clone, if on resume you hit one of the
cached packfiles...  On the other hand, I wonder what improvement it
would give, as generating packs with delta reuse is, I think, quite
fast...

-- 
Jakub Narebski
Poland