Re: Submodule object store

David Lang <david.lang@xxxxxxxxxxxxxxxxxx> · Mon, 26 Mar 2007 15:40:15 -0800 (PST)

On Tue, 27 Mar 2007, Martin Waitz wrote:

hoi :)

On Mon, Mar 26, 2007 at 03:20:34PM -0800, David Lang wrote:
I want to be able to list all objects which are not reachable in the
object store, without traversing all submodules at the same time.
The only way I can think of to achieve this is to have one separate
object store per submodule and then do the traversal per submodule.

why do you want to optimize for the relatively rare fsck function rather
than the more common read functions (which would benefit from sharing
objects that are identical between projects)?

Because I don't know how to make it _possible_ for large repositories
otherwise.  Consider a Linux-distribution which handles each package
as one submodule.

I don't think that it's too much balanced towards fsck.
The separated object store also helps reduce the memory requirement for
large pushes/pulls.
Sharing objects can be achieved by alternates if you want.

alternates require explicitly setting up the sharing.

using the same object store makes this work automatically (think of all the 
copies of COPYING that would end up being the same, as a trivial example)
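
A minimal sketch of why a single store shares identical files without any setup, assuming the usual git blob naming (SHA-1 over a "blob <size>" header, a NUL byte, and the content). The SharedStore class and the package names are hypothetical, made up for illustration; this is not git's implementation.

```python
import hashlib


def blob_id(content: bytes) -> str:
    # Git-style blob ID: SHA-1 over "blob <size>", a NUL byte, then the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class SharedStore:
    """Hypothetical single object store used by every project."""

    def __init__(self):
        self.objects = {}  # object ID -> content

    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        # Identical content hashes to the same ID, so it is stored only once.
        self.objects.setdefault(oid, content)
        return oid


store = SharedStore()
license_text = b"GNU GENERAL PUBLIC LICENSE\n"  # stand-in for a COPYING file

# Three hypothetical packages each add their own copy of COPYING.
ids = [store.add(license_text) for _project in ("pkg-a", "pkg-b", "pkg-c")]
```

All three adds return the same ID and the store ends up holding one object. With separate stores, the same sharing would need alternates to be configured by hand.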

If someone comes up with a nice way to handle everything in one big
object store I would happily use that! :-)

what exactly are the problems with one big object store?

ones that I can think of:

1. when you are doing a fsck you need to walk all the trees and find out the 
list of objects that you know about.

  done as a tree of binary values you can hold a LOT in memory before running 
into swap.

  if the list is still much larger than available RAM, fsck could get an option 
to keep its trees on disk.

2. when creating a pack you will eventually run into pack-size limits with too 
many objects

  teach the pack creators to make packs that are subsets rather than everything 
(I believe that most of the smarts are there, it just needs the upper control 
logic to tell the existing things what to include)

3. when doing a pull it takes longer to figure out what to pull to get a 
duplicate of _everything_

  add a way to do a 'pull projectlist' that would look at what objects are 
needed by the project(s) requested and only try to pack up those objects
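
The three fixes above can be sketched on a toy object graph. This is only an illustration under stated assumptions: GRAPH, PROJECT_HEADS, and every function name are hypothetical, object IDs are short byte strings standing in for 20-byte binary SHA-1s, and a plain Python set stands in for the in-memory tree of binary values. None of this is existing git code.

```python
from collections import deque

# Hypothetical object graph: object ID -> IDs it references.
GRAPH = {
    b"A1": [b"T1"],          # head commit of project A -> its tree
    b"T1": [b"LIC", b"FA"],  # tree -> shared COPYING blob + project file
    b"B1": [b"T2"],          # head commit of project B -> its tree
    b"T2": [b"LIC", b"FB"],
    b"LIC": [],
    b"FA": [],
    b"FB": [],
    b"OLD": [],              # referenced by nothing
}
PROJECT_HEADS = {"project-a": b"A1", "project-b": b"B1"}


def reachable(roots, graph):
    """Breadth-first walk collecting every object reachable from roots."""
    seen = set()
    queue = deque(roots)
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue
        seen.add(oid)
        queue.extend(graph.get(oid, ()))
    return seen


def unreachable(graph, heads):
    """Point 1: objects in the store that no project head leads to."""
    return set(graph) - reachable(heads.values(), graph)


def split_into_packs(oids, max_objects):
    """Point 2: cut an object list into subsets of at most max_objects."""
    ordered = sorted(oids)
    return [ordered[i:i + max_objects]
            for i in range(0, len(ordered), max_objects)]


def objects_to_send(projects, heads, graph, receiver_has=frozenset()):
    """Point 3: only the objects the requested projects need."""
    wanted = reachable((heads[p] for p in projects), graph)
    return wanted - set(receiver_has)
```

For example, objects_to_send(["project-a"], PROJECT_HEADS, GRAPH) walks only project A's objects and never touches project B's, while unreachable(GRAPH, PROJECT_HEADS) still has to walk from every head.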

what else is there that I'm not thinking of? so far these look like long-term 
problems as opposed to short-term problems, and all of them have fairly simple 
fixes that can be implemented as they become an issue.

David Lang