Re: Submodule object store

hoi :)

It's really funny that when I proposed one big object database everybody
wanted it separated, and now that I propose a separate database everybody
wants it as one combined database.
I read this as a sign that people really try to think critically about
the design, which is a good thing and will hopefully lead to a good
and stable submodule implementation.

On Mon, Mar 26, 2007 at 03:40:15PM -0800, David Lang wrote:
> using the same object store makes this work automatically (think of all the 
> copies of COPYING that would end up being the same, as a trivial example)

Yes, but I guess not much more than COPYING, INSTALL, some trivial
Makefiles and empty files will be shared between subprojects.
Except when you have the same subproject in your tree multiple times,
of course.

Yet this sharing is exactly why I started to do it that way, until Linus
stopped me.

> >If someone comes up with a nice way to handle everything in one big
> >object store I would happily use that! :-)
> 
> what exactly are the problems with one big object store?

I think we really have to discuss this separation on several layers:
traversal, pack-files, and object database.

For the traversal the point of separating it into a per-module traversal
is that only one module has to be loaded into RAM at a time.
This affects all operations which do a (potentially) recursive traversal:
push, pull, fsck, prune, and repack.
However, a separated traversal will no longer be guaranteed to list each
object only once, so this has to be handled in some way.
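To make the duplicate-listing problem concrete, here is a minimal sketch
(not actual git code; module layout and object IDs are made up for
illustration) of walking one module at a time while sharing a "seen" set
so each object is still emitted only once:

```python
# Hypothetical sketch: per-module traversal with cross-module
# deduplication.  Only one module's data is held at a time; the shared
# 'seen' set restores the each-object-listed-once guarantee.

def traverse_module(module):
    """Yield the object IDs reachable within a single module (stand-in
    for a real tree walk)."""
    yield from module["objects"]

def traverse_all(modules):
    """Walk each module in turn, skipping objects already listed by an
    earlier module."""
    seen = set()
    for module in modules:
        for oid in traverse_module(module):
            if oid not in seen:
                seen.add(oid)
                yield oid

modules = [
    {"name": "super", "objects": ["a1", "b2", "c3"]},
    {"name": "sub",   "objects": ["c3", "d4"]},  # c3 shared, e.g. COPYING
]
print(list(traverse_all(modules)))  # → ['a1', 'b2', 'c3', 'd4']
```

The cost, of course, is that the `seen` set itself grows with the total
number of objects, which is exactly the kind of memory pressure the
per-module split was meant to avoid.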

Pack files should have better access patterns if they are per-module.
Most of the time you are only interested in one individual module and
locality is important here.

Separating the entire object database is a way to improve unreachability
analysis, as it can then be done per module.
The other two separations are easier to implement with a separated
object database, but that's not too strong an argument.
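For reference, per-module unreachability analysis is just mark-and-sweep
scoped to one module's store.  A toy sketch (names and graph are
illustrative, not git's actual data structures):

```python
# Hypothetical sketch of per-module unreachability analysis: mark
# everything reachable from the module's refs, then anything in that
# module's object store left unmarked is prunable.

def find_unreachable(store, refs, edges):
    """store: set of object IDs in one module's object database.
    refs: root object IDs (the module's refs).
    edges: mapping oid -> list of referenced oids."""
    marked = set()
    stack = list(refs)
    while stack:
        oid = stack.pop()
        if oid in marked:
            continue
        marked.add(oid)
        stack.extend(edges.get(oid, ()))
    return store - marked

store = {"a", "b", "c", "d"}
edges = {"a": ["b"], "b": [], "c": ["d"]}
print(sorted(find_unreachable(store, refs=["a"], edges=edges)))  # → ['c', 'd']
```

With a shared store the same sweep has to consider the roots of every
module at once, which is why pruning gets harder there.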


So if we can come up with a nice way to do unreachability analysis we
can indeed go on with the shared object database and tackle the
remaining scalability issues as they arise.  Those could then be added
later without changing the on-disk format.

> ones that I can think of:
> 
> 1. when you are doing a fsck you need to walk all the trees and find out 
> the list of objects that you know about.
> 
>   done as a tree of binary values you can hold a LOT in memory before 
>   running into swap.

Could you explain the algorithm you are thinking about in more detail?

>   if it's enough larger than available RAM, then having fsck use trees 
>   on disk is an option.

This could simplify some things.
There could be an on-disk index of all known objects, so that the sha1
sums do not have to be loaded into RAM all at once.
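Something like a sorted file of fixed-width sha1 records would already
work: it can be binary-searched directly on disk.  A rough sketch (the
20-byte record format and file layout are assumptions for illustration,
not an existing git format):

```python
# Hypothetical sketch: a sorted on-disk index of raw sha1 object IDs,
# queried with binary search so the full list never has to be loaded
# into RAM at once.
import hashlib
import os
import tempfile

RECORD = 20  # raw SHA-1 digest length in bytes

def write_index(path, oids):
    """Write the object IDs as sorted fixed-width records."""
    with open(path, "wb") as f:
        for oid in sorted(oids):
            f.write(oid)

def contains(path, oid):
    """Binary-search the index file, seeking instead of reading it all."""
    nrecords = os.path.getsize(path) // RECORD
    lo, hi = 0, nrecords
    with open(path, "rb") as f:
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD)
            entry = f.read(RECORD)
            if entry == oid:
                return True
            if entry < oid:
                lo = mid + 1
            else:
                hi = mid
    return False

oids = [hashlib.sha1(s).digest() for s in (b"one", b"two", b"three")]
path = os.path.join(tempfile.mkdtemp(), "index")
write_index(path, oids)
print(contains(path, oids[0]))                          # → True
print(contains(path, hashlib.sha1(b"four").digest()))   # → False
```

Memory use is then O(1) per lookup at the cost of O(log n) seeks, and
the index can be rebuilt incrementally as new objects arrive.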

-- 
Martin Waitz
