Hi,

in the last weeks some ideas were floated here on the list to make git usable not only for source code tracking, but also for synchronizing mail storage folders or even complete file systems. Time to throw some code at the list:

http://members.chello.at/replica/replica-20060220-1.tar.gz

Before describing the code, I want to point out some interesting properties of git. Git has similarities with two distinct application areas: backup and synchronisation.

Consider the object database alone: it contains snapshots in time of the state of the file system, which is exactly what a backup application produces. Currently, git holds only a subset of the files' meta data in its database and would need to gain support for directory meta data, non-regular files, statically linked files, and sparse files. But then it would be comparable to tar or other backup tools.

Regarding the second point, synchronisation with others is clearly what git is all about. Furthermore, in contrast to synchronising file systems like coda, intermezzo and whatever they are called, git's merging tools are far more sophisticated. In fact, they are too sophisticated for synchronizing file systems. On file systems, merging at the level of whole files is fully sufficient; without additional knowledge about the properties and meaning of the individual files, anything finer is even dangerous!

Combining these two points, it is evident that git is more than a source code control system, and folks are looking for ways to use it for other purposes.

One small drawback of git hurts its portability to other platforms: the frequent use of scripting languages to implement important parts of its functionality. Of course, for git that's an advantage since development is so much faster. But for porting git to platforms other than Unix, it's a real problem. I for one would love to use it to synchronize between Sun, Linux and Windows machines in our department.
Additionally, when git is used for synchronizing large datasets, the scripts may easily become a performance bottleneck. I realize that it is nearly impossible to change git to not depend on scripting in the near future, but for the application I have in mind I would like to solve that; that's the reason I started playing around with the code presented above.

Speaking of performance, looking at the index file handling code I suspect that it is not fast enough to handle really large datasets. For instance, reading an index file is O(N) since each individual record is read in and then interpreted. The same applies to operations on the index file (a pointer to each record is held in a linear array which must be shifted when inserting or deleting a record) and to writing out the index file. I expect that this could be improved if the layout of the index file were changed slightly: each record should have the same size. And as Linus has recently shown, more speed could be gained if merging is handled directly on tree objects instead of in the index: one more reason to play around with some code.

But one difference in the usage pattern between source code version control and backup/synchronization requires a basic change to the git algorithms which I suspect will not be acceptable, and which therefore requires a distinct implementation: when checking in source code, it is assumed that the files being checked in do not change on disk. In the case of backup and synchronization, such guarantees can _not_ be made. Such an application should run as a background task, hence stopping the generation of new data or the alteration of existing data is generally not an option.

But it's interesting that the git techniques are a natural solution to this problem: the index file together with the object database allows splitting the work needed to synchronize with other sides into three phases.
First, updating the index from the file system and altering the object database; second, merging the _constant_ index file with the other sides; and finally, updating the file system (which may have been altered in the meantime) with the updates stored in the index file, without overwriting data that has changed in the file system.

What does this change in git? Only one basic assumption. The size of a blob is stored in the blob header, and git assumes that this size is correct. In the backup/synchronization use case this does not hold: while a data file is being read into the object database, it may change its size on disk, and the header of the blob may get out of sync with the content. But this should be tolerated.

In summary, using git techniques for backup and synchronisation is an interesting possibility if some shortcomings can be solved. First, it should handle all file types the way tar handles them. Second, the merging tools should be simplified; merging at file level is sufficient: when conflicts occur, tell the user, but don't try to merge automatically. Third, it would be good if it were implemented completely in C to make it easily portable. Fourth, it should handle any amount of data efficiently and should be able to synchronize arbitrary content between synchronization partners, including complete git repositories (another hint that what I have in mind should be realized outside of git).

And there we go: the presented application, replica, tries to realize the above points. It is meant to be a distributed synchronization tool. Think of it as a two-way rsync with a permanent state (therefore minimizing the workload on the servers).

What is implemented so far? Well, not that much: reading a file system into the object database, generating tree objects (with file names converted to Unicode) and commit objects, and some debugging code. Synchronization with other sides is coming next, but I thought you should be made aware of my current efforts.
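
Coming back to the blob size assumption discussed above, the following is a minimal sketch of what tolerating an out-of-sync header could look like. It is not taken from the replica sources: the struct, the function names and the temporary file path are all hypothetical, and it assumes a POSIX system. The size is announced from fstat() before any data is read, the bytes that really arrive are counted separately, and a mismatch is reported instead of being treated as corruption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping record; the names are illustrative only. */
struct blob_read {
    off_t header_size;  /* size announced up front, taken from fstat() */
    off_t actual_size;  /* bytes really consumed until EOF */
};

/* Step 1: fix the size that would go into the blob header. */
static int announce_size(int fd, struct blob_read *br)
{
    struct stat st;

    assert(br != NULL);
    if (fstat(fd, &st) < 0)
        return -1;
    br->header_size = st.st_size;
    br->actual_size = 0;
    return 0;
}

/* Step 2: read until EOF and count what actually arrives. */
static int consume(int fd, struct blob_read *br)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0)
        br->actual_size += n;  /* a real tool would hash or store buf here */
    return n < 0 ? -1 : 0;
}

/* A mismatch is information for the caller, not a fatal error. */
static int size_drifted(const struct blob_read *br)
{
    return br->header_size != br->actual_size;
}

/* Demo: the file grows from 5 to 8 bytes between the two steps.
 * Returns 0 if exactly that drift was observed, -1 otherwise. */
static int drift_demo(void)
{
    char path[] = "/tmp/replica-sketch-XXXXXX";  /* assumed temp location */
    struct blob_read br;
    int ok = -1;
    int wfd = mkstemp(path);
    int rfd;

    if (wfd < 0)
        return -1;
    rfd = open(path, O_RDONLY);
    if (rfd >= 0 &&
        write(wfd, "hello", 5) == 5 &&
        announce_size(rfd, &br) == 0 &&
        write(wfd, "!!!", 3) == 3 &&      /* file changes while being read */
        consume(rfd, &br) == 0 &&
        br.header_size == 5 && br.actual_size == 8 && size_drifted(&br))
        ok = 0;
    if (rfd >= 0)
        close(rfd);
    close(wfd);
    unlink(path);
    return ok;
}
```

Calling drift_demo() from a main() exercises the whole path. What to do once drift is detected (re-read the file, store the blob under the size actually seen, or flag it for the next run) is a policy decision this sketch leaves open.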
One difference to git is the index file handling; it may become interesting for git itself as well. I have tried to implement it in a simple but efficient way. Basically, my version of the index file is built up of AVL trees with a 1:1 relationship to the directory tree stored on disk. Unlike in git, directory entries representing subdirectories are also held in the index file, together with their meta data. Hence, even empty directories can be handled as the user expects.

I want to highlight the implementation of the AVL trees. Their most interesting property is that they are position independent. One can map the index file into memory and then use it directly without changing any values in it. Hence, reading the index file and writing it out is essentially O(1); well, neglecting the time needed to read the contents from disk or write them out, of course. The AVL tree links are not pointers per se, but are realized as offsets to their neighbors. I think (and hope) this implementation may prove useful for git as well.

Herbert.

PS: Sorry about this long posting, but hopefully some may find my work interesting.
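
As an appendix to the index description above, here is a minimal sketch of the offset idea. It is an illustration under assumptions, not the layout used in replica: the node fields and all names are hypothetical, and to keep it short it is a plain binary search tree without the AVL rebalancing. Each link stores the byte distance from a node to its child (0 meaning no child), so the whole image can be copied or mapped at any address and searched without fixing up a single value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES 16

/* Hypothetical fixed-size node; links are self-relative byte offsets. */
struct node {
    int32_t  left;   /* offset from this node to the left child, 0 = none */
    int32_t  right;  /* offset from this node to the right child, 0 = none */
    uint32_t key;
    uint32_t value;
};

/* The whole "index file" image: a counter plus a flat node array. */
struct index_image {
    uint32_t    used;
    struct node nodes[MAX_NODES];  /* nodes[0] is the root once used > 0 */
};

/* Turn a self-relative offset into an address valid in the current copy. */
static const struct node *follow(const struct node *n, int32_t off)
{
    return off ? (const struct node *)((const char *)n + off) : NULL;
}

/* Unbalanced insert (AVL rotations omitted). Returns -1 when full. */
static int insert(struct index_image *img, uint32_t key, uint32_t value)
{
    struct node *fresh, *cur;

    if (img->used == MAX_NODES)
        return -1;
    fresh = &img->nodes[img->used];
    fresh->left = fresh->right = 0;
    fresh->key = key;
    fresh->value = value;
    if (img->used++ == 0)
        return 0;                      /* first node becomes the root */
    cur = &img->nodes[0];
    for (;;) {
        int32_t *link = key < cur->key ? &cur->left : &cur->right;
        if (*link == 0) {
            *link = (int32_t)((char *)fresh - (char *)cur);
            assert(*link != 0);        /* a node is never its own child */
            return 0;
        }
        cur = (struct node *)((char *)cur + *link);
    }
}

static const struct node *lookup(const struct index_image *img, uint32_t key)
{
    const struct node *cur = img->used ? &img->nodes[0] : NULL;

    while (cur && cur->key != key)
        cur = follow(cur, key < cur->key ? cur->left : cur->right);
    return cur;
}

/* Demo: build a tree, move the raw bytes elsewhere, destroy the original,
 * and search the copy untouched. Returns 0 on success, -1 otherwise. */
static int relocation_demo(void)
{
    struct index_image original = {0}, moved;
    const uint32_t keys[] = {50, 20, 70, 10, 30};
    const struct node *hit;
    size_t i;

    for (i = 0; i < sizeof keys / sizeof keys[0]; i++)
        if (insert(&original, keys[i], keys[i] * 2) != 0)
            return -1;
    memcpy(&moved, &original, sizeof moved);    /* stands in for mmap() */
    memset(&original, 0xff, sizeof original);   /* old addresses are gone */
    hit = lookup(&moved, 30);
    if (!hit || hit->value != 60)
        return -1;
    return lookup(&moved, 99) == NULL ? 0 : -1;
}
```

The memcpy() here plays the role of mapping the file at a different address. With absolute pointers, every link would have to be rewritten after such a move, which is exactly the O(N) work the offsets avoid.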