Ruslan Sivak wrote:
> This is a bit different than what I was proposing. I know that
> backupPC already does this on a file level, but I want a filesystem
> that does it at a block level. File level only helps if you're
> backing up multiple systems and they all have the same exact files.
> Block level would help a lot more I think. You'd be able to do a
> full backup every night and have it only take up around the same
> space as a differential backup. Things like virtual machine disk
> images, which a lot of times are clones of each other, could take up
> only a small additional amount of space for each clone, proportional
> to the changes that are made to that disk image.

Well then I would look at backup software that does block-level
de-duplication. Even if the file system did do this, as the backup
software read the files it would re-create the duplicate data unless
the backup software was intimately married to the file system, which
makes things a little too proprietary.

You will find that de-duplication can happen on many different levels
here. I was proposing it for the near-line data at rest, while the
far-line or archival data at rest would be a different scenario.
Near-line needs to be more performance conscious than far-line.

> Nobody really answered this, so I'll ask again. Is there a Windows
> version of FUSE? How does one test a FUSE filesystem while
> developing it? Would be nice to just be able to run something from
> Eclipse, once you've made your changes, and have a drive mounted and
> ready to test. Being able to debug a filesystem while it's running
> would be great too. Anyone here with experience building FUSE
> filesystems?

While FUSE is a distinctly Linux development, Windows has had
installable file system filters for a long time. These work a lot
like stackable storage drivers in Linux and are the basis of a lot of
storage tools on Windows, including anti-virus software (as well as
rootkits).

Windows does have a de-duplication service that works on the file
level, much like what I proposed, called the Single Instance Storage
Groveler (I like to call it the single instance storage mangler :-).
High-end backup software companies have block-level de-duplication
options for their software, and proprietary storage appliance
companies also have block-level de-duplication for their near- and
far-line storage (big $$$).

At the bottom of this message I've put a rough sketch of the
file-level groveler idea I proposed earlier (quoted below), plus a
second sketch for estimating block-level savings.

> Ross S. W. Walker wrote:
> >
> > These are all good and valid issues.
> >
> > Thinking about it some more I might just implement it as a system
> > service that scans given disk volumes in the background, keeps a
> > hidden directory where it stores its state information and
> > hardlinks named after the md5 hash of the files on the volume. If
> > a collision occurs with an existing md5 hash then the new file is
> > unlinked and re-linked to the md5 hash file, and if an md5 hash
> > file exists with no secondary links then it is removed. Maybe
> > monitor the journal or use inotify to just get new files, and once
> > a week do a full volume scan.
> >
> > This way the file system performs as well as it normally does, and
> > as things go forward duplicate files are eliminated (combined). Of
> > course the problem arises of what to do when one duplicate is
> > modified but the other should remain the same...
> >
> > Of course what you said about revisions that differ just a little
> > won't take advantage of this, but it's file level so it only works
> > with whole files, still better than nothing.
> >
> > -Ross
> >
> >
> > -----Original Message-----
> > From: centos-bounces@xxxxxxxxxx <centos-bounces@xxxxxxxxxx>
> > To: CentOS mailing list <centos@xxxxxxxxxx>
> > Sent: Thu Dec 06 08:10:38 2007
> > Subject: Re: Filesystem that doesn't store duplicate data
> >
> > On Thursday 06 December 2007, Ross S. W. Walker wrote:
> > > How about a FUSE file system (userland, i.e. NTFS-3G) that
> > > layers on top of any file system that supports hard links
> >
> > That would be easy but I can see a few issues with that approach:
> >
> > 1) On file level rather than block level you're going to be much
> > more inefficient. I for one have gigabytes of revisions of files
> > that have changed a little between each file.
> >
> > 2) You have to write all datablocks to disk and then erase them
> > again if you find a match. That will slow you down and create some
> > weird behavior. I.e. you know the FS shouldn't store duplicate
> > data, yet you can't use cp to copy a 10G file if only 9G are free.
> > If you copy an 8G file, you see the usage increase till only 1G is
> > free, then when your app closes the file, you are going to go back
> > to 9G free...
> >
> > 3) Rather than continuously looking for matches on block level,
> > you have to search for matches on files that can be any size. That
> > is fine if you have a 100K file - but if you have a 100M or larger
> > file, the checksum calculations will take you forever. This means
> > rather than adding a specific, small penalty to every write call,
> > you add an unknown penalty, proportional to file size, when
> > closing the file. Also, the fact that most C coders don't check
> > the return code of close doesn't make me happy there...
> >
> > Peter.
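
For what it's worth, here is a rough Python sketch of the file-level
groveler idea quoted above, just to make the mechanics concrete. It is
only a sketch under my own assumptions: the hidden store is a
.dedup-store directory at the volume root, files are merged only on a
full MD5 match, and everything lives on one filesystem so hardlinks
work.

import hashlib
import os
import sys

def md5sum(path, bufsize=1 << 20):
    # Hash the whole file; the digest becomes the canonical link name.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

def dedup_volume(root):
    store = os.path.join(root, '.dedup-store')
    os.makedirs(store, exist_ok=True)
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == store or dirpath.startswith(store + os.sep):
            continue  # never touch the hidden store itself
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            canon = os.path.join(store, md5sum(path))
            if not os.path.exists(canon):
                os.link(path, canon)   # first copy becomes the canonical link
            elif not os.path.samefile(path, canon):
                os.unlink(path)        # duplicate: re-link to the canonical copy
                os.link(canon, path)
    # md5 files left with no secondary links are removed
    for name in os.listdir(store):
        canon = os.path.join(store, name)
        if os.stat(canon).st_nlink == 1:
            os.unlink(canon)

if __name__ == '__main__':
    dedup_volume(sys.argv[1])

It also inherits the exact problem I mentioned: once two files share
an inode, modifying one in place modifies the other, unless the
editing application writes a new file and renames it over the old one,
which breaks the hardlink.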
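
And to put rough numbers behind the block-level versus file-level
argument, one quick way to estimate the potential savings over, say, a
directory of VM disk images is to hash fixed-size blocks and count the
repeats. Again just a sketch; the 64 KB block size and SHA-1 digests
are arbitrary choices of mine, not what any real product uses.

import hashlib
import sys

BLOCK = 64 * 1024

def estimate(paths):
    # Count how many fixed-size blocks have been seen before.
    seen = set()
    total = dupes = 0
    for path in paths:
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(BLOCK), b''):
                digest = hashlib.sha1(block).digest()
                total += len(block)
                if digest in seen:
                    dupes += len(block)  # block already stored elsewhere
                else:
                    seen.add(digest)
    return total, dupes

if __name__ == '__main__':
    total, dupes = estimate(sys.argv[1:])
    print("scanned %d MB, %d MB duplicated (%.1f%%)"
          % (total >> 20, dupes >> 20, 100.0 * dupes / max(total, 1)))

Because the hashing here is per fixed-size block, a real block-level
implementation pays a small, bounded cost on every write, rather than
the close-time penalty proportional to file size that Peter describes
for whole-file checksums.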