Ruslan Sivak wrote:
>
> Peter Arremann wrote:
> > On Wednesday 05 December 2007, redhat@xxxxxxxxxxx wrote:
> >
> >> You'd think that using this technology on a live filesystem could incur a
> >> significant performance penalty due to all those calculations (fuse module
> >> anyone ?). Imagine a hardware optimized data de-duplication disk
> >> controller, similar to raid XOR optimized cpus. Now that would be cool. All
> >> it would need to store was meta-data when it had already seen the exact
> >> same block. I think fundamentally it is similar in result to on the fly
> >> disk compression.
> >>
> >
> > Actually, the impact - if the filesystem is designed correctly - shouldn't be
> > that horrible. After all, Sun has managed to integrate checksums into ZFS and
> > still get great performance. In addition, ZFS doesn't directly overwrite data
> > but uses a new datablock each time...
> >
> > What you would have to do then is keep a lookup table with the checksums to
> > find possible matches quickly. Then when you find one, do another compare to
> > be 100% sure you didn't have a collision on your checksums. If that works,
> > then you can reference that datablock.
> >
> > It is still a lot of work, but as sun showed, on the fly compares and
> > checksums are doable without too much of a hit.
> >
> > Peter.
> >
>
> I'm not very knowledgeable on how filesystems work. Is there a primer I
> can brush up on somewhere? I'm thinking about implementing a proof of
> concept using Java and Fuse.

How about a FUSE file system (userland, like NTFS-3G) that layers on top of any file system that supports hard links? It would intercept the FS API, store all files in a hidden directory named after their MD5 hash, and hard link each one to its file name in the user's directory structure.
When the number of links drops to 1 (only the hash-named entry in the hidden directory is left), that entry is removed. When new files are copied in and the hash matches an existing entry, the data is discarded and only a hard link is made. Of course it will be a little more involved than this, but the idea is to keep it really simple so it's less likely to break.

-Ross

_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos
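
A rough sketch of the hard-link scheme from the reply above, written as plain Python functions rather than an actual FUSE layer. Everything named here is hypothetical and only for illustration: the `.dedup_store` directory name and the `add_file` / `remove_file` helpers are assumptions, not part of any existing tool. It also folds in the "do another compare" step suggested earlier in the thread, so a matching MD5 is not trusted on its own.

```python
import filecmp
import hashlib
import os
import shutil

STORE_DIR = ".dedup_store"  # hypothetical name for the hidden directory


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_file(root, src, rel_name):
    """Store src under its hash and hard link it to root/rel_name."""
    store = os.path.join(root, STORE_DIR)
    os.makedirs(store, exist_ok=True)
    blob = os.path.join(store, md5_of(src))
    if os.path.exists(blob):
        # Hash already known: verify byte-for-byte before discarding the data.
        if not filecmp.cmp(src, blob, shallow=False):
            raise RuntimeError("hash collision; a real design needs a fallback")
    else:
        shutil.copyfile(src, blob)
    dest = os.path.join(root, rel_name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.link(blob, dest)  # the user-visible name shares the blob's inode
    return blob


def remove_file(root, rel_name):
    """Unlink a user-visible name; drop the blob once nothing else uses it."""
    dest = os.path.join(root, rel_name)
    blob = os.path.join(root, STORE_DIR, md5_of(dest))
    os.unlink(dest)
    # A link count of 1 means only the hash-named entry is left.
    if os.path.exists(blob) and os.stat(blob).st_nlink == 1:
        os.unlink(blob)
```

One part of the "little more involved" side: hard-linked names share a single inode, so an in-place write through one name would change every deduplicated copy. A real implementation would presumably have to break the link (copy, then write) before modifying a file, which this sketch does not attempt.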