This is a bit different from what I was proposing. I know that BackupPC
already does this at the file level, but I want a filesystem that does it
at the block level. File level only helps if you're backing up multiple
systems and they all have exactly the same files. Block level would help
a lot more, I think: you'd be able to do a full backup every night and
have it take up only about as much space as a differential backup.
Things like virtual machine disk images, which are often clones
of each other, could take up only a small additional amount of space for
each clone, proportional to the changes made to that disk image.
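To make the block-level idea concrete, here is a minimal sketch in Python. It is illustrative only, not how any real filesystem lays data out on disk: files are split into fixed-size blocks, each unique block is stored once under its hash, and a file is reduced to the list of hashes needed to reconstruct it.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

block_store = {}   # hash -> block bytes (stands in for on-disk block storage)

def store_file(data):
    """Split data into blocks; store each unique block only once.
    Returns a 'recipe': the list of block hashes that rebuilds the file."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        block_store.setdefault(h, block)  # a duplicate block costs nothing extra
        recipe.append(h)
    return recipe

def read_file(recipe):
    """Reassemble a file from its block recipe."""
    return b"".join(block_store[h] for h in recipe)

# Two "disk images" that differ in only one block:
image_a = b"A" * BLOCK_SIZE * 3
image_b = b"A" * BLOCK_SIZE * 2 + b"B" * BLOCK_SIZE
r_a = store_file(image_a)
r_b = store_file(image_b)
print(len(r_a) + len(r_b), len(block_store))  # → 6 2 (6 logical blocks, 2 stored)
```

The second image adds only the one block that actually changed, which is exactly why cloned VM images would stay cheap under such a scheme.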
Nobody really answered this, so I'll ask again: is there a Windows
version of FUSE? And how does one test a FUSE filesystem while developing
it? It would be nice to be able to run something from Eclipse once
you've made your changes, and have a drive mounted and ready to test.
Being able to debug a filesystem while it's running would be great too.
Is anyone here experienced with building FUSE filesystems?
Russ
Ross S. W. Walker wrote:
These are all good and valid issues.
Thinking about it some more, I might just implement it as a system
service that scans given disk volumes in the background and keeps a
hidden directory where it stores its state information, plus hard links
named after the MD5 hashes of the files on the volume. If a new file's
MD5 hash collides with an existing one, the new file is unlinked and
re-linked to the MD5 hash file; if an MD5 hash file is left with no
secondary links, it is removed. Maybe monitor the journal or use
inotify to pick up just the new files, and do a full volume scan once a week.
This way the filesystem performs as well as it normally does, and as
things go forward duplicate files are eliminated (combined). Of course
the problem arises of what to do when one duplicate is modified but the
other should remain the same...
Of course, as you said, revisions that differ only a little
won't take advantage of this; it works at the file level, so it only
deduplicates whole files, but that's still better than nothing.
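As a rough illustration, the scan-and-hardlink scheme described above could look like the sketch below. The hidden-directory name, the single-pass scan, and the helper names are my assumptions, and it deliberately ignores the modified-duplicate problem: since hard links share one inode, an in-place edit to one name changes them all.

```python
import hashlib
import os
import tempfile

def md5_of(path):
    """Stream a file through MD5 (the hash proposed above)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def dedup_volume(root, store=".dedup"):
    """One scan pass: hard-link every regular file to a hidden per-hash
    file, so files with identical contents end up sharing one inode."""
    store_dir = os.path.join(root, store)
    os.makedirs(store_dir, exist_ok=True)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != store]  # skip our state dir
        for name in filenames:
            path = os.path.join(dirpath, name)
            canon = os.path.join(store_dir, md5_of(path))
            if not os.path.exists(canon):
                os.link(path, canon)      # first copy becomes the canonical one
            elif not os.path.samefile(path, canon):
                os.unlink(path)           # duplicate: re-link to the canonical copy
                os.link(canon, path)

# Demo in a throwaway directory: two files with identical contents.
root = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    with open(os.path.join(root, name), "wb") as f:
        f.write(b"same contents")
dedup_volume(root)
a = os.stat(os.path.join(root, "a.txt"))
b = os.stat(os.path.join(root, "b.txt"))
print(a.st_ino == b.st_ino)  # → True: both names now share one inode
```

A real service would also have to remove hash files whose link count drops to one, and handle the journal/inotify feed, which this sketch leaves out.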
-Ross
-----Original Message-----
From: centos-bounces@xxxxxxxxxx <centos-bounces@xxxxxxxxxx>
To: CentOS mailing list <centos@xxxxxxxxxx>
Sent: Thu Dec 06 08:10:38 2007
Subject: Re: Filesystem that doesn't store duplicate data
On Thursday 06 December 2007, Ross S. W. Walker wrote:
> How about a FUSE file system (userland, i.e. like NTFS-3G) that layers
> on top of any file system that supports hard links
That would be easy, but I can see a few issues with that approach:
1) At the file level rather than the block level you're going to be much more
inefficient. I, for one, have gigabytes of revisions of files that have each
changed only a little from the previous one.
2) You have to write all data blocks to disk and then erase them again if you
find a match. That will slow you down and create some weird behavior, i.e.
you know the FS shouldn't store duplicate data, yet you can't use cp to copy
a 10G file if only 9G are free. If you copy an 8G file, you see the usage
increase until only 1G is free; then, when your app closes the file, you
go back to 9G free...
3) Rather than continuously looking for matches at the block level, you have to
search for matches on files that can be any size. That is fine if you have a
100K file, but if you have a 100M or larger file the checksum calculations
will take forever. This means that rather than adding a specific, small
penalty to every write call, you add an unknown penalty, proportional to file
size, when closing the file. Also, the fact that most C coders don't check the
return code of close() doesn't make me happy there...
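For what it's worth, the close-time penalty in point 3 could be reduced by checksumming incrementally as each write arrives, so close() only finalizes the digest. A hypothetical sketch, not taken from any existing implementation; the class and method names are my own:

```python
import hashlib

class HashingWriter:
    """Accumulates a checksum as data is written, so close() only
    finalizes the digest instead of re-reading the whole file."""
    def __init__(self):
        self._hash = hashlib.md5()
        self._size = 0
    def write(self, data):
        self._hash.update(data)   # small, fixed cost added to each write call
        self._size += len(data)
    def close(self):
        return self._hash.hexdigest()  # O(1): no re-scan at close time

w = HashingWriter()
for _ in range(3):
    w.write(b"chunk of a large file")
digest = w.close()
print(digest == hashlib.md5(b"chunk of a large file" * 3).hexdigest())  # → True
```

This spreads Peter's "unknown penalty proportional to file size" across the write calls, though it still doesn't help with files that are modified in place rather than written sequentially.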
Peter.
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos