redhat@xxxxxxxxxxx wrote:
----- Original Message -----
From: rsivak@xxxxxxxxxxxxx
To: "CentOS Mailing list" <centos@xxxxxxxxxx>
Sent: Thursday, December 6, 2007 11:18:16 AM (GMT+1000) Australia/Brisbane
Subject: Filesystem that doesn't store duplicate data
Is there such a filesystem available? It seems like it wouldn't be
too hard to implement... Basically do things on a block-by-block
basis. Store the MD5 of each block in a table, and when writing a new
block, check whether its MD5 already exists; if it does, point the
new block to the old block. Since MD5 is not guaranteed unique, it
might be necessary to compare the two blocks and, if they are indeed
different, handle it somehow.
When modifying an existing block that has multiple pointers, copy the
block and modify the new block.
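
As a rough illustration of the scheme described above, here is a
minimal in-memory sketch in Python. It is not a filesystem, only the
bookkeeping: the class name DedupStore and all of its fields and
methods are hypothetical names made up for this example. It treats an
MD5 match as a hint and compares the bytes before sharing a block, and
it handles the "multiple pointers" case by never changing a stored
block in place.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps one copy of each distinct block (hypothetical)."""

    def __init__(self):
        self.blocks = {}    # block id -> block contents (immutable bytes)
        self.refs = {}      # block id -> number of addresses pointing at it
        self.by_hash = {}   # MD5 digest -> list of block ids with that digest
        self.table = {}     # logical address -> block id
        self.next_id = 0

    def _release(self, addr):
        """Drop the reference held by addr; free the block if nobody else uses it."""
        bid = self.table.pop(addr, None)
        if bid is None:
            return
        self.refs[bid] -= 1
        if self.refs[bid] == 0:
            digest = hashlib.md5(self.blocks[bid]).digest()
            self.by_hash[digest].remove(bid)
            if not self.by_hash[digest]:
                del self.by_hash[digest]
            del self.blocks[bid], self.refs[bid]

    def write(self, addr, data):
        """Store data at addr, reusing an identical existing block if there is one."""
        data = bytes(data)
        digest = hashlib.md5(data).digest()
        candidates = self.by_hash.setdefault(digest, [])
        # A matching digest is only a hint: compare bytes to rule out a collision.
        for bid in candidates:
            if self.blocks[bid] == data:
                break
        else:
            bid = self.next_id
            self.next_id += 1
            self.blocks[bid] = data
            self.refs[bid] = 0
            candidates.append(bid)
        self.refs[bid] += 1
        # Release the old block only after taking the new reference, so that
        # rewriting identical data does not free and re-create the block.
        self._release(addr)
        self.table[addr] = bid

    def read(self, addr):
        return self.blocks[self.table[addr]]
```

Because a write always goes through the lookup and stored blocks are
never edited, modifying a block that other addresses still point to
simply gives the writer a different block and leaves the shared one
untouched, which has the same effect as copying it first. A real
filesystem would also need on-disk layout, crash safety and a much
more compact index, none of which is attempted here.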
I know I'm oversimplifying things a lot, but something like this could
work, no? It would be a great filesystem to store backups on, or
things like VMware volumes...
Russ
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos
You are describing what I understand to be 'data de-duplication'. It
is all the rage for backups as it has the potential to decrease backup
times and volumes by significant amounts. I went to a presentation by
Avamar (a partner of EMC?) regarding this technology and it seemed
really nice for your typical Windows file server. I suppose it
effectively turns your data into 'single-instance' storage, which is
no bad thing. I suppose it could be useful for large database backups
as well.
You'd think that using this technology on a live filesystem could
incur a significant performance penalty due to all those hash
calculations (FUSE module, anyone?). Imagine a hardware-optimized data
de-duplication disk controller, similar to RAID XOR-optimized CPUs.
Now that would be cool. All it would need to store is metadata when
it has already seen the exact same block. I think fundamentally it is
similar in result to on-the-fly disk compression.
Let us know when you have a beta to test!
8^)
I'm not sure this would be possible to implement on a disk
controller, as I don't think a controller can hold the amount of data
needed for the hashes. I am thinking of maybe making it as a
FUSE module. I'm most familiar with Java, and there are FUSE bindings
for Java. I would love to make at least a proof-of-concept FS that
does this. Does FUSE exist for Windows? How does one test a FUSE
module while developing it?
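
To put a rough number on how much data the hashes alone would take,
here is a back-of-the-envelope calculation in Python. The 1 TiB volume
and the 4 KiB and 512-byte block sizes are assumptions picked for
illustration, not figures from this thread, and hash_index_bytes is a
hypothetical helper. It counts only the 16-byte MD5 digests in the
worst case where every block is unique, and ignores pointers and
reference counts.

```python
MD5_DIGEST_BYTES = 16  # an MD5 digest is 128 bits

TiB = 2**40
GiB = 2**30


def hash_index_bytes(volume_bytes, block_bytes, digest_bytes=MD5_DIGEST_BYTES):
    """Bytes of digests needed if every block on the volume is unique."""
    blocks = -(-volume_bytes // block_bytes)  # ceiling division
    return blocks * digest_bytes


# Assumed example sizes: a 1 TiB volume with 4 KiB and 512-byte blocks.
print(hash_index_bytes(1 * TiB, 4096) / GiB)  # 4.0
print(hash_index_bytes(1 * TiB, 512) / GiB)   # 32.0
```

So under these assumptions the digests alone come to about 4 GiB per
TiB with 4 KiB blocks, and eight times that with 512-byte blocks.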
Russ