On Fri, 5 Oct 2007, Steven Whitehouse wrote:
I stumbled upon an old document from back in 2000 (before Red Hat acquired
Sistina), and they were talking about a number of features for the "next
version", including shadowing/copy-on-write.
The two features I am particularly interested in are:
1) Compression
I consider this to be important both for performance reasons and because,
no matter how cheap disks get, storage capacity always costs something.
Performance-wise, at some point I/O becomes the bottleneck. Not
necessarily the disk I/O but network I/O of the SAN, especially when all
the nodes in the cluster are sharing the same SAN bandwidth. At that
point, reducing the data volume through compression becomes a performance
win. This point isn't all that difficult to reach even on a small cluster
on Gigabit Ethernet.
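To make the argument concrete, here is a back-of-envelope sketch (all numbers are illustrative, not measurements from any real cluster): once the shared link is the bottleneck, sending compressed blocks raises the effective data rate by the compression ratio, up to the point where compression speed itself becomes the limit.

```python
# Back-of-envelope: effective throughput over a shared SAN link when
# blocks are compressed before transfer. All figures are illustrative.

def effective_throughput(link_mbps, compression_ratio, compress_mbps):
    """Logical data rate seen by the application when blocks are
    compressed at compress_mbps and sent over a link of link_mbps.
    compression_ratio = original_size / compressed_size."""
    wire_rate = link_mbps * compression_ratio   # logical bytes per wire second
    # The pipeline runs at the slower of compression and the wire.
    return min(wire_rate, compress_mbps)

# Gigabit Ethernet (~125 MB/s) shared by 4 nodes -> ~31 MB/s each.
per_node_link = 125 / 4
plain = per_node_link                                  # no compression
compressed = effective_throughput(per_node_link, 2.0, 400)
```

With a modest 2:1 ratio and a compressor much faster than the per-node share of the wire, effective throughput doubles.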
There are really two issues here rather than one:
1. Compression of data
Has, as a prerequisite, "allocate on flush", since we would really need
"compress on flush" to make this a viable option. We'd also need hints
as to what kind of data we are looking at in order to make it
worthwhile. We'd have to look at crypto too, since you can't compress
encrypted data; if both are required, the compression must come first.
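The crypto point is easy to demonstrate. A rough sketch using zlib, with os.urandom standing in for ciphertext (a good cipher's output is indistinguishable from random bytes):

```python
import os
import zlib

# Repetitive, text-like data compresses very well...
plain = b"the quick brown fox jumps over the lazy dog\n" * 1500
ratio_plain = len(plain) / len(zlib.compress(plain))

# ...but high-entropy data -- a stand-in for ciphertext here -- does not
# compress at all; zlib can only store it with a little framing overhead.
ciphertext_like = os.urandom(len(plain))
ratio_cipher = len(ciphertext_like) / len(zlib.compress(ciphertext_like))
```

Hence compress-then-encrypt is the only ordering that can ever save space.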
Sure, but this is hardly a difficult problem. It could be based on any of:
1) file extension, perhaps listed somewhere in the /etc directory, and
only read on boot-up, or even provided as a comma-separated list to the
module/kernel at load-time (a file would be nicer, though).
2) Completely transparently based on a similar heuristic to what Reiser4
uses. For each file, try to compress the first 64KB. If it yields a
reasonable result, compress the rest, otherwise, flag as uncompressed
and don't bother. The user could override this by the appropriate chattr
command.
3) Leave it entirely up to the user - just inherit compression flag from
the parent directory. If the user says to compress, then don't question
it.
Option 3) would be the simplest, and probably the most useful. The only
time a block should be left uncompressed is when compressing it would
make it bigger.
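Both pieces - the Reiser4-style 64KB probe from option 2) and the "never store a block that got bigger" rule - are only a few lines each. A toy sketch (the 10% threshold is my own assumption, not anything Reiser4 actually uses):

```python
import os
import zlib

PROBE = 64 * 1024       # sample size for the heuristic
THRESHOLD = 0.9         # "reasonable result": the probe saves at least 10%

def looks_compressible(data: bytes) -> bool:
    """Compress the first 64KB; if that shrinks enough, assume the
    rest of the file is worth compressing too."""
    probe = data[:PROBE]
    return len(zlib.compress(probe)) < THRESHOLD * len(probe)

def store_block(block: bytes):
    """Per-block fallback: keep the compressed form only if it is
    actually smaller, otherwise store the block as-is."""
    packed = zlib.compress(block)
    return (True, packed) if len(packed) < len(block) else (False, block)

text_file = b"the same log line, over and over again\n" * 3000
random_file = os.urandom(128 * 1024)   # stand-in for already-compressed data
```

A JPEG or a tarball fails the probe and is flagged uncompressed up front; a log file passes it and then gets the per-block check as a safety net.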
2. Compression of metadata
This might well be worth looking into. There is a considerable amount
of redundancy in typical fs metadata, and we ought to be able to reduce
the number of blocks we have to read/write in order to complete an
operation in this way. Using extents for example could be considered a
form of metadata compression. The main problem is that our "cache line"
if you like in GFS(2) is one disk block, so that sharing between nodes
is a problem (hence the one inode per block rule we have at the moment).
We'd need to address the metadata migration issue first.
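The extents-as-metadata-compression point is easy to illustrate: an extent map is run-length encoding of the block map. A toy sketch, with no relation to any real GFS2 on-disk layout:

```python
def to_extents(blocks):
    """Collapse an ordered list of physical block numbers into
    (start, length) extents -- run-length encoding of a block map."""
    extents = []
    for b in blocks:
        if extents and b == extents[-1][0] + extents[-1][1]:
            # Block continues the current run: extend the last extent.
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)
        else:
            # Gap in the allocation: start a new extent.
            extents.append((b, 1))
    return extents
```

A contiguously allocated 1000-block file collapses from 1000 pointers to a single (start, length) pair; fragmentation is what eats the savings.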
I'm not sure I understand what the problem is here. How is caching a
problem any more than it would otherwise be - considering we have multiple
nodes doing r/w ops on the same FS?
Neither of the above is likely to happen soon though as they both
require on-disk format changes.
Compatibility is already broken between GFS1 and GFS2, so I don't see this
as an issue. The FS will get mounted with whatever parameters it was
created with - and a new FS can be created with compression enabled.
2) Shadowing/Copy-On-Write File Versioning
Backups serve two purposes: retrieving files lost or corrupted through
user error, and recovering files lost or corrupted through disk failure.
High levels of RAID alleviate the need for backups for the latter reason,
but they do nothing against user-error-caused damage. At the same time,
SANs can get big - I don't see hundreds of TB as an inconceivable size.
At that size, backups become an issue. Thus, a feature providing file
versioning is important.
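The core of copy-on-write versioning fits in a few lines: a write never overwrites in place, it allocates a fresh block and publishes a new block map, while old maps keep pointing at the old blocks. A toy in-memory sketch, purely illustrative and unrelated to any actual GFS2 design:

```python
class CowFile:
    """Toy copy-on-write versioned file: each write produces a new
    version; every old version remains readable."""

    def __init__(self):
        self.blocks = {}      # physical block id -> data
        self.next_id = 0
        self.versions = [{}]  # per-version maps: logical -> physical id

    def write(self, logical, data):
        # Allocate a fresh physical block; never touch existing ones.
        pid = self.next_id
        self.next_id += 1
        self.blocks[pid] = data
        # Publish a new version map that shares unchanged entries.
        newmap = dict(self.versions[-1])
        newmap[logical] = pid
        self.versions.append(newmap)

    def read(self, logical, version=-1):
        pid = self.versions[version].get(logical)
        return self.blocks.get(pid)
```

Recovering from user error is then just reading through an older version map, with no restore from tape needed.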
In turn, 2) increases the volume of data, which increases the need for 1).
Are either of these two features planned for GFS in the near future?
This also requires on-disk format changes,
I don't remember implying that it wouldn't. But at the same time, why
would this be a problem? It's not like it means that people won't be able
to mount their GFS2 FS as they can now. And it's not like GFS2 works at
the moment anyway - not with the latest packaged releases on any of the
spawns of RH (Fedora, CentOS, etc.)! :-p
but I agree that it would be
a nice thing to do. It's very much an open question in my mind, though,
as to what a suitable scheme would be. We also have an ever-increasing
patent minefield to walk through here, I suspect.
I very much doubt it. There are several OSS non-cluster FSs that provide
copy-on-write file versioning, and the technique has been in use since
the days of VMS - long enough ago that any patents would have long since
expired.
Potentially it would be possible to address both of the above
suggestions (minus the metadata compression) by using a stacking
filesystem. That would potentially be more flexible, since it would
introduce the features on all filesystems, not just GFS(2),
Can you explain what you mean by stackable? I would have thought that
having a stacked file system on top of GFS would break GFS' ability to
function correctly in a clustered environment (not to mention introduce
unnecessary overheads).
Gordan
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster