At 10:55 AM 1/21/2009, Stas Oskin wrote:

>This the bit I don't understand - shouldn't the Lustre nodes sync
>the data between themselves? If there is a shared storage device
>needed on some medium, what then the Lustre storage nodes actually do?
>
>I mean, what is the idea of Lustre being cluster system, if it
>requires a central shared storage device?

Clustering and high availability are NOT the same thing. People often confuse them, but they're completely different concepts.

"Clustering" is simply the concept of grouping a bunch of things into a "cluster" so that they can interact with each other in a desired fashion. So a cluster filesystem is simply one that can be managed by various nodes in the cluster. This CAN mean that all nodes have read/write access to the filesystem, but it doesn't HAVE to mean that. It CAN mean that multiple nodes participate in a single filesystem (which seems to be the case in Lustre)... in other words, a distributed filesystem where some files live on some nodes and other files live on other nodes is often defined as a "cluster filesystem".

However, this has no implication of redundancy in the data. Generally a "High-Availability" filesystem will add features like replication/redundancy in order to ensure that the filesystem survives a failure of some kind.

Gluster combines both of these and can be implemented as an HA clustered filesystem. Without the HA translator, however, Gluster (and Lustre) are "cluster" filesystems, but you may run into a problem if there's a node failure. Depending on your application this may be acceptable; if it's not, then you have to add HA features, realizing, of course, that HA features come at a cost (performance, disk, CPU, etc.).

>But the shared block device may be a distributed mirrored block device
>(like DRBD) which mirrors each data block as it is written to its peer
>node. In such a configuration the data is actually stored on both nodes in
>the failover pair. My guess is that this is not a common configuration for
>production use.
>
>
>AFAIK such config could be achieved without Lustre at all - just
>with 2 severs acting as storage nodes. This of course would make an
>active-passive mode, and waste 50% of the resources.

However, what you don't get in that environment is read/write access to the filesystem from both nodes. You'd get HA in that if one node failed, the other node could then mount the block device/filesystem and continue working, but you wouldn't be able to write directly to the block devices on both nodes at the same time. Writes would happen through Lustre, which would most likely have one node in "control" of both the local and remote block devices, so the remote node couldn't write directly to them and would instead send its writes over the network to the control node, which would do the actual writing to the block devices.

The way most shared storage volume managers work is: they put a system ID tag somewhere on the physical device. Other nodes read this to see who "owns" the block device. The owner periodically updates a timestamp on the device. If this timestamp doesn't change in x number of cycles, then another node may take ownership: it overwrites the ID tag, waits, and checks to see if another node did the same thing. If not, it can now claim ownership, and it then manages all writes to the physical device. This is because, traditionally, filesystems used memory for file locking (and some caching), since they never had to worry about some other machine modifying the filesystem.
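Purely as illustration, that ID-tag / heartbeat / take-over dance might look something like the toy Python sketch below. None of this is taken from any real volume manager's on-disk format -- the device path, OWNER_OFFSET, the record layout and the function names are all invented for the example:

    import os, struct, time

    DEV = "/dev/sdb"            # hypothetical shared block device
    OWNER_OFFSET = 4096         # made-up spot where the owner tag + timestamp live
    HEARTBEAT = 5               # seconds between owner timestamp updates
    MISSED_CYCLES = 3           # the "x number of cycles" before a takeover attempt

    def read_tag(dev):
        # read (owner id, last heartbeat timestamp) from the reserved area
        with open(dev, "rb") as f:
            f.seek(OWNER_OFFSET)
            owner, ts = struct.unpack("16sQ", f.read(24))
        return owner.rstrip(b"\0").decode(), ts

    def write_tag(dev, owner):
        # stamp our id and the current time onto the device
        with open(dev, "r+b") as f:
            f.seek(OWNER_OFFSET)
            f.write(struct.pack("16sQ", owner.encode(), int(time.time())))
            f.flush()
            os.fsync(f.fileno())

    def try_claim(dev, my_id):
        owner, ts = read_tag(dev)
        if owner == my_id:
            return True                               # we already own it
        if time.time() - ts < HEARTBEAT * MISSED_CYCLES:
            return False                              # owner is still heartbeating
        # owner looks dead: overwrite the tag, wait, then make sure
        # no other node grabbed it in the same window
        write_tag(dev, my_id)
        time.sleep(HEARTBEAT * 2)
        owner, _ = read_tag(dev)
        return owner == my_id

The owner just keeps calling write_tag() every HEARTBEAT seconds, and every other node periodically calls try_claim() and only touches the device if it returns True.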
Because that locking lived only in one machine's memory, the volume manager needed to ensure that only one machine would have access to the volumes/filesystems, or you would have severe data integrity issues.

Then along came cluster-aware filesystems such as OCFS2. These eliminated filesystem caching and moved the locking from memory to the filesystem itself. Now you could have 2 nodes physically accessing the same disk device, because they could tell if some other machine had a read lock on a file or block of data. These filesystems, again, have lower performance because your file locking is happening at disk speed instead of memory speed, and you totally lose any benefit of caching, since you have to ensure data is finished writing to the physical media before releasing a lock (the sketch below tries to make this concrete).

The volume managers would do various levels of RAID. Since the clustered filesystems want to interact directly with the disk devices, the cluster-aware volume managers don't always work (most still want only one machine to control the device at any given time). So we've had to wait for these filesystems to include HA (RAID/mirroring, etc.) so that we could survive disk failures.

Ideally, we would have a cluster-aware volume manager which handled the RAID issues and could manage remote physical devices, and on top of that cluster-aware filesystems (like OCFS2) which would give us distributed access to the filesystem along with the conveniences of a volume manager. I think we're years away from having anything like that, but it'd be ideal.
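Here's that sketch: a toy comparison (Python again, with made-up offsets and names -- this is not how OCFS2 actually stores its DLM state) of a lock kept in memory versus a lock kept on the shared device:

    import os, struct, threading, time

    # In-memory locking: what a traditional single-node filesystem can get away
    # with, since no other machine will ever touch the data behind its back.
    mem_lock = threading.Lock()

    def write_local(path, data):
        with mem_lock:                    # acquire/release at memory speed
            with open(path, "r+b") as f:
                f.write(data)

    # On-disk locking: a cluster-aware filesystem has to record the lock where
    # every node can see it, and the data has to reach the media before the
    # lock can be released.
    DEV = "/dev/sdb"           # hypothetical shared device
    LOCK_OFFSET = 8192         # made-up location of a lock record

    def disk_lock(dev, node_id):
        # NOTE: a real implementation needs an atomic test-and-set (SCSI
        # reservations, a DLM, ...); this loop only illustrates the I/O cost.
        with open(dev, "r+b") as f:
            while True:
                f.seek(LOCK_OFFSET)
                holder = struct.unpack("Q", f.read(8))[0]
                if holder == 0:                      # 0 = unlocked in this toy layout
                    f.seek(LOCK_OFFSET)
                    f.write(struct.pack("Q", node_id))
                    f.flush(); os.fsync(f.fileno())  # every step is a disk round trip
                    return
                time.sleep(0.01)                     # wait for the other node to release

    def disk_unlock(dev):
        with open(dev, "r+b") as f:
            f.seek(LOCK_OFFSET)
            f.write(struct.pack("Q", 0))
            f.flush(); os.fsync(f.fileno())

    def write_shared(dev, offset, data, node_id):
        disk_lock(dev, node_id)
        try:
            with open(dev, "r+b") as f:
                f.seek(offset)
                f.write(data)
                f.flush(); os.fsync(f.fileno())      # data must hit the media before unlock
        finally:
            disk_unlock(dev)

Every lock and unlock in the second path is at least one read plus one synchronous write to the shared device, and the payload write has to be flushed before the lock is released -- which is exactly where the performance and caching penalty described above comes from.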