Well, the idea is ugly, but I think it works.

The main problem with inode numbers is that we have no control at all (with POSIX's open() anyway) over what the inode number of a file will be. That is the problem. Amar proposed a distributed namespace cache algorithm a few months ago that, in my humble opinion, fails because of that. You cannot take a union of several 64-bit spaces and expect to fit all of them into a single 64-bit space. Meaning: you cannot have a distributed namespace cache for the inode numbers using POSIX filesystems to store those files, because inode numbers can be anything between 1 and (2**64 - 1) on each of them, and a union of two filesystems will have inode number collisions.

The obvious solution is not to use POSIX filesystems to store the namespace cache. Just that. Everything below follows easily from it.

We'll store any information the namespace brick needs in a database format made specifically for that end. We can use a modified version of ext3 or xfs or reiser, or make glusterfs's own. Think of it as a translator that has open(), close(), getxattr(), flock(), fcntl() and anything else necessary for each file's metadata, but always returns 0 on read() and write().

That database format will have a restricted inode number space (say 48 bits). To do the distributed magic we'll use the way IP addresses work. The first 16 bits of an inode number will be the namespace brick ID (maybe generated by the client each time glusterfs is mounted). The last 48 bits will be the pre-inode number given by the namespace brick. The namespace brick doesn't know what brick ID each client gave to it.

Like unify, when open()ing a file, we look at all the namespace bricks to see which one has the file's metadata. It will return a 48-bit pre-inode number. To get the file's external inode number, just concatenate the two fields (shift the brick ID into the top 16 bits and OR in the pre-inode number):

INODE NUMBER = [BRICK ID - 16 bits][PRE-INODE NUMBER - 48 bits]

The bit split should be evaluated more carefully; maybe 10 / 54 or 9 / 55... So, for any operation that carries an inode number, it is really fast to determine which namespace cache brick to send the message to.
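To make it concrete, here is a minimal sketch in C of the composition I have in mind, assuming the 16 / 48 split; the function names are just placeholders, nothing from the current code base:

#include <stdint.h>

/* Compose the external 64-bit inode number from the 16-bit namespace
   brick ID (top bits) and the 48-bit pre-inode number returned by
   that brick (low bits). */
static inline uint64_t
external_ino (uint16_t brick_id, uint64_t pre_ino)
{
        return ((uint64_t) brick_id << 48) | (pre_ino & 0xFFFFFFFFFFFFULL);
}

/* Recover the brick ID, i.e. which namespace brick to route the
   request to. */
static inline uint16_t
ino_to_brick (uint64_t ino)
{
        return (uint16_t) (ino >> 48);
}

/* Recover the pre-inode number that brick actually knows about. */
static inline uint64_t
ino_to_pre_ino (uint64_t ino)
{
        return ino & 0xFFFFFFFFFFFFULL;
}

If we later prefer a 10 / 54 or 9 / 55 split, only the shift amount and the mask change.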
I think we wouldn't be able to use AFR here, so the redundancy would have to be implemented specifically for that kind of translator. First, we have to make sure that created files get the same inode number on all the replicated bricks. For healing there are two options:

1 - file by file self-heal, using the currently implemented algorithms
2 - database dump and restore when the brick gets back online

In the end, the amount of data copied would be almost the same; the first takes more time to complete and leaves the system vulnerable to data loss for longer, while the second blocks file creation during the healing phase and the brick would have to be aware of the replication scheme.

So, the final features are:

- it is distributed and can scale to any performance need
- it doesn't limit the number of files gluster can handle; it can still handle 2**64 files (that's the Linux kernel's limit)
- it allows replication and has no single point of failure

I think it is ugly, but do you think it could work?

Best regards,
Daniel

On Dec 6, 2007 3:56 PM, Anand Avati <avati@xxxxxxxxxxxxx> wrote:
> > I've been a little bit out of GlusterFS lately but, what about the issue
> > with inode numbers changing when the first server (in the AFR system) goes
> > out, making fuse crazy? How are things going with the distributed
> > namespace cache? I had an idea about this, it is ugly but fixes the
> > problem if it hasn't been fixed already.
>
> Currently we use inode generation based workarounds. I'm interested in the
> idea :)
>
> avati