Well, the idea is ugly, but I think it works.

The main problem with inode numbers is that we have no control at all (with POSIX's open() anyway) over what the inode number of a file will be. That is the problem. Amar proposed a distributed namespace cache algorithm a few months ago that, in my humble opinion, fails because of that. You cannot take a union of several 64-bit spaces and expect to fit all of them into a single 64-bit space. Meaning: you cannot have a distributed namespace cache for the inode numbers using POSIX filesystems to store those files, because inode numbers can be anything between 1 and (2**64 - 1) on each of them, and a union of two filesystems will have inode number collisions.

The obvious solution is not to use POSIX filesystems to store the namespace cache. Just that. Everything below follows easily from it.

We'll store any information the namespace brick needs in a database format made specifically for that end. We can use a modified version of ext3 or xfs or reiser, or make glusterfs's own. Think of it as a translator that has open(), close(), getxattr(), flock(), fcntl() and anything else necessary for each file's metadata, but always returns 0 on read() and write().

That database format will have a restricted inode number space (say 48 bits). To do the distributed magic we'll use the way IP addresses work. The first 16 bits of an inode number will be the namespace brick ID (maybe generated by the client each time glusterfs is mounted). The last 48 bits will be the pre-inode number given by the namespace brick. The namespace brick doesn't know what brick ID each client gave to it.

Like unify, when open()ing a file, we look at all the namespace bricks to see which one has the file's metadata. It will return a 48-bit pre-inode number. To get the file's external inode number, just concatenate the two fields (shift the brick ID into the top 16 bits and OR in the pre-inode number):

INODE NUMBER = [BRICK ID - 16 bits][PRE-INODE NUMBER - 48 bits]

The bit split should be evaluated more carefully; maybe 10 / 54 or 9 / 55... So, for any operation that carries an inode number, it is really fast to determine which namespace cache brick to send the message to.
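To make it concrete, here is a minimal sketch in C of the composition I have in mind, assuming the 16 / 48 split; the function names are just placeholders, nothing from the current code base:

#include <stdint.h>

/* Compose the external 64-bit inode number from the 16-bit namespace
   brick ID (top bits) and the 48-bit pre-inode number returned by
   that brick (low bits). */
static inline uint64_t
external_ino (uint16_t brick_id, uint64_t pre_ino)
{
        return ((uint64_t) brick_id << 48) | (pre_ino & 0xFFFFFFFFFFFFULL);
}

/* Recover the brick ID, i.e. which namespace brick to route the
   request to. */
static inline uint16_t
ino_to_brick (uint64_t ino)
{
        return (uint16_t) (ino >> 48);
}

/* Recover the pre-inode number that brick actually knows about. */
static inline uint64_t
ino_to_pre_ino (uint64_t ino)
{
        return ino & 0xFFFFFFFFFFFFULL;
}

If we later prefer a 10 / 54 or 9 / 55 split, only the shift amount and the mask change.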
I think we wouldn't be able to use AFR here, so the redundancy would have to be implemented specifically for that kind of translator. First, we have to make sure that created files get the same inode number on all the replicated bricks. For healing there are two options:

1 - file by file self-heal, using the currently implemented algorithms
2 - database dump and restore when the brick gets back online

In the end, the amount of data copied would be almost the same; the first takes more time to complete and leaves the system vulnerable to data loss for longer, while the second blocks file creation during the healing phase and the brick would have to be aware of the replication scheme.

So, the final features are:

- it is distributed and can scale to any performance need
- it doesn't limit the number of files gluster can handle; it can still handle 2**64 files (that's the Linux kernel's limit)
- it allows replication and has no single point of failure

I think it is ugly, but do you think it could work?

Best regards,
Daniel

On Dec 6, 2007 3:56 PM, Anand Avati <avati@xxxxxxxxxxxxx> wrote:
> > I've been a little bit out of GlusterFS lately but, what about the issue
> > with inode numbers changing when the first server (in the AFR system) goes
> > out, making fuse crazy? How are things going with the distributed
> > namespace cache? I had an idea about this, it is ugly but fixes the
> > problem if it hasn't been fixed already.
>
> Currently we use inode generation based workarounds. I'm interested in the
> idea :)
>
> avati