Re: parent xattrs on file objects

Hi Greg,

In this case where an inode is created on mds.a and exported to mds.b, there is a potential race on mds.b between a subsequent lookup-by-ino and the primary link actually making it into the inode container.

Our tentative solution was to rely on the way InoTable partitions the range of inode numbers by mds nodeid. When a lookup on the inode container fails, we can determine which mds would have allocated that inode number and attempt to find the inode there. The originating mds.a should always find the inode in its cache while it's pinned for export. If mds.a finds the inode, the lookup-by-ino on mds.b waits for the import to finish; otherwise it returns failure.
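
To make the fallback concrete, here is a rough sketch in Python (not the actual MDS code). It assumes each mds rank owns one contiguous block of 2^40 inode numbers starting at (rank + 1) << 40; the real InoTable layout may differ. The names allocating_rank, lookup_by_ino, and LookupResult are made up for illustration, and the container and per-mds caches are modeled as plain sets.

```python
from enum import Enum

# Assumption: each rank owns a contiguous range of 2**40 inode numbers,
# starting at (rank + 1) << 40. The real InoTable layout may differ.
RANGE_BITS = 40


class LookupResult(Enum):
    FOUND = "found"                      # primary link is in the inode container
    WAIT_FOR_IMPORT = "wait_for_import"  # origin mds still has it pinned for export
    NOT_FOUND = "not_found"              # no such inode anywhere we can check


def allocating_rank(ino):
    """Return the rank whose range contains ino, or None if ino is below all ranges."""
    rank = (ino >> RANGE_BITS) - 1
    return rank if rank >= 0 else None


def lookup_by_ino(ino, container, caches):
    """Hypothetical lookup-by-ino with fallback to the allocating mds.

    container: set of inos whose primary link has reached the inode container
    caches:    dict mapping rank -> set of inos cached (pinned) on that mds
    """
    if ino in container:
        return LookupResult.FOUND
    origin = allocating_rank(ino)
    if origin is not None and ino in caches.get(origin, set()):
        # The export is still in flight, so the caller should wait for the import.
        return LookupResult.WAIT_FOR_IMPORT
    return LookupResult.NOT_FOUND


if __name__ == "__main__":
    in_flight = (1 << RANGE_BITS) + 5    # allocated by rank 0 under the assumption above
    print(lookup_by_ino(in_flight, container=set(), caches={0: {in_flight}}))
```

In the real system the check against the originating mds would be a message to mds.a rather than a local set lookup; the sketch only shows the decision logic.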

Casey

----- Original Message -----
From: "Gregory Farnum" <greg@xxxxxxxxxxx>
To: "Casey Bodley" <casey@xxxxxxxxxxxx>
Cc: "Matt W. Benjamin" <matt@xxxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "aemerson" <aemerson@xxxxxxxxxxxx>, "peter honeyman" <peter.honeyman@xxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxx>
Sent: Wednesday, October 17, 2012 4:18:04 PM
Subject: Re: parent xattrs on file objects

On Wed, Oct 17, 2012 at 12:40 PM, Casey Bodley <casey@xxxxxxxxxxxx> wrote:
> To expand on what Matt said, we're also trying to address this issue of lookups by inode number for use with NFS.
>
> The design we've been exploring is to create a single system inode, designated the 'inode container' directory, which stores the primary links to all inodes in the filesystem. These links are named by their inode number to satisfy lookups and obviate the need for an anchor table. This design allows the inode container to make use of existing directory fragmentation and load balancing to distribute the inodes over the MDS cluster.
>
> When a new file is created, two links are added: a primary link in the inode container, and a remote link in the filesystem namespace. In the case where the parent directory fragment's authority differs from the corresponding inode container fragment's, the inode is created in the parent directory and then exported to the inode container via an asynchronous slave request.
>
> We welcome additional discussion, both on this design specifically and on the general topic of scalable ino lookups.

So if the primary link isn't always in the "inode container", you must
be preserving the anchor table for this setup. Am I understanding that
correctly? Or is there some other mechanism for linking them that's
less expensive?
-Greg