Re: parent xattrs on file objects

Gregory Farnum <greg@xxxxxxxxxxx> · Wed, 17 Oct 2012 15:04:23 -0700

I still don't get it. Putting every inode's primary link in a lookup
directory and then patching the lookup code to go there makes sense to
me. But if you have to go the other way (from the inode directory's
secondary link to some other location as the primary link), you need
an up-to-date path for that primary link, right? How do you handle it
when the path changes? Do you have a two-phase commit on the lookup
directory attributes?

On Wed, Oct 17, 2012 at 2:51 PM, Casey Bodley <casey@xxxxxxxxxxxx> wrote:
> Hi Greg,
>
> In the case where an inode is created on mds.a and exported to mds.b, there is a potential race on mds.b: a subsequent lookup-by-ino can arrive before the primary link has actually made it into the inode container.
>
> Our tentative solution was to rely on the way InoTable breaks up the range of inode numbers based on mds nodeid. So when a lookup on the inode container fails, we can determine which mds would have allocated that inode number and attempt to find the inode there. The originating mds.a should always find the inode in its cache while it's pinned for export. If the inode is found on mds.a, the lookup-by-ino on mds.b waits for the import to finish; otherwise it returns failure.
>
> Casey
>
> ----- Original Message -----
> From: "Gregory Farnum" <greg@xxxxxxxxxxx>
> To: "Casey Bodley" <casey@xxxxxxxxxxxx>
> Cc: "Matt W. Benjamin" <matt@xxxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "aemerson" <aemerson@xxxxxxxxxxxx>, "peter honeyman" <peter.honeyman@xxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxx>
> Sent: Wednesday, October 17, 2012 4:18:04 PM
> Subject: Re: parent xattrs on file objects
>
> On Wed, Oct 17, 2012 at 12:40 PM, Casey Bodley <casey@xxxxxxxxxxxx> wrote:
>> To expand on what Matt said, we're also trying to address this issue of lookups by inode number for use with NFS.
>>
>> The design we've been exploring is to create a single system inode, designated the 'inode container' directory, which stores the primary links to all inodes in the filesystem. These links are named by their inode number to satisfy lookups and obviate the need for an anchor table. This design allows the inode container to make use of existing directory fragmentation and load balancing to distribute the inodes over the MDS cluster.
>>
>> Creating a new file then adds two links: a primary link into the inode container, and a remote link into the filesystem namespace. In the case where the parent directory fragment's authority is different from the corresponding inode container fragment's, the inode is created in the parent directory and then exported to the inode container via an asynchronous slave request.
>>
>> We welcome additional discussion, both on this design specifically and on the general topic of scalable ino lookups.
>
> So if the primary link isn't always in the "inode container", you must
> be preserving the anchor table for this setup. Am I understanding that
> correctly? Or is there some other mechanism for linking them that's
> less expensive?
> -Greg
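
A minimal sketch of the lookup-by-ino fallback Casey describes above, written in Python rather than as MDS code. Every name here (RANGE_BITS, allocating_rank, Mds, lookup_by_ino) is hypothetical, and the assumption that each rank owns one contiguous block of inode numbers only stands in for whatever partitioning InoTable actually performs.

```python
from dataclasses import dataclass, field
from enum import Enum

# Assumption: each MDS rank allocates inode numbers from its own contiguous
# block. The block width is illustrative and not taken from the thread.
RANGE_BITS = 40


def allocating_rank(ino: int) -> int:
    """Return the rank whose block contains ino (hypothetical layout)."""
    return (ino >> RANGE_BITS) - 1


class Result(Enum):
    FOUND = "found"
    WAIT_FOR_IMPORT = "wait_for_import"
    NOT_FOUND = "not_found"


@dataclass
class Mds:
    rank: int
    # Primary links held in this MDS's inode container fragment(s).
    container: set = field(default_factory=set)
    # Inodes pinned in cache while their export is still in flight.
    pinned_for_export: set = field(default_factory=set)


def lookup_by_ino(cluster: dict, auth_rank: int, ino: int) -> Result:
    """Look up ino on the MDS authoritative for its container fragment."""
    auth = cluster[auth_rank]
    if ino in auth.container:
        return Result.FOUND
    # Container miss: ask the MDS that would have allocated this number.
    origin = cluster.get(allocating_rank(ino))
    if origin is not None and ino in origin.pinned_for_export:
        # The primary link is on its way, so the caller should wait.
        return Result.WAIT_FOR_IMPORT
    return Result.NOT_FOUND


if __name__ == "__main__":
    a, b = Mds(rank=0), Mds(rank=1)
    cluster = {0: a, 1: b}
    ino = (1 << RANGE_BITS) + 5          # falls in rank 0's block
    a.pinned_for_export.add(ino)         # created on mds.a, export pending
    print(lookup_by_ino(cluster, 1, ino))
    a.pinned_for_export.discard(ino)     # import on mds.b completes
    b.container.add(ino)
    print(lookup_by_ino(cluster, 1, ino))
```

The sketch only covers the create-then-export race. It does not address the question at the top of this message: how a lookup that starts from the container reaches a primary link that lives elsewhere once its path changes.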