Re: Consistency vs efficiency

On Tue, 26 Jul 2011, Jojy Varghese wrote:
> Sage,
> Will get back to you with the logs.
> 
> Had another question about the implementation:
> 
> 
> Here is the piece of code that I am a bit confused about:
> 
> In "fs/ceph/inode.c" ( function "ceph_fill_trace")
> 
> 		/* do we have a lease on the whole dir? */
> 		have_dir_cap =
> 			(le32_to_cpu(rinfo->diri.in->cap.caps) &
> 			 CEPH_CAP_FILE_SHARED);
> 
> 		/* do we have a dn lease? */
> 		have_lease = have_dir_cap ||
> 			(le16_to_cpu(rinfo->dlease->mask) &
> 			 CEPH_LOCK_DN);
> 
> So we check the capability "CEPH_CAP_FILE_SHARED" to make sure the
> entire directory has the lease.  "have_lease" is then used to
> determine if the dentry can be cached.

Right.
 
> I would have thought that "CEPH_CAP_FILE_EXCL" should be used to
> determine whether a dentry can be cached, since it would mean that
> the client created the file and has "update" capabilities.

There are two cases we care about: FILE_SHARED and FILE_SHARED|FILE_EXCL.  
In the former case, the tree is static, and all clients can cache the 
dentry.  In the latter case, a single client is using the directory and 
can still cache it.
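
To make that concrete -- this is just an illustration, not the kernel
code, and the bit values below are made-up stand-ins for the real
CEPH_CAP_FILE_* flags in include/linux/ceph/ceph_fs.h -- both grants
pass the same FILE_SHARED test, which is why the single check above is
enough:

	#include <assert.h>

	/* toy stand-ins for CEPH_CAP_FILE_SHARED / CEPH_CAP_FILE_EXCL */
	#define TOY_FILE_SHARED 0x1
	#define TOY_FILE_EXCL   0x2

	int main(void)
	{
		unsigned shared_only = TOY_FILE_SHARED;                  /* static dir */
		unsigned shared_excl = TOY_FILE_SHARED | TOY_FILE_EXCL;  /* sole user */

		/* both grants carry the shared bit, so one test covers both cases */
		assert(shared_only & TOY_FILE_SHARED);
		assert(shared_excl & TOY_FILE_SHARED);
		return 0;
	}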

The client doesn't make any namespace modifications locally, however;
those currently always go to the MDS.  In practical terms, this means the
client doesn't have to release its caps to perform a conflicting operation
(i.e., release FILE_EXCL on the directory to create a file within it).  (In
contrast, a client with just FILE_SHARED will release that along with the
create request to avoid an extra exchange for the MDS to revoke it.)
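
If it helps, here is a toy sketch of that difference (none of this is
real client or MDS code; the function and flag names are made up for
illustration):

	#include <stdio.h>

	#define TOY_FILE_SHARED 0x1
	#define TOY_FILE_EXCL   0x2

	/* toy model of a create in a directory, depending on which dir
	 * caps the client currently holds */
	static void create_in_dir(unsigned dir_caps)
	{
		if (dir_caps & TOY_FILE_EXCL)
			printf("send create; keep FILE_EXCL, nothing released\n");
		else if (dir_caps & TOY_FILE_SHARED)
			printf("send create with the FILE_SHARED release piggybacked,"
			       " so the MDS doesn't need a separate revoke\n");
		else
			printf("send create; no dir caps held\n");
	}

	int main(void)
	{
		create_in_dir(TOY_FILE_SHARED | TOY_FILE_EXCL);  /* exclusive user */
		create_in_dir(TOY_FILE_SHARED);                  /* shared-only client */
		return 0;
	}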

BTW, there is a fix that was just pushed for 3.1 that may be affecting the 
lease behavior; see commit b1c9396ee20a3b37804376ba23bd780068d0ccf7 on the
kernel side and 5dc09dd6b81c622960f628acdabda9eac8af1ceb on the server 
side.  That might explain what you're seeing... I forgot about it earlier.

sage



> 
> thanks
> -Jojy
> 
> 
> 
> 
> 
> 
> 
> 
> On Tue, Jul 26, 2011 at 10:56 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Tue, 26 Jul 2011, Jojy Varghese wrote:
> >> Sage
> >>  I tried the simple use case of mkdir on the ceph-mounted dir but
> >> still see the issue, so I am wondering if our setup has anything to do
> >> with it (although ideally it should not). Anything I should be looking
> >> at given this behavior?
> >
> > Can you capture the mds and kernel logs for the simple case?
> >
> > debug mds = 20
> > debug ms = 1
> >
> > and for the kernel side run ceph.git's src/scripts/kcon_most.sh (or
> > similar)
> >
> > Thanks!
> > sage
> >
> >>
> >>
> >>
> >> thx
> >> Jojy
> >>
> >> On Mon, Jul 25, 2011 at 9:12 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> > On Mon, 25 Jul 2011, Jojy Varghese wrote:
> >> >> What I observe is that after a mkdir, the inode caps lose the
> >> >> lease (FILE_SHARED). I would have thought that the owning client should
> >> >> have FILE_EXCL on the files/dirs it creates.
> >> >>
> >> >> Since it doesn't have a lease, the dentry (after splicing) is not cached.
> >> >
> >> > Can you describe the specific sequence of operations you're doing?  I'm
> >> > not seeing this behavior.  I see
> >> >
> >> > $ mkdir foo
> >> >        client->mds lookup #1/foo
> >> >        client->mds mkdir #1/foo
> >> > $ mkdir foo/a
> >> >        client->mds lookup #100000000/a
> >> >        client->mds mkdir #100000000/a
> >> >
> >> > with no repeated lookup on foo.
> >> >
> >> > sage
> >> >
> >> >
> >> >>
> >> >> thanks
> >> >> Jojy
> >> >>
> >> >> On Sat, Jul 23, 2011 at 2:56 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> > On Fri, 22 Jul 2011, Jojy Varghese wrote:
> >> >> >> Not sure how it is designed to work, but I assume that some kind of
> >> >> >> async RPC mechanism exists from the MDSs to the clients to update the
> >> >> >> CAP for a file from "exclusive" to "shared". This will allow the
> >> >> >> cached dentries to be pruned/dropped when another client updates the
> >> >> >> file.
> >> >> >
> >> >> > Right.  If the MDS needs to modify a dentry, it revokes any issued client
> >> >> > leases before granting the write/exclusive lock to process the request.
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >>
> >> >> >> -Jojy
> >> >> >>
> >> >> >> On Fri, Jul 22, 2011 at 8:51 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> >> > On Fri, 22 Jul 2011, Jojy Varghese wrote:
> >> >> >> >> Sage, would the latest patches fix the lookup issue?
> >> >> >> >
> >> >> >> > No, the blocker there is the '[PATCH] vfs: add d_prune dentry operation'
> >> >> >> > email on Jul 8 to linux-fsdevel and lkml.  Once this set goes in (and
> >> >> >> > cleans up a bunch of stuff Al found in a code audit last weekend), I'll be
> >> >> >> > bugging him about it again.
> >> >> >> >
> >> >> >> > sage
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >> On Thu, Jul 21, 2011 at 10:55 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> >> >> > On Thu, 21 Jul 2011, Jojy Varghese wrote:
> >> >> >> >> >> Thanks for the response, Sage. We are using the 2.6.39 kernel, and in the
> >> >> >> >> >> "ceph_lookup" method I see that there is a shortcut for deciding
> >> >> >> >> >> ENOENT, but after the MDS lookup I don't see a d_add. I am sure I am
> >> >> >> >> >> missing something here.
> >> >> >> >> >
> >> >> >> >> >                        dout(" dir %p complete, -ENOENT\n", dir);
> >> >> >> >> >                        d_add(dentry, NULL);
> >> >> >> >> >
> >> >> >> >> > ...but that is only for the negative lookup in a directory with the
> >> >> >> >> > 'complete' flag set.  And it's never set currently because we don't have
> >> >> >> >> > d_prune yet (and the old use of d_release was racy).  So ignore this part
> >> >> >> >> > for now!
> >> >> >> >> >
> >> >> >> >> > You have an existing, unchanging, directory that you're seeing repeated
> >> >> >> >> > lookups on, right?  Like the top-level directory in the hierarchy you're
> >> >> >> >> > copying?  And the client is doing repeated lookups on the same name?
> >> >> >> >> >
> >> >> >> >> > The way to debug this is probably to start with the messages passing to
> >> >> >> >> > the MDS and verifying that lookups are duplicated.  Then enable the
> >> >> >> >> > logging on the kernel client and see why the client isn't using leases or
> >> >> >> >> > the FILE_SHARED cap to avoid them.  We can help you through that on #ceph
> >> >> >> >> > if you like.
> >> >> >> >> >
> >> >> >> >> > sage
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> thanks again
> >> >> >> >> >> Jojy
> >> >> >> >> >>
> >> >> >> >> >> On Thu, Jul 21, 2011 at 9:49 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> >> >> >> > On Thu, 21 Jul 2011, Jojy Varghese wrote:
> >> >> >> >> >> >> Hi
> >> >> >> >> >> >>   I just started looking at the ceph code in the kernel and had a question
> >> >> >> >> >> >> about performance considerations for lookup operations. I noticed that
> >> >> >> >> >> >> for every operation (say copying a directory), the root dentry is
> >> >> >> >> >> >> looked up multiple times, and since these all go to the MDS for the actual
> >> >> >> >> >> >> lookup operation, it affects performance. I am sure consistency is
> >> >> >> >> >> >> the winner here. Is there any plan to improve this, maybe by having
> >> >> >> >> >> >> the MDS push the capability down to the clients when the dentry is
> >> >> >> >> >> >> updated, say from CAP_EXCL to CAP_SHARED when the dentry is
> >> >> >> >> >> >> modified? This way the client node can cache the lookup operation and
> >> >> >> >> >> >> does not have to make a round trip to the MDS.
> >> >> >> >> >> >
> >> >> >> >> >> > In general, the MDS has two ways of keeping a client's cached dentry
> >> >> >> >> >> > consistent:
> >> >> >> >> >> >
> >> >> >> >> >> >  - it can issue the FILE_SHARED capability bit on the parent directory,
> >> >> >> >> >> > which means the entire directory is static and the client can cache
> >> >> >> >> >> > the dentry.
> >> >> >> >> >> >  - if it can't do that, it will issue a per-dentry lease.
> >> >> >> >> >> >
> >> >> >> >> >> > There is an additional 'complete' bit that is used to indicate on the
> >> >> >> >> >> > client that it has the _entire_ directory in cache.  If set, it can do
> >> >> >> >> >> > negative lookups and readdir without hitting the MDS.  That's currently
> >> >> >> >> >> > broken, pending the addition of a d_prune dentry_operation (see
> >> >> >> >> >> > linux-fsdevel email from July 8).
> >> >> >> >> >> >
> >> >> >> >> >> > Anyway, long story short, if you're seeing repeated lookups on a dentry
> >> >> >> >> >> > that isn't changing, something is broken.  Can you describe the workload
> >> >> >> >> >> > in more detail?  Which versions of the client and mds are you running?
> >> >> >> >> >> >
> >> >> >> >> >> > sage
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >>
> >>
> 
> 
