Re: libcephfs: Open-By-Handle API - our patch and its description

On Thu, 27 Jun 2013, Matt W. Benjamin wrote:
> Hi Sage,
> 
> Adam and I haven't looked in detail yet, but based on the prior email 
> discussion, and not counting nits, Ilya's changes seem fine.
> 
> If you agree, Adam and I will take on the task of rebasing all our changes
> currently on wip-libcephfs into a small number of changes, following
> your lead in the "rebase" branch where possible.  (I had some questions
> about how the rebase branch was assembled initially; before bugging you,
> we'll have a fresh look at it.)  We may need some interaction on IRC, but
> with luck can have a prototype branch somewhere today or tomorrow.  
> (David, Sage, wrt IRC availability, does that sound likely to work?)

I am traveling today and tomorrow, so my IRC availability will be very 
limited.  But the rebase process I followed before is pretty 
straightforward.  Basically,

 git rebase master
 git reset master
 git gui
   -- interactively stage various bits and pieces and commit them in 
      logical units
 
It might be simpler to just use the branch I already started and fix up 
the commit messages (git rebase -i master, 'e' for each commit, and git 
commit --amend on each to adjust the changelog).

> Our work-in-progress changes logically follow all of this, so I'm not
> worried about those yet.

Great!
sage


> 
> Matt
> 
> ----- "Sage Weil" <sage@xxxxxxxxxxx> wrote:
> 
> > Hi Ilya!
> > 
> > On Thu, 27 Jun 2013, Ilya Storozhilov wrote:
> > > Hi Matt (CC Sage, David, Adam),
> > > 
> > > 
> > > we have completed our prototype of NFS/CIFS frontends to the CEPH
> > > filesystem. In order to do it we had to fork the "wip-libcephfs"
> > > branch (see https://github.com/ferustigris/ceph and its
> > > "open_by_handle_api" branch) to fix some issues with the libcephfs
> > > open-by-handle API - see below. I've attached a respective patch,
> > > which:
> > 
> > I posted some comments on github on the changes.  I think we're
> > close!
> > 
> > > Is it possible to merge our changes into the "wip-libcephfs" branch
> > > of the official CEPH repository? If some action is needed from our
> > > side (further amendments, merge requests, etc.), please let us know.
> > 
> > Matt, what is your plan for this branch?  Is it/are you rebasing?  Are
> > there other changes in progress?
> > 
> > I'd very much like to get the basic changes (the open by handle stuff)
> > into the tree during the next week or two so that we make the dumpling
> > release cutoff.  Getting some of the additional bits that you guys
> > specifically need ironed out (like the flushing and cap ref stuff) is
> > less important for that timeframe, if that helps.
> > 
> > Ideally, the patch series would look more like the rebased version I
> > pushed (but with updated changelogs for each patch and correct
> > authorship/sign-off).  If it's based on current master, and this patch
> > goes on top, I'll pull it in.
> > 
> > It would also be great to see some tests in test/libcephfs/ for the
> > new calls as well... :)
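
For illustration, a minimal sketch of such a test: the create/mount/shutdown
calls are the stable libcephfs API, while ceph_ll_lookup_root is the call
proposed later in this thread, so the signature shown is an assumption rather
than a confirmed API.

  /* Hypothetical test for test/libcephfs/: mount, fetch the root handle
   * via the proposed call, and sanity-check it. */
  #include <assert.h>
  #include <cephfs/libcephfs.h>

  int main(void)
  {
    struct ceph_mount_info *cmount;
    struct Inode *root = NULL;            /* opaque inode handle */

    assert(ceph_create(&cmount, NULL) == 0);
    assert(ceph_conf_read_file(cmount, NULL) == 0);
    assert(ceph_mount(cmount, "/") == 0);

    /* assumed signature: int ceph_ll_lookup_root(cmount, Inode **out) */
    assert(ceph_ll_lookup_root(cmount, &root) == 0);
    assert(root != NULL);

    ceph_shutdown(cmount);
    return 0;
  }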
> > 
> > > There are still some unresolved issues in libcephfs, which have to
> > > be reported:
> > > 
> > >  * A file can still be manipulated via its resource handle (inode)
> > >    even if it has already been removed by another client, and
> > >    syncing all data in both cephfs connections does not help in this
> > >    case;
> > 
> > If I'm understanding you, this is normal unix behavior: open, unlink,
> > do i/o for as long as you want, close.  That is by design.
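
For illustration, the same lifecycle as a self-contained C program against
any POSIX filesystem; the unlinked file stays fully usable through the open
descriptor until the last close.

  #include <assert.h>
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
    char buf[6] = {0};
    int fd = open("scratch.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);
    assert(fd >= 0);
    assert(unlink("scratch.tmp") == 0);      /* name is gone...        */
    assert(pwrite(fd, "hello", 5, 0) == 5);  /* ...but i/o still works */
    assert(pread(fd, buf, 5, 0) == 5);
    assert(strcmp(buf, "hello") == 0);
    close(fd);                               /* inode is freed here    */
    return 0;
  }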
> > 
> > >  * Invalid hard-link count for directories: always 1 instead of 2
> > >    plus one per subdirectory, as dictated by POSIX.
> > 
> > This is a very common divergence from POSIX, and one you'll find with
> > btrfs and others as well.  IIRC the de facto convention is that if the
> > link count is 1 it does not reflect the subdirectory count, but if it
> > is >= 2 it does.  You'll see code to this effect in GNU find (which
> > tries to use the link count to optimize its search when -type d).
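
A sketch of that convention in C (generic POSIX, nothing Ceph-specific):
st_nlink is read as "2 plus the number of subdirectories" only when it is at
least 2, which is the same heuristic GNU find relies on.

  #include <stdio.h>
  #include <sys/stat.h>

  /* Returns the subdirectory count when st_nlink is trustworthy, else -1. */
  static long subdir_count(const char *path)
  {
    struct stat st;
    if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
      return -1;
    return st.st_nlink >= 2 ? (long)st.st_nlink - 2 : -1;
  }

  int main(void)
  {
    printf("subdirs of /tmp: %ld (-1 = unknown)\n", subdir_count("/tmp"));
    return 0;
  }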
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > > Should we report these bugs on github?
> > > 
> > > 
> > > Best regards!
> > > 
> > > --
> > > Ilya Storozhilov
> > > Lead Software Engineer
> > >  
> > > EPAM Systems
> > > Tver office, Russia
> > > GMT+4
> > > 
> > > ________________________________________
> > > From: Matt W. Benjamin [matt@xxxxxxxxxxxx]
> > > Sent: 6 June 2013, 21:47
> > > To: Ilya Storozhilov
> > > Cc: Sage Weil; aemerson; David Zafman
> > > Subject: DRAFT: Re: HA: libcephfs: Open-By-Handle API question
> > > 
> > > Hi Ilya (CC Sage, David, Adam),
> > > 
> > > Thanks for looking at the branch.
> > > 
> > > We'll try to help you get the most out of Ganesha+Ceph, if that's
> > > your goal.
> > > 
> > > I don't overly like "open-by-handle" as a descriptor for all the
> > > changes we and David Zafman have made so far, but obviously, that's
> > > a core functionality it's providing to, e.g., an NFS re-exporter.
> > > 
> > > In terms of your document,
> > > 
> > > #1 The "resource handle" vs. pointer-to-opaque-object distinction
> > > seems like a merely cosmetic change. What the API does now is typical
> > > of many C APIs (see cred_t in the NetBSD kernel, many others), as
> > > well as C-over-C++ APIs. (Perhaps this branch has actual C++ leakage;
> > > if so, it's inadvertent, and we'll fix it.)
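
For readers following along, a minimal sketch of the opaque-pointer pattern
being described; the names are illustrative only, not actual Ceph or NetBSD
types.

  /* Public header: the type is declared but never defined, so callers
   * hold a typed pointer without seeing the layout behind it. */
  struct example_inode;                 /* opaque to callers */
  int example_getattr(struct example_inode *in, long *size_out);

  /* Implementation file: the real layout stays private, and may even be
   * backed by C++ internally without leaking into the C API. */
  struct example_inode {
    long size;
    /* ... */
  };

  int example_getattr(struct example_inode *in, long *size_out)
  {
    *size_out = in->size;
    return 0;
  }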
> > > 
> > > #2 The "recoverable handle" concept needs to be subordinate to the
> > > semantics of the objects and capability model of Ceph. It sounds like
> > > you're generally speaking about objects at the NFS protocol level.
> > > With respect to NFS filehandles, everything you're asserting is
> > > already true. With respect to open state and other cluster coherence
> > > concepts, you've gone past the NFS protocol to details of a clustered
> > > implementation. Our team has specific requirements here that may
> > > differ from yours, just fyi.
> > > 
> > > #3, #4 I (and Adam here) aren't fully persuaded of this. While
> > > embedding objects in Ganesha handle memory would save small
> > > allocations, that may not be a critical consideration. From a Ganesha
> > > perspective, it's definitely just a hidden property of the driver
> > > implementation. I think intuitively we're not really keen on having
> > > Inode objects (say) be expanded in caller handles, although yes,
> > > Ganesha itself can support it.
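
A sketch of the two layouts being weighed here; the struct names are
illustrative, not Ganesha's or Ceph's actual types.

  struct fs_inode { long ino; /* ... */ };

  /* (a) embedded: one allocation per handle, but the object's size and
   * lifetime are fixed inside the caller's memory */
  struct handle_embedded {
    struct fs_inode inode;      /* expanded into the handle */
  };

  /* (b) pointer: one extra small allocation, object stays opaque and a
   * hidden property of the driver implementation */
  struct handle_ptr {
    struct fs_inode *inode;     /* owned by the filesystem library */
  };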
> > > 
> > > #5 No objection here.
> > > 
> > > #6 I've been thinking of the "which level of API clients should use,
> > > and which a slight modification" question in more historical, and
> > > less prescriptive, terms. I do think it makes sense to continue
> > > refactoring to reduce code duplication, in addition to getting
> > > thread-level concurrency in the Client cache (that's what my
> > > api-concurrent topic branch is focused on). Obviously, we want to do
> > > this without inconveniencing other users.
> > > 
> > > Other points from your email:
> > > 
> > > The lack of a convenience function for getting the root inode is
> > > just an artifact of Fuse, Ganesha+Ceph, and others being able to
> > > construct the inode descriptively. I don't think it's really a C++
> > > encapsulation issue. We can add an API (inline in Client) to get the
> > > root inode, however, and will add it, if there's no objection.
> > > 
> > > Getattr fixes. May not be API related. I planned to push a revision
> > > of libcephfs-wip this week, so perhaps fixes can follow this.
> > > 
> > > We haven't spent time with extended attributes so far, because
> > > Ganesha's current xattr implementation is a prototype likely to
> > > change soon. We may have missed something obvious here, though; I'll
> > > look at it.
> > > 
> > > Regards,
> > > 
> > > Matt
> > > 
> > > ----- "Ilya Storozhilov" <Ilya_Storozhilov@xxxxxxxx> wrote:
> > > 
> > > > Hi Matt,
> > > >
> > > > we have organized some testing of the open-by-handle API stuff
> > > > from the "libcephfs-wip" branch of the Ceph repo and I'm sending
> > > > you our thoughts about it. I will refer to the inode structure as
> > > > a "resource handle" in this letter - this is how NFS-Ganesha
> > > > refers to it.
> > > >
> > > > First of all, there is a small bug when making file operations on
> > > > a resource which has already been removed in another CephFS
> > > > connection - you can even open this file for writing and append
> > > > some data to it. Actually there should be an ENOENT error, but
> > > > there is not. A ceph_ll_getattr() operation returns no error on
> > > > this resource handle, but the number of hard links equals zero in
> > > > the returned stat structure.
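
A sketch of the reported reproduction; the ceph_ll_getattr signature shown is
an assumption about the wip branch, not a confirmed API.

  /* Two mounts of the same filesystem: connection b unlinks the file
   * while connection a still holds its inode handle.  Reported result:
   * getattr succeeds with st_nlink == 0 instead of failing with ENOENT. */
  #include <stdio.h>
  #include <sys/stat.h>
  #include <cephfs/libcephfs.h>

  void check_stale_handle(struct ceph_mount_info *a,
                          struct ceph_mount_info *b,
                          struct Inode *handle)  /* /victim's handle, from a */
  {
    struct stat st;
    ceph_unlink(b, "/victim");            /* removed via connection b */
    /* assumed signature: getattr by handle with caller uid/gid */
    int rc = ceph_ll_getattr(a, handle, &st, 0, 0);
    printf("rc=%d st_nlink=%ld (expected ENOENT)\n", rc, (long)st.st_nlink);
  }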
> > > >
> > > > There is no method for listing extended attributes (e.g.
> > > > ceph_ll_listxattr()) or for looking up the root resource handle
> > > > (e.g. ceph_ll_lookup_root() - it was mentioned in my previous
> > > > letter).
> > > >
> > > > If you try to fetch an extended attribute value by calling
> > > > ceph_ll_getxattr() after setting it using ceph_ll_setxattr(), you
> > > > will receive an ENODATA error. ceph_ll_removexattr() does not
> > > > return ENODATA when trying to remove a non-existent extended
> > > > attribute.
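
The expected round-trip, shown with the stable path-based xattr calls (the
ll variants under discussion should behave the same way).

  #include <assert.h>
  #include <cephfs/libcephfs.h>

  void xattr_roundtrip(struct ceph_mount_info *cmount)
  {
    char buf[16] = {0};
    assert(ceph_setxattr(cmount, "/f", "user.k", "v", 1, 0) == 0);
    /* getxattr right after setxattr should return the value, not ENODATA */
    assert(ceph_getxattr(cmount, "/f", "user.k", buf, sizeof(buf)) == 1);
    assert(buf[0] == 'v');
    /* removing a missing attribute is what should report ENODATA */
    assert(ceph_removexattr(cmount, "/f", "user.missing") < 0);
  }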
> > > >
> > > > So at the moment we have made a draft list of key requirements for
> > > > the open-by-handle CephFS API and created an API header prototype -
> > > > please look at the attached file. What do you think about it?
> > > >
> > > > P.S. Generally speaking, I think we can provide you with a patch
> > > > which fixes these issues, if you need it. But mostly these will be
> > > > only workarounds, because, from my point of view, a little
> > > > reengineering is needed - see the attached document.
> > > >
> > > > Best regards,
> > > > Ilya
> > > >
> > > > ________________________________________
> > > > From: Matt W. Benjamin [matt@xxxxxxxxxxxx]
> > > > Sent: 4 June 2013, 17:10
> > > > To: Ilya Storozhilov
> > > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > > Subject: Re: libcephfs: Open-By-Handle API question
> > > >
> > > > Hi Ilya,
> > > >
> > > > The changes on this branch originated in our Ganesha NFS driver
> > > > for Ceph, so I'm not sure where the gap is, if any.
> > > >
> > > > I'll send an update to the list when we've finished re-integrating
> > > > against the libcephfs-wip merge branch.
> > > >
> > > > Matt
> > > >
> > > > ----- "Ilya Storozhilov" <Ilya_Storozhilov@xxxxxxxx> wrote:
> > > >
> > > > > Hi Ceph developers,
> > > > >
> > > > > in order to provide an NFS frontend to CephFS data storage we
> > > > > are trying to use the new Open-By-Handle API from the
> > > > > 'src/include/cephfs/libcephfs.h' file, which is on the
> > > > > 'wip-libcephfs' branch at the moment. The API looks quite
> > > > > consistent and useful, but we couldn't find a method to get a
> > > > > pointer to the root inode of the mounted Ceph filesystem.
> > > > >
> > > > > At the moment we have found only one place where it could be
> > > > > fetched from: the 'Inode* root' member of the 'Client' class
> > > > > ('src/client/Client.h'), but it is in the 'protected' section,
> > > > > so some hack is needed (e.g. introducing a descendant of Client
> > > > > which provides a method to access this protected member). Do you
> > > > > know how to fetch a pointer to the root inode of the mounted
> > > > > Ceph filesystem without any hacking (using just the official
> > > > > CephFS API)?
> > > > >
> > > > > Thank you and best wishes,
> > > > > Ilya V. Storozhilov
> > > > > EPAM Systems
> > > > > Lead Software Engineer
> > > > >
> > > > > P.S. What do you think about making the 'Open-By-Handle' API the
> > > > > primary rather than a low-level API to CephFS, and making the
> > > > > POSIX-like API just a helper addendum to it?
> > > >
> > > > --
> > > > Matt Benjamin
> > > > The Linux Box
> > > > 206 South Fifth Ave. Suite 150
> > > > Ann Arbor, MI 48104
> > > >
> > > > http://linuxbox.com
> > > >
> > > > tel. 734-761-4689
> > > > fax. 734-769-8938
> > > > cel. 734-216-5309
> > > 
> > > --
> > > Matt Benjamin
> > > The Linux Box
> > > 206 South Fifth Ave. Suite 150
> > > Ann Arbor, MI 48104
> > > 
> > > http://linuxbox.com
> > > 
> > > tel. 734-761-4689
> > > fax. 734-769-8938
> > > cel. 734-216-5309
> > > 
> > >
> 
> -- 
> Matt Benjamin
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
> 
> http://linuxbox.com
> 
> tel.  734-761-4689 
> fax.  734-769-8938 
> cel.  734-216-5309 
> 
> 
