Re: ganesha.conf template for nfs-rgw (was Re: getting inconsistent results in nfs-rgw readdir :( )

Setting dir_chunk=0 bypasses all of Ganesha's readdir code, which is what completely breaks RGW, since RGW depends on that code. A bit of background:

NFS uses a POSIX-like cookie system for readdir. Each dirent has a cookie (a 64-bit integer) associated with it that, when passed back to the server in another readdir, yields the dirent *after* the one associated with that cookie. This can cause issues, even on local filesystems, when mutation of the directory changes the ordering of dirents.
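To make that concrete, here is a minimal, self-contained C sketch of resuming a listing from a cookie (illustrative only: real NFS cookies are opaque to the client, and these are not Ganesha's actual structures):

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* A directory as the server sees it: each entry carries a cookie. */
    struct dirent_entry { uint64_t cookie; const char *name; };

    static const struct dirent_entry dir[] = {
        { 100, "a.txt" }, { 200, "b.txt" }, { 300, "c.txt" },
    };

    /* Return entries strictly *after* the given cookie (0 = start).
     * If the directory mutates and entries reorder, a cookie saved by
     * the client can resume in the wrong place. */
    static void readdir_from(uint64_t cookie)
    {
        for (size_t i = 0; i < sizeof(dir) / sizeof(dir[0]); i++)
            if (dir[i].cookie > cookie)
                printf("%s (cookie %llu)\n", dir[i].name,
                       (unsigned long long)dir[i].cookie);
    }

    int main(void)
    {
        readdir_from(0);    /* full listing: a.txt, b.txt, c.txt */
        readdir_from(200);  /* resume after b.txt: prints only c.txt */
        return 0;
    }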

RGW cannot support this directly, as it has no concept of an inode. Instead, it does listings based on object names of "arbitrary" length, with the listing starting at the first object matching the name. This means that something has to map between names and cookies. Since Ganesha is the NFS translator, it does this mapping in its readdir code, so disabling that code (with dir_chunk=0) breaks RGW directory listings completely.
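For illustration only (a hypothetical sketch, not Ganesha's actual implementation), the mapping problem looks roughly like this: the client hands back a cookie, and the server must recover the object name that a marker-based listing like RGW's understands:

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    #define MAX_ENTRIES 1024

    /* Hypothetical per-directory table: cookie i maps to the i-th
     * object name returned by the backend listing. */
    static const char *names[MAX_ENTRIES];
    static size_t n_names;

    /* Assign the next cookie to a name as the listing is filled in. */
    static uint64_t cookie_for(const char *name)
    {
        names[n_names] = name;
        return (uint64_t)++n_names;    /* cookies 1..N; 0 means "start" */
    }

    /* Recover the name marker for a cookie the client sent back. */
    static const char *name_for(uint64_t cookie)
    {
        return (cookie >= 1 && cookie <= n_names) ? names[cookie - 1]
                                                  : NULL;
    }

    int main(void)
    {
        cookie_for("photos/a.jpg");
        uint64_t c = cookie_for("photos/b.jpg");
        printf("resume listing after object \"%s\"\n", name_for(c));
        return 0;
    }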

It's my opinion that dir_chunk should generally be used by all FSALs, if at all possible. We've put a lot of work into the readdir code, and in all of our testing it has made readdir traversals much faster than having it disabled. I admit I haven't run numbers on CephFS, so CephFS might be the exception.

This opinion aside, RGW depends on having dir_chunk enabled, and it is a global config knob in Ganesha. So, if CephFS and RGW are going to share a Ganesha instance, that instance has to have dir_chunk enabled. If we want dir_chunk disabled for CephFS, then we will need to stand up separate Ganesha instances for CephFS and RGW. This should be doable in a containerized setup, but will be more difficult on real hardware, as one of them will need to run on a non-standard port.
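Roughly, that split might look like this in ganesha.conf (NFS_Core_Param/NFS_Port and MDCACHE/Dir_Chunk are real config blocks; the specific port and values here are only illustrative):

    # Instance serving RGW: standard port, readdir chunking enabled
    NFS_Core_Param { NFS_Port = 2049; }
    MDCACHE { Dir_Chunk = 128; }    # default value; must be non-zero for RGW

    # Instance serving CephFS: non-standard port, chunking disabled
    NFS_Core_Param { NFS_Port = 12049; }    # illustrative port choice
    MDCACHE { Dir_Chunk = 0; }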

(And the docs aren't great, sorry about that)

Daniel

On 5/26/21 8:55 AM, Jeff Layton wrote:
Good question:

When I originally started doing the clustered ganesha over cephfs work,
I immediately moved to disable as much caching as possible in ganesha,
figuring that:

a) double caching is wasteful with memory

...and...

b) that libcephfs knows better when it's safe to cache and when not

So, other more experienced ganesha folks recommended some settings at
that time, including dir_chunk=0.

I've never done any significant testing with dir_chunk set to anything
but 0, and I don't have a very clear idea of what setting dir_chunk
actually _does_.

Ganesha's docs are no help here either. They just say:

Dir_Chunk(uint32, range 0 to UINT32_MAX, default 128)
     Size of per-directory dirent cache chunks, 0 means directory
chunking is not enabled.

...but I'm not sure what directory chunking even _is_ and when and why
I'd want to enable or disable it.

If we set it to a non-zero value (or don't set it at all), what sort of
effects can we expect?

-- Jeff

On Tue, 2021-05-25 at 10:35 -0500, Sage Weil wrote:
Adding dev list.

Jeff, is it okay to remove dir_chunk=0 for the cephfs case?

sage


On Tue, May 25, 2021 at 7:43 AM Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:

I think dir_chunk=0 should never be used, even for cephfs.  It's not
intended to be used in general, only for special circumstances (an
out-of-tree FSAL asked for it, and we use it upstream for debugging
readdir), and it may go away in a future version of Ganesha.

The rest is probably okay for both of them.  However, this raises some
issues.  Some settings, such as dir_chunk=0, Attr_Expiration_Time=0, and
only_numeric_owners=true, are global to Ganesha.  This means that, if
CephFS and RGW need different global settings, they'd have to run in
different instances of Ganesha.  Is this something we're interested in?

Daniel

On 5/25/21 8:11 AM, Sebastian Wagner wrote:
Moving this to upstream, as this is an upstream issue.

Hi Mike, hi Sage,

Do we need to rethink how we deploy ganesha daemons? Looks like we need
different ganesha.conf templates for cephfs and rgw.

- Sebastian

Am 25.05.21 um 13:59 schrieb Matt Benjamin:
Hi Sebastian,

1. yes, I think we should use different templates
2. MDCACHE { dir_chunk = 0; } is fatal for RGW NFS--it seems suited to
avoid double caching of vnodes in the cephfs driver, but simply cannot
be used with RGW
3. RGW has some other preferences--for example, some environments
might prefer only_numeric_owners = true;  Sage is already working on
extending cephadm to generate exports differently, which should allow
for multiple tenants

Matt

On Tue, May 25, 2021 at 7:39 AM Sebastian Wagner <sewagner@xxxxxxxxxx>
wrote:
Hi Matt,

This is the ganesha.conf template that we use for both cephfs and rgw:

https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/templates/services/nfs/ganesha.conf.j2


I have the slight impression that we might need two different templates
for rgw and cephfs?

Best,
Sebastian

...snip...






