On Fri, 2018-03-02 at 11:52 -0800, Gregory Farnum wrote:
> Just closing the loop on this as the initial report scared me half
> to death. :) This was a bug in the Ganesha-side code, and....
>
> On Thu, Mar 1, 2018 at 1:13 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Thu, 2018-03-01 at 14:11 -0500, Jeff Layton wrote:
> > > nfs-ganesha can store its client recovery database in a RADOS object. To
> > > do this it uses librados with a write_op to store the value in the omap.
> > >
> > > This generally works, but I've been playing with containerizing ganesha
> > > and there I've noticed that occasionally the string that is stored in
> > > the value field of the omap is truncated.
> > >
> > > On my test rig, the value should be 63 bytes, but ends up only being 29.
> > > Looking at the wire traffic, it's clear that the omap set operation is
> > > not done until the daemon is being shut down, well after the
> > > rados_write_op_operate call was issued.
>
> This isn't really a possible failure mode for librados. omap values
> are transmitted in atomic messages; if the message is truncated on
> transmission the OSD won't do anything with it. Any kind of short omap
> value would have to be a bug in memory management, and omap is tested
> extensively in our nightlies, so it's *probably* not a bug on the
> RADOS side of things.
> Similarly, any invocation of a librados write function is completed
> and durable on the OSD before you get a response (either by having the
> function return or by having the AioCompletion get triggered, depending
> on whether you are using sync or async functions).
>
> > >
> > > What I think is happening is that we generally end up with exclusive
> > > caps on this object, and the client just caches the write operation.
> > > When we go to shut down, we don't shut down the connection properly
> > > (currently) and that causes it to miss writing out the object.
>
> Caps are a CephFS concept that doesn't come into direct uses of RADOS at
> all. RADOS objects do not have coherent access mechanisms; operations
> are received and handled atomically, but that's it. librados doesn't
> perform any sort of buffering, etc.
>
> > >
> > > I think I can probably fix this by shutting things down cleanly in the
> > > clean shutdown case, but in this case, we really do need to
> > > synchronously write out the omap key to the database even if we hold
> > > exclusive caps on the object. We're storing info to be used after a
> > > major outage, and we need this data to be stored properly before we can
> > > issue state based on it.
> > >
> > > So with all of that... how do I force the librados client to not buffer
> > > things when rados_write_op_operate is called? I had an initial hope that
> > > LIBRADOS_OPERATION_IGNORE_CACHE might do the right thing, but that seems
> > > to have more to do with internal OSD operation.
>
> Yeah, that flag is about cache tiers.
>
> Hope this clarifies things and/or eases people's minds! :)
> -Greg
>

<facepalm>

Thanks for straightening me out, Greg and Jason. I've found the bug in
ganesha and things seem to be working properly now!

Sorry for the false alarm!
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
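
The thread does not say what the ganesha bug actually was, so the following
is only a sketch of one class of caller-side mistake that can produce a short
value when a store call takes a value pointer plus an explicit length. Nothing
here is librados or ganesha code: struct omap_value, toy_omap_set and both
store_* helpers are hypothetical names made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a stored omap value: exactly `len` bytes.
 * This is NOT a librados type. */
struct omap_value {
    char   data[128];
    size_t len;
};

/* Toy "set" that mimics a (value pointer, explicit length) calling
 * convention: it stores exactly `len` bytes, clamped to the buffer. */
static void toy_omap_set(struct omap_value *dst, const char *val, size_t len)
{
    if (len > sizeof(dst->data))
        len = sizeof(dst->data);
    memcpy(dst->data, val, len);
    dst->len = len;
}

/* Buggy caller (hypothetical): measures the string after writing only
 * the first part, appends the rest, then passes the stale length. */
static size_t store_with_stale_len(struct omap_value *dst,
                                   const char *head, const char *tail)
{
    char buf[128];
    size_t stale_len;

    snprintf(buf, sizeof(buf), "%s", head);
    stale_len = strlen(buf);
    snprintf(buf + stale_len, sizeof(buf) - stale_len, "%s", tail);
    toy_omap_set(dst, buf, stale_len);   /* bug: should be strlen(buf) */
    return dst->len;
}

/* Correct caller: measures the string only once it is complete. */
static size_t store_with_final_len(struct omap_value *dst,
                                   const char *head, const char *tail)
{
    char buf[128];

    snprintf(buf, sizeof(buf), "%s%s", head, tail);
    toy_omap_set(dst, buf, strlen(buf));
    return dst->len;
}
```

With head "abc" and tail "defgh", store_with_stale_len() stores 3 bytes while
store_with_final_len() stores all 8. The store does exactly what it was told
in both cases, which is consistent with Greg's point that a short value has to
originate in the caller.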