Re: librados: how to forcibly flush omap value setting to OSD?

Just closing the loop on this, as the initial report scared me half
to death. :) This was a bug in the Ganesha-side code, and...


On Thu, Mar 1, 2018 at 1:13 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> On Thu, 2018-03-01 at 14:11 -0500, Jeff Layton wrote:
>> nfs-ganesha can store its client recovery database in a RADOS object. To
>> do this it uses librados with a write_op to store the value in the omap.
>>
>> This generally works, but I've been playing with containerizing ganesha
>> and there I've noticed that occasionally the string that is stored in
>> the value field of the omap is truncated.
>>
>> On my test rig, the value should be 63 bytes, but ends up only being 29.
>> Looking at the wire traffic, it's clear that the omap set operation is
>> not done until the daemon is being shut down, well after the
>> rados_write_op_operate call was issued.

This isn't really a possible failure mode for librados. omap values
are transmitted in atomic messages; if the message is truncated on
transmission the OSD won't do anything with it. Any kind of short omap
value would have to be a bug in memory management, and omap is tested
extensively in our nightlies, so it's *probably* not a bug on the
RADOS side of things.
Similarly, any invocation of a librados write function is completed
and durable on the OSD before you get a response (the function
returning if you are using the sync functions, or the AioCompletion
getting triggered if you are using the async ones).

>>
>> What I think is happening is that we generally end up with exclusive
>> caps on this object, and the client just caches the write operation.
>> When we go to shut down, we don't shut down the connection properly
>> (currently) and that causes it to miss writing out the object.

Caps are a CephFS concept and don't come into direct uses of RADOS at
all. RADOS objects have no coherent-access mechanism: operations are
received and handled atomically, but that's it. librados doesn't
perform any sort of buffering, either.

>>
>> I think I can probably fix this by shutting things down cleanly in the
>> clean shutdown case, but in this case, we really do need to
>> synchronously write out the omap key to the database even if we hold
>> exclusive caps on the object. We're storing info to be used after a
>> major outage, and we need this data to be stored properly before we can
>> issue state based on it.
>>
>> So with all of that... how do I force the librados client to not buffer
>> things when rados_write_op_operate is called? I had an initial hope that
>> LIBRADOS_OPERATION_IGNORE_CACHE might do the right thing, but that seems
>> to have more to do with internal OSD operation.

Yeah, that flag is about cache tiers.

Hope this clarifies things and/or eases people's minds! :)
-Greg

>>
>
> Oof, I completely misunderstood this problem! I think it's still a
> relevant question (as I'm not certain anything guarantees this), but the
> problem actually seems to be that the client is sending a truncated
> omap value update just before it dies in some cases.
>
> I've opened a tracker bug here:
>
>     http://tracker.ceph.com/issues/23194