Re: [PATCH RFC 2/2] ceph: truncate the file contents when needed when file is encrypted

On 10/11/21 9:29 PM, Jeff Layton wrote:
On Sat, 2021-09-25 at 17:56 +0800, Xiubo Li wrote:
On 9/14/21 3:34 AM, Jeff Layton wrote:
On Mon, 2021-09-13 at 13:42 +0800, Xiubo Li wrote:
On 9/10/21 7:46 PM, Jeff Layton wrote:
[...]
Are you certain that Fw caps are enough to ensure that no other client
holds Fr caps?
I spent hours going through the mds Locker-related code over the weekend.

From the mds/Locker.cc code, when the mds filelock is in the LOCK_MIX
state (or in some of the interim transition states toward LOCK_MIX),
different clients are allowed to hold Fw and Fr caps at the same time,
but Fb/Fc will be disabled. Going through mds/Locker.cc I also found that
the mds filelock can switch to the LOCK_MIX state in some cases when one
client wants Fw and another client wants either Fr or Fw.
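
A minimal stand-alone sketch of how I read this (the cap bit names and
values follow ceph_fs.h, but the "allowed in LOCK_MIX" mask is only my
reading of the sm_filelock table, not a verbatim copy of it):

#include <stdio.h>

#define CEPH_CAP_GCACHE   4   /* Fc: client can cache reads   */
#define CEPH_CAP_GRD      8   /* Fr: client can read          */
#define CEPH_CAP_GWR     16   /* Fw: client can write         */
#define CEPH_CAP_GBUFFER 32   /* Fb: client can buffer writes */

/* In LOCK_MIX the MDS may issue Fr and Fw to *different* clients at the
 * same time, but never Fc or Fb, so holding Fw locally does not imply
 * that no other client holds Fr. */
static const int lock_mix_allowed = CEPH_CAP_GRD | CEPH_CAP_GWR;

int main(void)
{
	int client_a = CEPH_CAP_GWR;   /* the writer */
	int client_b = CEPH_CAP_GRD;   /* a concurrent reader on another node */

	printf("A can write: %d, B can read: %d, anyone can cache/buffer: %d\n",
	       (client_a & lock_mix_allowed) == client_a,
	       (client_b & lock_mix_allowed) == client_b,
	       (lock_mix_allowed & (CEPH_CAP_GCACHE | CEPH_CAP_GBUFFER)) != 0);
	return 0;
}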

In this case I think the Linux advisory or mandatory locks are necessary
to keep the file contents consistent. In concurrent read/write or
write/write cases across multiple processes, the consistency of the file
contents is not guaranteed anyway without the Linux advisory/mandatory
locks, so isn't the logic the same here?

If so, couldn't we just assume that Fw vs Fw and Fr vs Fw caps are
effectively exclusive in a correct use case? For example, just after the
mds filelock state switches to LOCK_MIX, if clientA gets the advisory
file lock and the Fw caps, then even if another clientB is successfully
issued the Fr caps, clientB won't do any read because it will still be
stuck waiting for the advisory file lock.
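
A minimal userspace sketch of that assumption, using plain POSIX advisory
locks (nothing cephfs-specific here, just to show the ordering I mean):

#include <fcntl.h>
#include <unistd.h>

/* Both clients take a whole-file advisory lock before touching the file.
 * A reader that has already been issued Fr caps still blocks here until
 * the writer drops its F_WRLCK, so the Fr/Fw overlap that LOCK_MIX allows
 * is never observed by applications that actually use the locks. */
static int lock_whole_file(int fd, short type)
{
	struct flock fl = {
		.l_type   = type,       /* F_RDLCK for readers, F_WRLCK for writers */
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,          /* 0 means "to EOF" */
	};
	return fcntl(fd, F_SETLKW, &fl);  /* F_SETLKW blocks until granted */
}

int main(void)
{
	int fd = open("/tmp/ceph-lock-demo", O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return 1;
	if (lock_whole_file(fd, F_WRLCK) == 0)   /* writer side */
		(void)write(fd, "x", 1);
	close(fd);                               /* closing drops the lock */
	return 0;
}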

I'm not sure I like that idea. Basically, that would change the meaning
of what Frw caps represent, in a way that is not really consistent with
how they have been used before.

We could gate that new behavior on the new feature flags, but it sounds
pretty tough.

I think we have a couple of options:

1) we could just make the clients request and wait on Fx caps when they
do a truncate. They might stall for a bit if there is contention, but it
would ensure consistency and the client could be completely in charge of
the truncate. [a]

2) we could rev the protocol, and have the client send along the last
block to be written along with the SETATTR request.
I am also thinking of sending the last block along with the SETATTR
request, but then the MDS must journal the last block too. I am afraid
that could consume a lot of the cephfs metadata pool in corner cases,
such as when a client sends massive numbers of truncate requests in a
short time, just like in this bug:
https://tracker.ceph.com/issues/52280.


Good point.

Yes, we'd need to buffer the last block on a truncate like this, but we
could limit the number of truncates with "last block" operations that
run concurrently. We'd probably also want to cap the size of the "last
block" too.

Okay, so this seems like the best approach by far.

I will try it tomorrow.



Maybe we should even consider just adding a new TRUNCATE call independent
of SETATTR. The MDS would remain in complete control of it at that point.
Maybe we can just do this:

When the MDS receives a SETATTR request with a size change from clientA,
it will try to xlock the filelock; while that xlock is held the MDS will
only allow Fcb caps to all the clients, so another client could still be
buffering the last block.

I think we can just nudge the journal log for this request in the MDS and
skip the early reply, letting clientA's truncate request wait. When the
journal log has been flushed successfully, and before releasing the xlock
on the filelock, we can tell clientA to do the RMW for the last block.
Currently no client can get the Frw caps while the xlock is held, so we
would need to add one interim xlock state that allows only the xlocker
clientA to have Frw, such as:

[LOCK_XLOCKDONE_TRUNC] = { LOCK_LOCK, false, LOCK_LOCK, 0, XCL, 0, 0, 0, 0, 0,
                           0, 0, CEPH_CAP_GRD|CEPH_CAP_GWR, 0 },

So it will be safe for clientA to do the RMW, and after this the MDS will
finish the truncate request with the safe reply only.
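
Roughly the flow I have in mind, only as illustrative pseudologic; all of
the helper names below are made up and not existing MDS or client code:

#include <stdio.h>

/* Made-up stand-ins for the real MDS/client machinery. */
static void xlock_filelock(void)        { puts("xlock filelock (only Fcb issuable)"); }
static void journal_and_flush(void)     { puts("journal the truncate, flush the log"); }
static void grant_frw_to_xlocker(void)  { puts("enter LOCK_XLOCKDONE_TRUNC: xlocker gets Frw"); }
static void tell_client_to_rmw(void)    { puts("ask clientA to RMW the last encrypted block"); }
static void wait_for_rmw_done(void)     { puts("wait for clientA's 'RMW done' (needs a new message)"); }
static void drop_xlock_and_reply(void)  { puts("drop the xlock, send the safe reply"); }

/* Proposed MDS-side handling of a size-changing SETATTR from clientA. */
int main(void)
{
	xlock_filelock();          /* no client can hold Frw while the size changes */
	journal_and_flush();       /* no early reply; clientA's request waits       */
	grant_frw_to_xlocker();    /* the interim xlock state sketched above        */
	tell_client_to_rmw();
	wait_for_rmw_done();       /* open question: what if clientA dies here?     */
	drop_xlock_and_reply();
	return 0;
}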


This sounds pretty fragile. I worry about splitting responsibility for
truncates across two different entities (MDS and client). That means a
lot more complex failure cases.

Yeah, handling the failure cases will make this more complex.


What will you do if you do this, and then the client dies before it can
finish the RMW? How will you know when the client's RMW cycle is
complete? I assume it'll have to send a "truncate complete" message to
the MDS in that case to know when it can release the xlock?

Okay, I didn't foresee that case; this sounds like it would make things very complex...
The other ideas I've considered seem more complex and don't offer any
significant advantages that I can see.

[a]: Side question: why does buffering a truncate require Fx and not Fb?
How do Fx and Fb interact?

[...]



