Re: [PATCH 1/3] libceph: call r_unsafe_callback when unsafe reply is received

Alex Elder <alex.elder@xxxxxxxxxx> · Tue, 02 Jul 2013 13:11:58 -0500

On 07/02/2013 01:10 PM, Sage Weil wrote:
> On Tue, 2 Jul 2013, Alex Elder wrote:
>> On 06/24/2013 01:41 AM, Yan, Zheng wrote:
>>> From: "Yan, Zheng" <zheng.z.yan@xxxxxxxxx>
>>
>> Sorry it took so long, I intended to take a look at this
>> for you sooner.
>>
>> I would also like to thank you for this nice clear
>> description.  It made it very easy to understand
>> why you were proposing the change, and to focus in
>> on exactly which parts of the design it's affecting.
>>
>>> We can't use !req->r_sent to check if OSD request is sent for the
>>> first time, this is because __cancel_request() zeros req->r_sent
>>> when OSD map changes. Rather than adding a new variable to struct
>>
>> You're right.
>>
>>> ceph_osd_request to indicate if it's sent for the first time, We
>>> can call the unsafe callback only when unsafe OSD reply is received.
>>> If OSD's first reply is safe, just skip calling the unsafe callback.
>>
>> This seems reasonable, but it's different from the way I
>> thought about what constituted "unsafe."  But I may be
>> wrong, and the way this is used by the file system might
>> do something that addresses my concern.
>>
>> The way I interpreted "unsafe" was simply that it was possible
>> a write *could* have been made persistent, even if the client
>> doesn't know about it.  A request could have made it to its
>> target osd, been written, and the response could be in flight
>> at the point something (maybe a router?) crashes and the response
>> gets lost.  During that time window, the stored data may not be
>> in a state that's consistent with the client's view of it.
>>
>> So I thought of "unsafe" as meaning that a write is in flight,
>> and until we get a successful response, the storage might
>> contain the old data or it might contain the new data; the
>> client has no way of knowing which.
>>
>> With that interpretation, a request becomes unsafe the
>> instant it leaves the client, and becomes safe again
>> the instant a response arrives.
>>
>> If my interpretation is correct, this change is wrong.
> 
> The interpretation is correct, but in this case it doesn't matter.  There 
> are two intervals:
> 
>  - write(2) starts
>  - request is sent
>   <interval 1>
>  - got ack reply, write(2) returns
>   <interval 2>
>  - got commit reply
> 
> The important end result is that we need to wait for requests in interval 
> 2 if we fsync().  With your 'unsafe' definition, we *also* wait for 
> syscalls that haven't returned yet, but this isn't necessary... fsync() 
> need only wait for completed but uncommitted writes, not racing ones.  We 
> could quibble about better naming, but the end result is correct.

OK, sounds good to me.  In that case you can include this if you like:

Reviewed-by: Alex Elder <elder@xxxxxxxxxx>

> sage
> 
> 
>>
>> But I may be wrong, and there may really be no need to
>> worry about a possible modification of data until after
>> an acknowledgement response is received.  In that case,
>> I've looked at your patch and it looks good.
>>
>> Can you explain why I'm wrong about what is "unsafe?"
>>
>> 					-Alex
>>
>>> The purpose of unsafe callback is adding unsafe request to a list,
>>> so that fsync(2) can wait for the safe reply. fsync(2) doesn't need
>>> to wait for a write(2) that hasn't returned yet. So it's OK to add
>>> request to the unsafe list when the first OSD reply is received.
>>> (ceph_sync_write() returns after receiving the first OSD reply)
>>>
>>> Signed-off-by: Yan, Zheng <zheng.z.yan@xxxxxxxxx>
>>> ---
>>>  net/ceph/osd_client.c | 14 +++++++-------
>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>>> index 540dd29..dd47889 100644
>>> --- a/net/ceph/osd_client.c
>>> +++ b/net/ceph/osd_client.c
>>> @@ -1337,10 +1337,6 @@ static void __send_request(struct ceph_osd_client *osdc,
>>>  
>>>  	ceph_msg_get(req->r_request); /* send consumes a ref */
>>>  
>>> -	/* Mark the request unsafe if this is the first timet's being sent. */
>>> -
>>> -	if (!req->r_sent && req->r_unsafe_callback)
>>> -		req->r_unsafe_callback(req, true);
>>>  	req->r_sent = req->r_osd->o_incarnation;
>>>  
>>>  	ceph_con_send(&req->r_osd->o_con, req->r_request);
>>> @@ -1431,8 +1427,6 @@ static void handle_osds_timeout(struct work_struct *work)
>>>  
>>>  static void complete_request(struct ceph_osd_request *req)
>>>  {
>>> -	if (req->r_unsafe_callback)
>>> -		req->r_unsafe_callback(req, false);
>>>  	complete_all(&req->r_safe_completion);  /* fsync waiter */
>>>  }
>>>  
>>> @@ -1559,14 +1553,20 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
>>>  	mutex_unlock(&osdc->request_mutex);
>>>  
>>>  	if (!already_completed) {
>>> +		if (req->r_unsafe_callback &&
>>> +		    result >= 0 && !(flags & CEPH_OSD_FLAG_ONDISK))
>>> +			req->r_unsafe_callback(req, true);
>>>  		if (req->r_callback)
>>>  			req->r_callback(req, msg);
>>>  		else
>>>  			complete_all(&req->r_completion);
>>>  	}
>>>  
>>> -	if (flags & CEPH_OSD_FLAG_ONDISK)
>>> +	if (flags & CEPH_OSD_FLAG_ONDISK) {
>>> +		if (req->r_unsafe_callback && already_completed)
>>> +			req->r_unsafe_callback(req, false);
>>>  		complete_request(req);
>>> +	}
>>>  
>>>  done:
>>>  	dout("req=%p req->r_linger=%d\n", req, req->r_linger);
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html