Re: Regarding ceph rbd write path

On Sat, Apr 4, 2015 at 1:20 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> Haomai,
> Yeah, I thought so, but I didn't know much about that implementation. Good to know that it is taking care of that.
> But the krbd path will still be suboptimal then.
> If we can do something in the OSD layer, we may be able to additionally coalesce multiple writes within a PG into a single transaction (of course we need to maintain order). The benefit could be a single omap attribute update for multiple object writes within a PG.
> Maybe I should come up with a prototype if you guys are not foreseeing any problems.

I'm not sure, but I don't think there is a simple way to implement
effective coalescing of multiple transactions this way.

As for the extra metadata, we already have an in-progress PR to reduce
it as much as possible.

Anyway, maybe some smart ideas can be applied to this problem; I look forward to seeing them.

>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> Sent: Friday, April 03, 2015 9:47 PM
> To: Somnath Roy
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Regarding ceph rbd write path
>
> On Sat, Apr 4, 2015 at 8:30 AM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>> In fact, we can probably do it from the OSD side like this.
>>
>> 1. A thread in the sharded opWq takes the ops within a PG by acquiring the lock on the pg_for_processing data structure.
>>
>> 2. Now, before taking the job, it can do a bit of processing to look for transactions on the same object already queued in the map up to that point and coalesce them into a single job.
>>
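>> Rough pseudo-code of what I have in mind (the helpers object_of() and
>> append_to_txn() are made up, not actual OSD code):
>>
>>   // under the shard lock, peek at what is already queued for this PG
>>   // and merge ops that touch the same object into one transaction,
>>   // preserving their original order
>>   std::list<OpRequestRef> batch;
>>   OpRequestRef first = pg_for_processing[pg].front();
>>   for (OpRequestRef op : pg_for_processing[pg]) {
>>     if (object_of(op) != object_of(first))
>>       break;                      // stop at the first different object
>>     batch.push_back(op);
>>   }
>>   ObjectStore::Transaction t;
>>   for (OpRequestRef op : batch)
>>     append_to_txn(t, op);         // xattr/omap updates folded into one txn
>>   // submit t once instead of one transaction per 64K write
>>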
>> Let me know if I am missing anything.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>> Sent: Friday, April 03, 2015 5:17 PM
>> To: ceph-devel@xxxxxxxxxxxxxxx
>> Subject: Regarding ceph rbd write path
>>
>>
>> Hi Sage/Sam,
>> Here is my understanding of the ceph rbd write path.
>>
>> 1. Based on the image order, rbd will decide the rados object size, say 4MB.
>>
>> 2. Now, from the application, say 64K chunks are being written to the rbd image.
>>
>> 3. rbd will calculate the object ids (one of the 4MB objects) and start populating the 4MB objects with the 64K chunks.
>>
>> 4. Now, for each of these 64K chunks the OSD will write 2 setattrs and the OMAP attrs.
>>
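>> For example, with the default striping (stripe count 1, stripe unit = object size), the mapping is roughly:
>>
>>   uint64_t obj_size = 1ull << order;            // order 22 -> 4MB objects
>>   uint64_t obj_no   = image_offset / obj_size;  // which rados object
>>   uint64_t obj_off  = image_offset % obj_size;  // offset inside that object
>>
>>   // 64 sequential 64K writes at offsets 0, 64K, ..., 4M-64K all map to
>>   // obj_no 0, so the xattrs/omap entries for that one object get
>>   // rewritten on every one of those 64 writes
>>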
>> If the above flow is correct, it is updating the same metadata for every 64K chunk written to the same object (and the same pg). So, my question is: is there any way to optimize (coalesce) that in either the rbd or osd layer?
>> I couldn't find any way in the osd layer, as it is holding pg->lock until a transaction completes. But is there any way on the rbd side so that it can intelligently stage/coalesce the writes for the same object and do a batch commit?
>> This should definitely improve WA/performance for sequential writes, though maybe not as much for random ones.
>
> Have you considered using RBDCache? I think it could cover most of the cases you mentioned.
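>
> For example, something like this on the client side (option names and defaults from memory, please double-check for your version):
>
>   [client]
>       # enable librbd writeback caching
>       rbd cache = true
>       # 32MB cache with a 24MB dirty limit (the usual defaults)
>       rbd cache size = 33554432
>       rbd cache max dirty = 25165824
>       # stay in writethrough until the guest issues its first flush
>       rbd cache writethrough until flush = true
>
> With writeback enabled, librbd can absorb the repeated 64K writes to the same object in memory and flush them as fewer, larger requests.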
>
>>
>> Let me know your opinion on this.
>>
>> Thanks & Regards
>> Somnath
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



