>> I was just talking with Elder on IRC yesterday about looking into how
>> much small network transfers are hurting us in cases like these. Even
>> with SSD-based OSDs I haven't seen a very dramatic improvement in small
>> request performance. How tough would it be to aggregate requests into
>> larger network transactions? There would be a latency penalty of
>> course, but we could flush a client-side dirty cache pretty quickly and
>> still benefit if we are getting bombarded with lots of tiny requests.

Yes, I see no improvement with the journal on tmpfs... this is strange.

Also, I have tried with rbd_cache=true, so IOs should already be
aggregated into bigger transactions, but I didn't see any improvement;
I'm still at around 2000 IOPS.

Do you know what the bottleneck is? The rbd protocol (some kind of
per-IO overhead)?
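For reference, my understanding is that rbd_cache=true with the
defaults amounts to something like this on the client side (these are
the documented defaults as far as I know, not values I have tuned):

[client]
    rbd cache = true
    rbd cache size = 33554432
    rbd cache max dirty = 25165824
    rbd cache target dirty = 16777216
    rbd cache max dirty age = 1.0

That is, a 32 MB cache per image, writeback starting above 24 MB of
dirty data, a 16 MB background flush target, and dirty data flushed
once it is 1 s old. For the 4k random write case, the load I have in
mind is roughly what you would get from a guest running something like
"fio --rw=randwrite --bs=4k --iodepth=32 --direct=1 --size=1G
--name=test" (illustrative command, not my exact setup).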
----- Original Message -----
From: "Mark Nelson" <mark.nelson@xxxxxxxxxxx>
To: "Sage Weil" <sage@xxxxxxxxxxx>
Cc: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx>
Sent: Saturday, June 23, 2012 18:40:28
Subject: Re: filestore flusher = false, correct my problem of constant write (need info on this parameter)

On 6/23/12 10:38 AM, Sage Weil wrote:
> On Fri, 22 Jun 2012, Alexandre DERUMIER wrote:
>> Hi Sage,
>> thanks for your response.
>>
>>>> If you turn off the journal completely, you will see bursty write commits
>>>> from the perspective of the client, because the OSD is periodically doing
>>>> a sync or snapshot and only acking the writes then.
>>>> If you enable the journal, the OSD will reply with a commit as soon as the
>>>> write is stable in the journal. That's one reason why it is there: file
>>>> system commits are heavyweight and slow.
>>
>> Yes, of course; I don't want to deactivate the journal. Using a journal
>> on a fast SSD or NVRAM is the right way.
>>
>>>> If we left the file system to its own devices and did a sync every 10
>>>> seconds, the disk would sit idle while a bunch of dirty data accumulated
>>>> in cache, and then the sync/snapshot would take a really long time. This
>>>> is horribly inefficient (the disk is idle half the time), and useless (the
>>>> delayed write behavior makes sense for local workloads, but not servers
>>>> where there is a client on the other end batching its writes). To prevent
>>>> this, 'filestore flusher' will prod the kernel to flush out any written
>>>> data to the disk quickly. Then, when we get around to doing the
>>>> sync/snapshot it is pretty quick, because only fs metadata and
>>>> just-written data need to be flushed.
>>
>> Mmm, I disagree.
>>
>> If you flush quickly, it works fine for a sequential write workload.
>>
>> But if you have a lot of random writes with 4k blocks, for example, you
>> are going to get a lot of disk seeks. The way ZFS or NetApp SAN storage
>> works, they take random writes into a fast journal and then flush them
>> sequentially to slow storage every 20 s or so.
>
> Oh, I see what you're getting at. Yes, that is not ideal for small random
> writes. There is a branch in ceph.git called wip-flushmin that just sets
> a minimum write size for the flush; that will probably do a decent job of
> dealing with this: small writes won't get flushed, large ones will.
> Picking the right value will depend on how expensive seeks are for your
> storage system.
>
> You'll want to cherry-pick just the top commit on top of whatever it is
> you're running...

I was just talking with Elder on IRC yesterday about looking into how
much small network transfers are hurting us in cases like these. Even
with SSD-based OSDs I haven't seen a very dramatic improvement in small
request performance. How tough would it be to aggregate requests into
larger network transactions? There would be a latency penalty of
course, but we could flush a client-side dirty cache pretty quickly and
still benefit if we are getting bombarded with lots of tiny requests.

Mark

--
Alexandre DERUMIER
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc, 59100 Roubaix
12 rue Marivaux, 75002 Paris
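To make the aggregation idea concrete, here is a minimal sketch of a
client-side write coalescer (hypothetical names and thresholds, not
Ceph code) that buffers small writes and ships them as one larger
network transaction:

import time

class WriteCoalescer:
    def __init__(self, send_fn, max_bytes=64 * 1024, max_delay=0.005):
        self.send_fn = send_fn      # ships one batched request downstream
        self.max_bytes = max_bytes  # flush once this much data is dirty
        self.max_delay = max_delay  # or once the batch is this old (seconds)
        self.batch = []             # buffered (offset, data) writes
        self.batch_bytes = 0
        self.first_write = None     # timestamp of the oldest buffered write

    def write(self, offset, data):
        if self.first_write is None:
            self.first_write = time.monotonic()
        self.batch.append((offset, data))
        self.batch_bytes += len(data)
        # A real implementation would also arm a timer here, so a lone
        # small write is not stuck waiting for the next write to arrive.
        if (self.batch_bytes >= self.max_bytes
                or time.monotonic() - self.first_write >= self.max_delay):
            self.flush()

    def flush(self):
        if self.batch:
            self.send_fn(self.batch)  # one transaction instead of many
        self.batch, self.batch_bytes, self.first_write = [], 0, None

# Example: 1000 4k writes collapse into ~63 transactions instead of 1000.
sent = []
c = WriteCoalescer(send_fn=sent.append)
for i in range(1000):
    c.write(i * 4096, b"\x00" * 4096)
c.flush()
print("transactions sent:", len(sent))

The trade-off is the same one driving the flusher discussion above:
batch too little and you pay per-request overhead (network round trips
here, disk seeks in the filestore case); batch too much and you pay
latency. Presumably wip-flushmin exposes exactly this on the OSD side,
as a minimum write size below which the flusher leaves data for the
periodic sync to pick up.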