>> I was just talking with Elder on IRC yesterday about looking into how
>> much small network transfers are hurting us in cases like these. Even
>> with SSD-based OSDs I haven't seen a very dramatic improvement in small
>> request performance. How tough would it be to aggregate requests into
>> larger network transactions? There would be a latency penalty of
>> course, but we could flush a client-side dirty cache pretty quickly and
>> still benefit if we are getting bombarded with lots of tiny requests.

Yes, I see no improvement with the journal on tmpfs... this is strange.

Also, I have tried with rbd_cache=true, so IOs should already be
aggregated into bigger transactions, but I didn't see any improvement;
I'm still at around 2000 IOPS.

Do you know what the bottleneck is? The rbd protocol (some kind of
per-IO overhead)?
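For reference, my understanding is that rbd_cache=true with the
defaults amounts to something like this on the client side (these are
the documented defaults as far as I know, not values I have tuned):

[client]
    rbd cache = true
    rbd cache size = 33554432
    rbd cache max dirty = 25165824
    rbd cache target dirty = 16777216
    rbd cache max dirty age = 1.0

That is, a 32 MB cache per image, writeback starting above 24 MB of
dirty data, a 16 MB background flush target, and dirty data flushed
once it is 1 s old. For the 4k random write case, the load I have in
mind is roughly what you would get from a guest running something like
"fio --rw=randwrite --bs=4k --iodepth=32 --direct=1 --size=1G
--name=test" (illustrative command, not my exact setup).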
----- Original Message -----
From: "Mark Nelson" <mark.nelson@xxxxxxxxxxx>
To: "Sage Weil" <sage@xxxxxxxxxxx>
Cc: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx>
Sent: Saturday, June 23, 2012 18:40:28
Subject: Re: filestore flusher = false, correct my problem of constant write (need info on this parameter)

On 6/23/12 10:38 AM, Sage Weil wrote:
> On Fri, 22 Jun 2012, Alexandre DERUMIER wrote:
>> Hi Sage,
>> thanks for your response.
>>
>>>> If you turn off the journal completely, you will see bursty write commits
>>>> from the perspective of the client, because the OSD is periodically doing
>>>> a sync or snapshot and only acking the writes then.
>>>> If you enable the journal, the OSD will reply with a commit as soon as the
>>>> write is stable in the journal. That's one reason why it is there: file
>>>> system commits are heavyweight and slow.
>>
>> Yes, of course; I don't want to deactivate the journal. Using a journal
>> on a fast SSD or NVRAM is the right way.
>>
>>>> If we left the file system to its own devices and did a sync every 10
>>>> seconds, the disk would sit idle while a bunch of dirty data accumulated
>>>> in cache, and then the sync/snapshot would take a really long time. This
>>>> is horribly inefficient (the disk is idle half the time), and useless (the
>>>> delayed write behavior makes sense for local workloads, but not servers
>>>> where there is a client on the other end batching its writes). To prevent
>>>> this, 'filestore flusher' will prod the kernel to flush out any written
>>>> data to the disk quickly. Then, when we get around to doing the
>>>> sync/snapshot it is pretty quick, because only fs metadata and
>>>> just-written data need to be flushed.
>>
>> Mmm, I disagree.
>>
>> If you flush quickly, it works fine for a sequential write workload.
>>
>> But if you have a lot of random writes with 4k blocks, for example, you
>> are going to get a lot of disk seeks. The way ZFS or NetApp SAN storage
>> works, they take random writes into a fast journal and then flush them
>> sequentially to slow storage every 20 s or so.
>
> Oh, I see what you're getting at. Yes, that is not ideal for small random
> writes. There is a branch in ceph.git called wip-flushmin that just sets
> a minimum write size for the flush; that will probably do a decent job of
> dealing with this: small writes won't get flushed, large ones will.
> Picking the right value will depend on how expensive seeks are for your
> storage system.
>
> You'll want to cherry-pick just the top commit on top of whatever it is
> you're running...

I was just talking with Elder on IRC yesterday about looking into how
much small network transfers are hurting us in cases like these. Even
with SSD-based OSDs I haven't seen a very dramatic improvement in small
request performance. How tough would it be to aggregate requests into
larger network transactions? There would be a latency penalty of
course, but we could flush a client-side dirty cache pretty quickly and
still benefit if we are getting bombarded with lots of tiny requests.

Mark

--
Alexandre DERUMIER
Systems and Network Engineer
Phone: 03 20 68 88 85
Fax: 03 20 68 90 88
45 Bvd du Général Leclerc, 59100 Roubaix
12 rue Marivaux, 75002 Paris
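To make the aggregation idea concrete, here is a minimal sketch of a
client-side write coalescer (hypothetical names and thresholds, not
Ceph code) that buffers small writes and ships them as one larger
network transaction:

import time

class WriteCoalescer:
    def __init__(self, send_fn, max_bytes=64 * 1024, max_delay=0.005):
        self.send_fn = send_fn      # ships one batched request downstream
        self.max_bytes = max_bytes  # flush once this much data is dirty
        self.max_delay = max_delay  # or once the batch is this old (seconds)
        self.batch = []             # buffered (offset, data) writes
        self.batch_bytes = 0
        self.first_write = None     # timestamp of the oldest buffered write

    def write(self, offset, data):
        if self.first_write is None:
            self.first_write = time.monotonic()
        self.batch.append((offset, data))
        self.batch_bytes += len(data)
        # A real implementation would also arm a timer here, so a lone
        # small write is not stuck waiting for the next write to arrive.
        if (self.batch_bytes >= self.max_bytes
                or time.monotonic() - self.first_write >= self.max_delay):
            self.flush()

    def flush(self):
        if self.batch:
            self.send_fn(self.batch)  # one transaction instead of many
        self.batch, self.batch_bytes, self.first_write = [], 0, None

# Example: 1000 4k writes collapse into ~63 transactions instead of 1000.
sent = []
c = WriteCoalescer(send_fn=sent.append)
for i in range(1000):
    c.write(i * 4096, b"\x00" * 4096)
c.flush()
print("transactions sent:", len(sent))

The trade-off is the same one driving the flusher discussion above:
batch too little and you pay per-request overhead (network round trips
here, disk seeks in the filestore case); batch too much and you pay
latency. Presumably wip-flushmin exposes exactly this on the OSD side,
as a minimum write size below which the flusher leaves data for the
periodic sync to pick up.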