Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

Sage Weil <sage@xxxxxxxxxxx> · Sat, 23 Jun 2012 08:38:22 -0700 (PDT)

On Fri, 22 Jun 2012, Alexandre DERUMIER wrote:
> Hi Sage,
> thanks for your response.
> 
> >>If you turn off the journal completely, you will see bursty write commits 
> >>from the perspective of the client, because the OSD is periodically doing 
> >>a sync or snapshot and only acking the writes then. 
> >>If you enable the journal, the OSD will reply with a commit as soon as the 
> >>write is stable in the journal. That's one reason why it is there--file 
> >>system commits are heavyweight and slow. 
> 
> Yes, of course. I don't want to deactivate the journal; using a journal on a fast SSD or NVRAM is the right way.
> 
> >>If we left the file system to its own devices and did a sync every 10 
> >>seconds, the disk would sit idle while a bunch of dirty data accumulated 
> >>in cache, and then the sync/snapshot would take a really long time. This 
> >>is horribly inefficient (the disk is idle half the time), and useless (the 
> >>delayed write behavior makes sense for local workloads, but not servers 
> >>where there is a client on the other end batching its writes). To prevent 
> >>this, 'filestore flusher' will prod the kernel to flush out any written 
> >>data to the disk quickly. Then, when we get around to doing the 
> >>sync/snapshot it is pretty quick, because only fs metadata and 
> >>just-written data needs to be flushed. 
> 
> mmm, I disagree.
> 
> If you flush quickly, it works fine with a sequential write workload.
> 
> But if you have a lot of random writes with 4k blocks, for example, you are 
> going to have a lot of disk seeks. The way ZFS or NetApp SAN 
> storage works, they take random writes into a fast journal, then flush them 
> sequentially every 20s to slow storage.

Oh, I see what you're getting at.  Yes, that is not ideal for small random 
writes.  There is a branch in ceph.git called wip-flushmin that just sets 
a minimum write size for the flush that will probably do a decent job of 
dealing with this: small writes won't get flushed, large ones will.  
Picking the right value will depend on how expensive seeks are for your 
storage system.
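
As a rough illustration of the idea only (this is not the code in 
wip-flushmin; the names, the 64 KB threshold, and the ">=" comparison are 
all assumptions made for the sketch), a minimum flush size amounts to 
something like this:

```python
# Hypothetical sketch: none of these names come from the Ceph source tree.
FLUSH_MIN_BYTES = 64 * 1024  # assumed threshold; tune to how costly seeks are


def should_flush(write_len, flush_min=FLUSH_MIN_BYTES):
    """Return True if a write is large enough to be flushed right away."""
    return write_len >= flush_min


def split_writes(write_lens, flush_min=FLUSH_MIN_BYTES):
    """Partition write sizes into (flushed_now, left_for_next_sync)."""
    flushed = [n for n in write_lens if should_flush(n, flush_min)]
    deferred = [n for n in write_lens if not should_flush(n, flush_min)]
    return flushed, deferred


if __name__ == "__main__":
    # A mix of small random writes and large sequential ones.
    writes = [4096, 4096, 4 * 1024 * 1024, 8192, 1024 * 1024]
    now, later = split_writes(writes)
    print("flushed immediately:", now)
    print("left for the periodic sync:", later)
```

With a threshold like that, the 4k random writes stay in the page cache 
until the next sync/snapshot, while the large writes are still pushed out 
early.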

You'll want to cherry-pick just the top commit on top of whatever it is 
you're running...

sage

> 
> To compare with ZFS or NetApp, I can achieve around 20000 IO/s on random 
> 4K writes with 4GB of NVRAM and 10 x 7200 rpm disks.
> 
> With Ceph, I'm at around 2000 IO/s with the same config (3 nodes with 
> 10 x 7200 rpm disks, 2x replication), so around the real disk IO limit 
> without any write cache.
> 
> 
> So for now, I think I'm going to use SSDs for my OSDs, since I have an 80% 
> random write workload (no seeks, so no problem with constant random writes).
> 
> 
> 
> BTW: maybe the wiki is wrong
> http://ceph.com/wiki/OSD_journal
> section Motivation
> "Enterprise products like NetApp filers "cheat" by journaling all writes to NVRAM and then taking their time to flush things out to disk efficiently. This gives you very low-latency writes _and_ efficient disk IO at the expense of hardware."
> 
> This is why I thought Ceph worked like this.
> 
> 
> Thanks again,
> 
> -Alexandre
> 
> ----- Original Message ----- 
> 
> From: "Sage Weil" <sage@xxxxxxxxxxx> 
> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx> 
> Cc: ceph-devel@xxxxxxxxxxxxxxx, "Mark Nelson" <mark.nelson@xxxxxxxxxxx>, "Stefan Priebe" <s.priebe@xxxxxxxxxxxx> 
> Sent: Thursday, 21 June 2012 18:03:45 
> Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter) 
> 
> Hi Alexandre, 
> 
> [Sorry I didn't follow up earlier; I didn't understand your question.] 
> 
> If you turn off the journal completely, you will see bursty write commits 
> from the perspective of the client, because the OSD is periodically doing 
> a sync or snapshot and only acking the writes then. 
> 
> If you enable the journal, the OSD will reply with a commit as soon as the 
> write is stable in the journal. That's one reason why it is there--file 
> system commits are heavyweight and slow. 
> 
> If we left the file system to its own devices and did a sync every 10 
> seconds, the disk would sit idle while a bunch of dirty data accumulated 
> in cache, and then the sync/snapshot would take a really long time. This 
> is horribly inefficient (the disk is idle half the time), and useless (the 
> delayed write behavior makes sense for local workloads, but not servers 
> where there is a client on the other end batching its writes). To prevent 
> this, 'filestore flusher' will prod the kernel to flush out any written 
> data to the disk quickly. Then, when we get around to doing the 
> sync/snapshot it is pretty quick, because only fs metadata and 
> just-written data needs to be flushed. 
> 
> So: the behavior you're seeing is normal, and good. 
> 
> Did I understand your confusion correctly? 
> 
> Thanks! 
> sage 
> 
> 
> On Wed, 20 Jun 2012, Alexandre DERUMIER wrote: 
> 
> > Hi, 
> > I have tried to disable filestore flusher 
> > 
> > filestore flusher = false 
> > filestore max sync interval = 30 
> > filestore min sync interval = 29 
> > 
> > 
> > in osd config. 
> > 
> > 
> > Now I see a proper sync every 30s when running rados bench: 
> > 
> > rados -p pool3 bench 60 write -t 16 
> > 
> > 
> > seekwatcher movie: 
> > 
> > 
> > before 
> > ------ 
> > http://odisoweb1.odiso.net/seqwrite-radosbench-flusherenable.mpg 
> > 
> > after 
> > ----- 
> > http://odisoweb1.odiso.net/seqwrite-radosbench-flusherdisable.mpg 
> > 
> > 
> > Shouldn't this be the normal behaviour? What exactly is filestore flusher vs syncfs? 
> > 
> > 
> > 
> > This seems to work fine with rados bench, 
> > but when I launch a benchmark with fio from my guest VM, I again see constant writes. 
> > (I'll try to debug that today) 
> > 
> > 
> > My target is to be able to absorb small random writes and write them out every 30s. 
> > 
> > Regards, 
> > 
> > Alexandre 
> > -- 
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> > the body of a message to majordomo@xxxxxxxxxxxxxxx 
> > More majordomo info at http://vger.kernel.org/majordomo-info.html 
> > 
> > 