On Thu 11-10-12 13:58:00, Viktor Nagy wrote:
> >>>>>The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
> >>>>>writeback when grabbing pages to begin a write". At first sight I was
> >>>>>somewhat surprised when I saw that code path in the traces but later when I
> >>>>>did some math it's clear. What the commit does is that when a page is just
> >>>>>being written out to disk, we don't allow its contents to be changed and
> >>>>>wait for IO to finish before letting the next write proceed. Now if you have
> >>>>>1 GB file, that's 256000 pages. By the observation from my test machine,
> >>>>>writeback code keeps around 10000 pages in flight to disk at any moment
> >>>>>(this number fluctuates a lot but average is around that number). Your
> >>>>>program dirties about 25600 pages per second. So the probability one of
> >>>>>dirtied pages is a page under writeback is equal to 1 for all practical
> >>>>>purposes (precisely it is 1-(1-10000/256000)^25600). Actually, on average
> >>>>>you are going to hit about 1000 pages under writeback per second which
> >>>>>clearly has a noticeable impact (even single page can have). Pity I didn't
> >>>>>do the math when we were considering those patches.
> >>>>>
> >>>>>There were plans to avoid waiting if underlying storage doesn't need it but
> >>>>>I'm not sure how far those plans got (added a couple of relevant CCs).
> >>>>>Anyway you are about the second or third real workload that sees regression due
> >>>>>to "stable pages" so we have to fix that sooner rather than later... Thanks
> >>>>>for your detailed report!
> >>>>We develop a game server which gets very high load in some
> >>>>countries. We are trying to serve as many players as possible with
> >>>>one server.
> >>>>Currently the CPU usage is below 50% at the peak times. And with
> >>>>the old kernel it runs smoothly. The pdflush runs non-stop on the
> >>>>database disk with ~3 MByte/s write (minimal read).
> >>>>This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
> >>>>I think we are still below the theoretical limits of this server...
> >>>>but only if the disk writes are never done in sync.
> >>>>
> >>>>I will try the 3.2.31 kernel without the problematic commit
> >>>>(3d08bcc8 "mm: Wait for writeback when grabbing pages to begin a
> >>>>write").
> >>>>Is it a good idea? Will it be worse than 2.6.32?
> >>>  Running without that commit should work just fine unless you use
> >>>something exotic like DIF/DIX or similar. Whether things will be worse than
> >>>in 2.6.32 I cannot say. For me, your test program behaves fine without that
> >>>commit but whether your real workload won't hit some other problem is
> >>>always a question. But if you hit another regression I'm interested in
> >>>hearing about it :).
> >>I've just tested it. After I've set the dirty_bytes over the file
> >>size the writes are never blocked.
> >>So it's working nicely without the mentioned commit.
> >>
> >>The problem is that if you read the kernel's documentation about the
> >>dirty page handling it does not work that way (with the commit). It
> >>works unpredictably.
> >  Which documentation do you mean exactly? The process won't be throttled
> >because of dirtying too much memory but we can still block it for other
> >reasons - e.g. because we decide to evict its code from memory and have to
> >reload it again when the process gets scheduled. Or we can block during
> >memory allocation (which may be needed to allocate a page you write to) if
> >we find it necessary. There are no promises really...
> >
> Ok, it is very hard to get an overview about this whole thing.
> I thought I understood the behaviour checking the file
> Documentation/sysctl/vm.txt:
>
> "
> dirty_bytes
>
> Contains the amount of dirty memory at which a process generating
> disk writes
> will itself start writeback.
> ...
> "
>
> Ok, it does not say exactly that other things can have an influence too.
>
> Several people are trying to get over the problem caused by the
> commit by setting the value of /sys/block/sda/queue/nr_requests to
> 4 (from 128).
> This helped a lot but was not enough for us.
  Yes, that reduces the amount of IO in flight at any moment so it reduces
the chances you will wait in grab_cache_page_write_begin(). But it also
reduces throughput...

> I attach two performance graphs which show our own CPU usage
> measurement (red). One minute averages, the blue line is the SQL
> time %.
>
> And a nice question: Without reverting the patch is it possible to
> get a smooth performance (in our case)?
  I don't know how to fix the issue without reverting the patch. Sorry.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
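
For reference, the back-of-the-envelope estimate quoted at the top of this
thread can be reproduced with a few lines of Python. This is only a sketch:
it assumes every dirtied page is chosen uniformly and independently from the
file, which a real workload will not do exactly, and the function name is
made up for illustration. The three numbers are the rough figures from the
thread, not new measurements.

```python
def writeback_collision_estimate(file_pages, in_flight_pages, dirtied_per_sec):
    """Estimate how often a dirtied page is already under writeback.

    Assumes each dirtied page is picked uniformly at random from the file,
    independently of the others (a simplification, not a kernel guarantee).
    """
    # Chance that one given write lands on a page currently under writeback.
    p_single = in_flight_pages / file_pages
    # Chance that at least one write in a second hits such a page:
    # 1 - (1 - p_single)^dirtied_per_sec, the expression from the thread.
    p_any = 1.0 - (1.0 - p_single) ** dirtied_per_sec
    # Expected number of writes per second that must wait for IO.
    expected_hits = dirtied_per_sec * p_single
    return p_single, p_any, expected_hits


if __name__ == "__main__":
    # Figures as quoted in the thread (rough averages from one test machine).
    p_single, p_any, expected_hits = writeback_collision_estimate(
        file_pages=256000, in_flight_pages=10000, dirtied_per_sec=25600
    )
    print("per-write probability:", p_single)
    print("probability of at least one hit per second:", p_any)
    print("expected hits per second:", expected_hits)
```

Under the same assumption, lowering the number of pages in flight (which is
what a smaller nr_requests does indirectly) scales the expected number of
blocked writes down linearly, but it never reaches zero.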