On 11.04.2013 22:30, Zlatko Calusic wrote:
This is something that I've been chasing for months, and I'm getting tired of it. :( The issue has been observed on 4GB RAM x86_64 machines (one server, one desktop) without a swap subsystem (not even compiled in). The important thing to remember about a 4GB x86_64 machine is that the NORMAL zone is about 6 times smaller than the DMA32 zone. As a picture is worth 10000 words, I've attached two graphs that nicely show what I've observed. As memory usage slowly rises, the MM subsystem gradually evicts pagecache pages from the NORMAL zone, trying to eventually get rid of all of them! This process takes days, typically more than 5 on this particular server. Of course, this means that eventually the zone will be chock-full of anon pages, and without swap, the kernel can't do much about it. But as it tries to balance the zone, various bad things will happen. On the server I've seen sudden freeing of hundreds of MB of pagecache, on the desktop there's a general slowdown, sound dropouts (HTTP streaming) and so...
Konstantin's excellent patch, better described at http://marc.info/?l=linux-mm&m=136731974301311, is already giving some useful additional insight into this problem, just as I expected. Here's the data after 31h of server uptime (also see the attached graph):
Node 0, zone    DMA32
    nr_inactive_file 443705
    avg_age_inactive_file: 362800
Node 0, zone   Normal
    nr_inactive_file 32832
    avg_age_inactive_file: 38760

I reckon that only the aging of the inactive LRU lists is of interest at the moment, because there's currently a streaming I/O of about 8MB/s that can be seen on the graphs. Here's how I decipher the numbers:
DMA32: 443705 pages * 4k ~ 1733MB, 362800 ms = 362.8 seconds to go through the LRU and replace each page in it, which finally gives: 1733/362.8 ~ 4.78 MB/s (approx speed at which the reclaim is goin' on)
Normal zone: 32832*4/1024/38.76 ~ 3.31 MB/s

Check: 4.78 + 3.31 ~ 8 MB/s (just about the rate of the read I/O from the disk)
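For anyone who wants to redo the arithmetic above with their own numbers, here is a minimal Python sketch. It assumes 4 KiB pages and that avg_age_inactive_file is in milliseconds, as in the calculation above; the function name reclaim_rate_mb_s is made up for illustration and is not part of any kernel interface.

```python
PAGE_KB = 4  # assumption: 4 KiB pages, as used in the calculation above


def reclaim_rate_mb_s(nr_pages, avg_age_ms):
    """LRU size in MB divided by the time (in seconds) needed to cycle through it."""
    size_mb = nr_pages * PAGE_KB / 1024
    return size_mb / (avg_age_ms / 1000)


# (nr_inactive_file, avg_age_inactive_file) copied from the data above
zones = {
    "DMA32": (443705, 362800),
    "Normal": (32832, 38760),
}

rates = {name: reclaim_rate_mb_s(n, age) for name, (n, age) in zones.items()}
total = sum(rates.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} MB/s")
print(f"total: {total:.2f} MB/s")
```

The two per-zone rates should add up to roughly the observed streaming read rate, which is the sanity check used above.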
So, if my calculations are right and my model makes sense (Konstantin, chime in if I got something wrong!), the reclaim is going through the pages in those 2 zones at a very similar speed, although there are already 13 times fewer pages in the Normal zone available for streaming I/O caching. If this behavior continues when the Normal zone gets practically washed out of file pages (guaranteed in a few days), then we will measure the TTL of pages in the Normal zone in milliseconds. Not a very useful cache, you'll agree. Of course, it's not a problem for streaming reads, but dirty pages that end up there will be written out practically synchronously, and then it's no wonder that the desktop at those moments starts behaving worse than a trusty old 486DX2 with 16MB of RAM once was. :(
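To put rough numbers on that worry, here is a small Python sketch of the same model. It assumes the Normal zone keeps being reclaimed at the ~3.31 MB/s computed earlier while its inactive file LRU shrinks. Only the first size (32832 pages) is measured; the two smaller sizes are hypothetical values picked for illustration, and ttl_ms is a made-up helper name.

```python
PAGE_KB = 4               # assumption: 4 KiB pages
NORMAL_RATE_MB_S = 3.31   # Normal zone reclaim rate from the calculation earlier


def ttl_ms(nr_pages, rate_mb_s=NORMAL_RATE_MB_S):
    """Time in ms a page survives if the whole LRU is cycled at rate_mb_s."""
    size_mb = nr_pages * PAGE_KB / 1024
    return size_mb / rate_mb_s * 1000


# 32832 is the measured size; 4096 and 256 are hypothetical future sizes
for nr_pages in (32832, 4096, 256):
    print(f"{nr_pages:6d} pages -> {ttl_ms(nr_pages):8.0f} ms")
```

Under these assumptions the page lifetime scales linearly with the LRU size, so once the list drops to a megabyte or so, pages survive for well under a second.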
The only question I have is: is this a design mistake, or a plain bug?

I strongly believe that pages should be reclaimed at a speed appropriate to the LRU size. After all, all those pages are the same as far as I/O is concerned, so there's no reason to throw out some pages after only 38 seconds while others are privileged to spend 6 minutes in memory. Those are the numbers from the data above, and we'll see by the end of the following week how bad it can really get.
This imbalance is possibly the main reason why file pages are pushed out of the Normal zone too aggressively in the first place. Probably, if we can balance the reclaim speed, the whole problem would disappear. It looks like faster reclaim in the smaller zone manages to throw out more file pages from it (anon pages replace them more easily), which in turn makes the file LRUs even smaller, which produces even faster reclaim, which... you get the idea, kind of a positive feedback loop that feeds on itself. The kind that always ends up with a bang. ;)
-- Zlatko
Attachment:
screenshot12.png
Description: PNG image