[PATCH] UnbufferedFile improvements v2

Ralf Müller wrote:
> Artur Skawina wrote:
> 
>> well, vdr w/ the recent cUnbufferedFile changes was flushing the data
>> buffers in huge bursts; this was even worse than slowly filling up the
>> caches -- the large (IIRC ~10M) bursts caused latency problems (apps
>> visibly freezing etc.).
> 
> Does this freezing apply to local disk access, or only to network
> filesystems? My personal VDR is a system dedicated to VDR usage which
> uses a local hard disk for storage. So I have no applications running
> in parallel to vdr which could freeze, nor can I actually test the
> behaviour on network devices. It seems you have both of these extra
> features, so it would be nice to know more about this.

The freezing certainly applies to NFS -- it shows up clearly if you have
some kind of monitor app graphing network traffic. It may just be the
huge amount of data shifted and the associated CPU load, but the delays
are noticeable for non-rt apps running on the same machine. It's rather
obvious when e.g. watching TV using xawtv while recording.
As to the local disk case -- I'm not sure of the impact -- most of my
vdr data goes over NFS, and this was what made me look at the code.
There could be less of a problem w/ local disks, or I simply didn't
notice the correlation w/ vdr activity, as, unlike for network traffic,
I do not have a local IO graph on screen :)

(I _think_ I verified w/ vmstat that local disks were not immune to this,
but right now I no longer remember the details, so I can't really be sure.)

> For local usage I found that IO interruptions of less than a second (10
> MB burst writes on disks which can handle a hell of a lot more than
> 10MB/s) have no negative side effects. But I can imagine that on 10Mbit
> ethernet it could be hard to absorb these bursts ... I did not think
> about this when writing the initial patch ...

It's a problem even on 100Mbit -- while the fileserver certainly can
accept sustained 10M/s of data for several seconds (at least), it's the
client, i.e. the vdr box, that does not behave well -- it sits almost
completely idle for minutes (zero network traffic, no writeback at all),
and then goes busy for a second or so.
I first tried various priority changes, but didn't see any visible
improvement. Having vdr running at low priority isn't really an option
anyway.

Another issue could be the fsync calls -- at least on ext3 these
apparently behave very similarly to sync(2)...
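
(For reference, the kind of periodic-sync write path I mean looks roughly
like this -- a sketch, not the actual vdr code; SyncingWrite and
WRITE_CHUNK are made-up names for illustration:

#include <unistd.h>

enum { WRITE_CHUNK = 10 * 1024 * 1024 };  // ~10M, matching the bursts above

// Write, and force the data out every WRITE_CHUNK bytes. On ext3 in the
// default ordered mode fdatasync() tends to force a full journal commit,
// flushing unrelated dirty data too -- hence the sync(2)-like behavior
// mentioned above.
ssize_t SyncingWrite(int fd, const void *data, size_t len, size_t &written)
{
   ssize_t r = write(fd, data, len);
   if (r > 0) {
      written += r;
      if (written >= WRITE_CHUNK) {
         fdatasync(fd);   // blocks until this file's data is on disk
         written = 0;
      }
   }
   return r;
}
)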

>> This patch makes vdr use a much more aggressive disk access strategy.
>> Writes are flushed out almost immediately and the IO is more evenly
>> distributed. While recording and/or replaying the caches do not grow and
>> when vdr is done accessing a video file all cached data from that file
>> is dropped.
> 
> Actually with the patch you attached my cache _does_ grow. It does not
> only grow - it displaces the inode cache, which the initial patch was
> created to avoid. To make it worse, when cutting a recording and
> replaying the newly cut recording at the same time, I get major hangs
> in replay.

Oh, the cutting-trashes-the-cache-a-bit isn't really such a big surprise --
I was seeing something like that while testing the code -- I had hoped
the extra fadvise every 10M would fix that, but I wanted to get the
recording and replay cases right first. (The issue when cutting is
simply that we need to: a) start the writeback, and b) drop the cached
data after it has hit the disk. The problem is that we don't really know
when to do b... For low write rates the heuristic seems to work; for high
rates it might fail. Yes, fdatasync obviously would work, but that is the
sledgehammer approach :) The fadvise(0,0) solution was a first try at
using a slightly smaller hammer; see the sketch below. Keeping a dirty
list and flushing it after some time would be the next step if fadvise
isn't enough.)
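
In code, the a)+b) strategy looks roughly like this (a sketch of the idea
under discussion, not the actual patch -- FlushAndDrop, FLUSH_CHUNK and
DROP_LAG are made up, and the 10M lag is exactly the heuristic that can
fail at high write rates):

#include <fcntl.h>
#include <unistd.h>

enum { FLUSH_CHUNK = 1 * 1024 * 1024,     // a) start writeback every ~1M
       DROP_LAG    = 10 * 1024 * 1024 };  // b) drop pages ~10M behind

void FlushAndDrop(int fd, off_t &flushed, off_t curpos)
{
   if (curpos - flushed >= FLUSH_CHUNK) {
      // a) on still-dirty pages DONTNEED starts asynchronous writeback
      posix_fadvise(fd, flushed, curpos - flushed, POSIX_FADV_DONTNEED);
      flushed = curpos;
   }
   if (flushed > DROP_LAG) {
      // b) pages this far behind should be clean by now, so a second
      //    DONTNEED can actually drop them from the cache
      posix_fadvise(fd, 0, flushed - DROP_LAG, POSIX_FADV_DONTNEED);
   }
}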

How does the cache behave when _not_ cutting? Over here it looks OK --
I've done several recordings while playing back others, and the cache
stayed basically the same. (As this is not a dedicated vdr box it is,
however, sometimes hard to be sure.)

> I had a look at your patch - it looked very good. But for whatever
> reason it doesn't do what it is supposed to do on my VDR. I currently
> don't know why it doesn't work here for replay - the code there looked
> good.

In v1 I was using a relatively small readahead window -- maybe for a
slow disk it was _too_ small. In v2 it's a little bigger, maybe that
will help (I increased it to make sure the readahead worked for
fast-forward, but so far I haven't been able to see much difference);
the sketch below shows the shape of it. But I don't usually replay
anything while cutting, so this hasn't really been tested...
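
The read side amounts to something like this (again a sketch with made-up
names and numbers, not the patch itself):

#include <fcntl.h>
#include <sys/types.h>

enum { READAHEAD = 2 * 1024 * 1024 };  // v2 window; v1 was smaller

// Before reading at curpos, hint the kernel to prefetch the next
// READAHEAD bytes. WILLNEED is asynchronous, so this doesn't block.
void ReadAhead(int fd, off_t curpos)
{
   posix_fadvise(fd, curpos, READAHEAD, POSIX_FADV_WILLNEED);
}

For fast-forward the window has to be large enough to cover the skips,
which is the reason for the bigger v2 value.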

(BTW, with the added readahead in the v2 patch, vdr seems to come close
to saturating a 100M connection when cutting. Even when _both_ the
source and destination are on the same NFSv3-mounted disk, which kind
of surprised me. The LocalDisk->NFS rate, and vice versa, seems to be
limited by the network. I didn't check localdisk->localdisk (lack of
sufficient disk space). I didn't do any real benchmarking; these are
estimates based on observing the rate at which free disk space
decreases, and the network traffic.)

> I like the heuristics you used to deal with readahead - but maybe these
> lead to the leaks I'm experiencing here. I will have a look at it. Maybe
> I can find out something about it ...

Please do -- I wrote and posted this to get others to look at that code
and hopefully come up w/ a strategy which works for everyone.
For cutting I was going to switch to O_DIRECT, until I realized we would
then still need a fallback strategy for old kernels and NFS...

The current vdr behavior isn't really acceptable -- at the very least
the fsyncs have to be configurable -- even a few hundred megabytes
needlessly dirtied by vdr would still be much better than the bursts of
traffic, disk and CPU usage.
I personally don't mind the cache thrashing so much; it would be enough
to keep vdr happily running in the background without disturbing other
tasks. (One of the reasons is that while keeping the recording list in
cache seems to help local disks, it doesn't really help for NFS -- you
still get lots of NFS traffic every time vdr decides to reread the
directory structure. As both the client and the server could fit the
dir tree in RAM, the limiting factor becomes the network latency.)

>> I've tested this w/ both local disks and NFS mounted ones, and it seems
>> to do the right thing. Writes get flushed every 1..2s at a rate of
>> .5..1M/s instead of the >10M bursts. 
> 
> To be honest - I did not find the place where writes get flushed in
> your patch. posix_fadvise() doesn't seem to influence flushing at all.

Hmm, what glibc/kernel?
It works here w/ glibc-2.3.90 and linux-2.6.14.

Here's "vmstat 1" output; vdr (patched 1.3.36) is currently doing a
recording to local disk:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  1  0   9168 202120    592  22540    0    0     0     0 3584  1350  0  1 99  0
  0  0   9168 202492    592  22052    0    0     0   800 3596  1330  1  0 99  0
  0  0   9168 202368    592  22356    0    0     0     0 3576  1342  1  0 99  0
  0  0   9168 202492    592  21836    0    0     0   804 3628  1350  0  0 100  0
  0  0   9168 202492    592  22144    0    0     0     0 3573  1346  1  1 98  0
  0  0   9168 202244    592  22452    0    0     0     0 3629  1345  1  0 99  0
  1  0   9168 202492    592  21956    0    0     0   800 3562  1350  0  0 100  0
  0  0   9168 202368    592  22260    0    0     0     0 3619  1353  1  0 99  0
  0  0   9168 202120    592  22568    0    0     0     0 3616  1357  1  1 98  0
  0  0   9168 202492    592  22044    0    0     0   952 3617  1336  0  0 100  0
  0  0   9168 202368    596  22352    0    0     0     0 3573  1356  1  0 99  0
  1  0   9168 202616    596  21724    0    0     0   660 3609  1345  0  0 100  0
  0  0   9168 202616    596  22000    0    0     0     0 3569  1338  1  1 98  0
  0  0   9168 202368    596  22304    0    0     0     0 3573  1335  1  0 99  0
  1  0   9168 202492    596  21956    0    0     0   896 3644  1360  0  1 99  0
  0  0   9168 202492    596  22232    0    0     0     0 3592  1327  1  0 99  0
  0  0   9168 202120    596  22536    0    0     0     0 3571  1333  0  0 100  0
  0  0   9168 202616    596  21968    0    0     0   800 3575  1329 11  3 86  0
  0  0   9168 202368    596  22244    0    0     0     0 3604  1350  1  0 99  0
  0  0   9168 202492    596  21756    0    0     0   820 3585  1326  0  1 99  0
  0  0   9168 202492    612  22060    0    0     8   140 3632  1369  1  1 89  9
  0  0   9168 202244    612  22336    0    0     0     0 3578  1328  1  0 99  0
  0  0   9168 202492    612  21796    0    0     0   784 3619  1360  0  0 100  0
  0  0   9168 202492    628  22072    0    0     8   104 3559  1317  2  0 96  2
  0  0   9168 202244    632  22376    0    0     0     0 3604  1348  1  0 99  0
  0  0   9168 202492    632  21904    0    0     0   800 3695  1402  0  0 100  0
  0  0   9168 202368    632  22180    0    0     0     0 3775  1456  1  1 98  0
  0  0   9168 202120    632  22484    0    0     0     0 3699  1416  0  1 99  0
  0  0   9168 202492    632  21992    0    0     0   804 3774  1465  1  0 99  0
  1  0   9168 202236    632  22268   32    0    32     0 3810  1570  3  1 93  3
  0  0   9168 202360    632  21776    0    0     0   820 3896  1690  1  1 98  0

The 'bo' column shows the writeout caused by vdr. Also note the 'free'
and 'cache' fields fluctuate a bit, but do not grow. Hmm, I've just
noticed the slowly growing 'buff' -- is this what is causing you
problems? I didn't mind it here, as there's clearly plenty of free RAM
around. I will have to investigate what happens under some memory
pressure.

Are you saying you don't get any writeback activity w/ my patch?

With no posix_fadvise and no fdatasync calls in the write path I get
almost no steady writeout, just multi-megabyte bursts every minute
(probably triggered by the ext3 journal commit (interval set to 60s)
and/or memory pressure).

> It only applies to already written buffers. So the normal write

/usr/src/linux/mm/fadvise.c should contain the implementation of the
various fadvise modes in a Linux 2.6 kernel. It certainly does trigger
writeback here -- both in the local disk case and on NFS, where it causes
a similar traffic pattern.
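
If in doubt, a quick standalone check (made up for this mail, not taken
from the patch) is to dirty some page cache, fadvise it away, and watch
the 'bo' column in "vmstat 1" -- with a working posix_fadvise the
writeout starts immediately; otherwise nothing happens until the kernel
flushes on its own:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
   static char buf[1024 * 1024];
   memset(buf, 0xab, sizeof(buf));
   int fd = open("fadvise-test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
   if (fd < 0) {
      perror("open");
      return 1;
   }
   for (int i = 0; i < 50; i++)                // dirty ~50M of page cache
      if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
         perror("write");
         return 1;
      }
   // posix_fadvise returns an errno-style value instead of setting errno
   int r = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  // whole file
   printf("posix_fadvise: %s\n", r ? strerror(r) : "ok");
   sleep(30);   // writeback should already be visible in vmstat by now
   close(fd);
   return 0;
}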

> strategy is used with your patch - collect data until the kernel
> decides to write it to disk. This leads to "collect about 300MB" here,
> followed by an up-to-300MB burst. This is a bit heavier than the
> 10MB bursts before ;)

See the vmstat output above. Are you sure you have a working
posix_fadvise? If not, that would also explain the hang during playback,
as no readahead was actually taking place... (To be honest, I don't
think that you need any manual readahead at all in a normal-playback
situation, especially as the kernel will do some by default. It's only
when the disk is getting busier that the benefits of readahead show up.
At least this is what I saw here.)
What happens when you start a replay and then end it? Is the memory
freed immediately?
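
For reference, the close path is meant to end up doing roughly the
following (a sketch of the intent, with a made-up name, not the actual
patch code), so ending a replay should free the cached pages right away:

#include <fcntl.h>
#include <unistd.h>

int CloseDroppingCache(int fd)
{
   fdatasync(fd);                       // make sure nothing is left dirty
   posix_fadvise(fd, 0, 0,              // (0,0) = the whole file
                 POSIX_FADV_DONTNEED);  // drop its cached pages
   return close(fd);
}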


Thanks for testing and the feedback.

Regards,

artur