[PATCH] UnbufferedFile improvements v2

art_k at o2.pl (Artur Skawina) · Mon Nov 21 19:05:58 2005

Ralf Müller wrote:
> On Montag 21 November 2005 02:15, Artur Skawina wrote:
>> client, ie vdr-box, that does not behave well -- it sits almost
>> completely idle for minutes (zero network traffic, no writeback at
>> all), and then goes busy for a second or so.
> 
> But this very much sounds like a NFS-problem - and much less like a VDR 
> problem ...

This is perfectly normal behavior; it's the same as for the local disk case.
Since the vdr box isn't under any memory pressure, it collects all the writes.
If not for the fdatasyncs, it would start writing the data out asynchronously
after some time, once it needed some RAM or had too many dirty pages. But vdr
does not let it do that -- after 10M it asks the system to commit all the data
to disk and return status. So the box does just that -- flushes the data as fast
as possible in order to complete the synchronous request.
This is where fadvise(DONTNEED) helps -- it tells the system that we're not
going to access the written data any time soon, so it starts committing that
buffered data back to disk immediately. Just as it would if it were under memory
pressure, except now there is none; and once the data gets to disk it no longer
needs to be treated as dirty and can be easily freed.

>> [...] I had
>> hoped the extra fadvice every 10M would fix that, but i wanted to get
>> the recording and replay cases right first. (the issue when cutting
>> is simply that we need: a) start the writeback, and b) drop the
>> cached data after it has hit the disk. The problem is that we don't
>> really know when to do b...
> 
> Thats exactly the problem here ... without special force my kernel seems 
> to prefer to use memory instead of disk ...

if you have told it to do exactly that, using that reiserfs setting mentioned 
below, well, i guess it tries to do its best to obey :)

>> For low write rates the heuristic seems 
>> to work, for high rates it might fail. Yes, fdatasync obviously will
>> work, but this is the sledgehammer approach :)
> 
> I know. I also don't like this approach. But at least it worked (here). 
> 
>> The fadvise(0,0) 
>> solution was a first try at using a slightly smaller hammer. Keeping
>> a dirty-list and flushing it after some time would be the next step
>> if fadvise isn't enough.)
> 
> How do you know what is still dirty in case of writes?

The strategy currently is this: after writing some data to the file (~1M) we use 
fadvise to make the kernel start writing it to disk; after some time we call 
fadvise on the same data _again_, now hopefully it has already hit the disk, is 
clean and will be dropped. (I actually call fadvise three, not two, times just 
to be sure). This seems to work fine for slow sequential writes, such as when 
recording; for cutting we create the dirty data faster than it can be written 
back to disk - this is where the global fadvise(DONTNEED) was supposed to help, 
and in the few cutting tests i did it seemed to be enough.

>> How does the cache behave when _not_ cutting? Over here it looks ok,
>> i've done several recordings while playing back others, and the cache
>> was basically staying the same. (as this is not a dedicated vdr box
>> it is however sometimes hard to be sure)
> 
> With the active read ahead I even have leaks when only reading - the 
> initiated non-blocking reads of the WILL_NEED seem to keep pages in the 
> buffer caches.

Maybe another reiserfs issue? Does it occur when sequentially reading, ie on 
normal playback? Or only when also seeking around in the file? In the latter 
case i was seeing some small leaks too; that was the reason for the fadvise 
calls every X jumps.

> My initial intention when trying to use an active read ahead has been to 
> have no hangs even when another disks needs to spin up. On my system I 
> sometimes have this problem and it is annoying. So a read ahead of 
> several megabytes would be needed here - but even without such a huge 
> read ahead I get this annoying leaks here. For normal operation 

hmm, the readahead is only per-file -- do you have filesystems spanning several 
disks, _some_ of which are spun down?

> (replay) they could be avoided by increasing the region which has to be 
> cleared to at least the size of the read ahead.

Isn't this exactly what is currently happening (both w/o and with my patch)?

>> The current vdr behavior isn't really acceptable -- at the very least
>> the fsyncs have to be configurable -- even a few hundred megabytes
>> needlessly dirtied by vdr is still much better than the bursts of
>> traffic, disk and cpu usage. I personally don't mind the cache
>> trashing so much; it would be enough to keep vdr happily running
>> in the background without disturbing other tasks.
> 
> Depends on the use case. You are absolutely right in the NFS case. In 
> the "dedicated to VDR standalone" case this is different. By throwing 

A config option "Write strategy: NORMAL|STREAMING|BURST" would be enough for 
everyone :) (where STREAMING is what my patch does, at least here, BURST is with 
the fdatasyncs followed by fadvise(DONTNEED), and NORMAL is w/o both)

> away the inode cache it makes usage of big recording archives 
> uncomfortable - it takes up to 20 seconds to scan my local recordings 
> directory. Thats a long time when you just want to select a 
> recording ...

It seemed much longer than 20s here :)
Now that vdr caches the list, it's not a big problem anymore.

>> Are saying you don't get any writeback activity w/ my patch?
> 
> Correct. It starts writing back when memory is filled. Not a single 
> second earlier.
> 
>> With no posix_fadvice and no fdatasync calls in the write path i get
>> almost no writeout with multi-megabyte bursts every minute (triggered
>> probably by ext3 journal commit (interval set to 60s) and/or memory
>> pressure).
> 
> Using reiserfs here. I remember having configured it for lazy disk 
> operations ... maybe this is the source for the above results. The idea 
> has been to collect system writes - to not spin up the disks if not 
> absolutely necessary. But this obviously also results in collecting VDR 
> writes ... anyway I think this is a valid case too. At least for 
> dedicated "multimedia" stations ... A bit more control about VDR IO 
> would be a great thing to have.

reiserfs collecting all writes would explain the behavior; whether it's a good 
thing or not in this scenario i'm not sure. Apparently this does not give you 
any way to force disk writes, other than a synchronous flush (ie fdatasync)?...

>> i don't think that you need any manual readahead at all in a
>> normal-playback situation; especially as the kernel will by default
>> do some. It's only when the disk is getting busier that the benefits
>> of readahead show up. At least this is what i saw here)
> 
> Remember - you switched off read ahead: POSIX_FADV_RANDOM
> ;) 

Just before posting v2 :)
Most tests were w/ POSIX_FADV_SEQUENTIAL, but as we do the readahead manually i 
decided to see if the kernel wasn't interfering too much. So far i haven't seen 
much difference. What did not work was having a large unconditional readahead -- 
this fails spectacularly w/ fast-rewind.

> Anyway - it seems the small read ahead in your patch doesn't had the 
> sightest chance against the multi megabyte write back triggered when 
> buffer cache was on its limits.

well, yes, the readahead is adjusted to the write rate :)

However, one thing that could make a large difference is hardware.
I have two local ATA disks in the vdr machine, both seagates, one older 80G and 
a newer 40G (came w/ the machine, i was too lazy to pull it out so it stayed 
there). Both are alone on an IDE channel, both have 2M cache, both are AFAICT 
identically configured, both have ext3 fs. However the 40G disk is significantly 
slower, and the difference is huge -- you can easily tell when vdr starts using 
that disk, because the increase in latency for unrelated read requests is so 
large. OTOH the 80G disk seems not only way faster, but also much more fair to 
random read requests while writes are going on. Weird.

Regards,

artur