On Thu 16-02-12 12:00:19, Wu Fengguang wrote: > On Tue, Feb 14, 2012 at 02:29:50PM +0100, Jan Kara wrote: > > > > I wonder what happens if you run: > > > > mkdir /cgroup/x > > > > echo 100M > /cgroup/x/memory.limit_in_bytes > > > > echo $$ > /cgroup/x/tasks > > > > > > > > for (( i = 0; i < 2; i++ )); do > > > > mkdir /fs/d$i > > > > for (( j = 0; j < 5000; j++ )); do > > > > dd if=/dev/zero of=/fs/d$i/f$j bs=1k count=50 > > > > done & > > > > done > > > > > > That's a very good case, thanks! > > > > > > > Because for small files the writearound logic won't help much... > > > > > > Right, it also means the native background work cannot be more I/O > > > efficient than the pageout works, except for the overheads of more > > > work items.. > > Yes, that's true. > > > > > > Also the number of work items queued might become interesting. > > > > > > It turns out that the 1024 mempool reservations are not exhausted at > > > all (the below patch as a trace_printk on alloc failure and it didn't > > > trigger at all). > > > > > > Here is the representative iostat lines on XFS (full "iostat -kx 1 20" log attached): > > > > > > avg-cpu: %user %nice %system %iowait %steal %idle > > > 0.80 0.00 6.03 0.03 0.00 93.14 > > > > > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util > > > sda 0.00 205.00 0.00 163.00 0.00 16900.00 207.36 4.09 21.63 1.88 30.70 > > > > > > The attached dirtied/written progress graph looks interesting. > > > Although the iostat disk utilization is low, the "dirtied" progress > > > line is pretty straight and there is no single congestion_wait event > > > in the trace log. Which makes me wonder if there are some unknown > > > blocking issues in the way. > > Interesting. I'd also expect we should block in reclaim path. How fast > > can dd threads progress when there is no cgroup involved? > > I tried running the dd tasks in global context with > > echo $((100<<20)) > /proc/sys/vm/dirty_bytes > > and got mostly the same results on XFS: > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.85 0.00 8.88 0.00 0.00 90.26 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util > sda 0.00 0.00 0.00 50.00 0.00 23036.00 921.44 9.59 738.02 7.38 36.90 > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.95 0.00 8.95 0.00 0.00 90.11 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util > sda 0.00 854.00 0.00 99.00 0.00 19552.00 394.99 34.14 87.98 3.82 37.80 OK, so it seems that reclaiming pages in memcg reclaim acted as a natural throttling similar to what balance_dirty_pages() does in the global case. > Interestingly, ext4 shows comparable throughput, however is reporting > near 100% disk utilization: > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.76 0.00 9.02 0.00 0.00 90.23 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util > sda 0.00 0.00 0.00 317.00 0.00 20956.00 132.21 28.57 82.71 3.16 100.10 > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.82 0.00 8.95 0.00 0.00 90.23 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util > sda 0.00 0.00 0.00 402.00 0.00 24388.00 121.33 21.09 58.55 2.42 97.40 > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.82 0.00 8.99 0.00 0.00 90.19 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util > sda 0.00 0.00 0.00 409.00 0.00 21996.00 107.56 15.25 36.74 2.30 94.10 Average request size is smaller so maybe ext4 does more seeking. > > > > Another common case to test - run 'slapadd' command in each cgroup to > > > > create big LDAP database. That does pretty much random IO on a big mmaped > > > > DB file. > > > > > > I've not used this. Will it need some configuration and data feed? > > > fio looks more handy to me for emulating mmap random IO. > > Yes, fio can generate random mmap IO. It's just that this is a real life > > workload. So it is not completely random, it happens on several files and > > is also interleaved with other memory allocations from DB. I can send you > > the config files and data feed if you are interested. > > I'm very interested, thank you! OK, I'll send it in private email... > > > > > +/* > > > > > + * schedule writeback on a range of inode pages. > > > > > + */ > > > > > +static struct wb_writeback_work * > > > > > +bdi_flush_inode_range(struct backing_dev_info *bdi, > > > > > + struct inode *inode, > > > > > + pgoff_t offset, > > > > > + pgoff_t len, > > > > > + bool wait) > > > > > +{ > > > > > + struct wb_writeback_work *work; > > > > > + > > > > > + if (!igrab(inode)) > > > > > + return ERR_PTR(-ENOENT); > > > > One technical note here: If the inode is deleted while it is queued, this > > > > reference will keep it living until flusher thread gets to it. Then when > > > > flusher thread puts its reference, the inode will get deleted in flusher > > > > thread context. I don't see an immediate problem in that but it might be > > > > surprising sometimes. Another problem I see is that if you try to > > > > unmount the filesystem while the work item is queued, you'll get EBUSY for > > > > no apparent reason (for userspace). > > > > > > Yeah, we need to make umount work. > > The positive thing is that if the inode is reaped while the work item is > > queue, we know all that needed to be done is done. So we don't really need > > to pin the inode. > > But I do need to make sure the *inode pointer does not point to some > invalid memory at work exec time. Is this possible without raising > ->i_count? I was thinking about it and what should work is that we have inode reference in work item but in generic_shutdown_super() we go through the worklist and drop all work items for superblock before calling evict_inodes()... Honza -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>