2011/10/25 Josef Bacik <josef@xxxxxxxxxx>:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> 2011/10/25 Josef Bacik <josef@xxxxxxxxxx>:
>> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
[...]
>> >>
>> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> >> tries to balance the load over all OSDs, so all filesystems should get
>> >> a nearly equal load. At the moment one filesystem seems to have a
>> >> problem. When running with iostat I see the following:
>> >>
>> >> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> >> sdd        0.00    0.00   0.00    4.33    0.00    53.33    12.31     0.08   19.38  12.23   5.30
>> >> sdc        0.00    1.00   0.00  228.33    0.00  1957.33     8.57    74.33  380.76   2.74  62.57
>> >> sdb        0.00    0.00   0.00    1.33    0.00    16.00    12.00     0.03   25.00  19.75   2.63
>> >> sda        0.00    0.00   0.00    0.67    0.00     8.00    12.00     0.01   19.50  12.50   0.83
>> >>
>> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
>> >> with top I see this process and a btrfs-endio-writer (PID 5447):
>> >>
>> >>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >>  2053 root  20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>> >>  5447 root  20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
>> >>
>> >> In the latencytop output you can see that those processes have a much
>> >> higher latency than the other ceph-osd and btrfs-endio-writers.
>> >>
>> >
>> > I'm seeing a lot of this:
>> >
>> > [schedule]      1654.6 msec         96.4 %
>> >   schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>> >   generic_write_sync blkdev_aio_write do_sync_readv_writev
>> >   do_readv_writev vfs_writev sys_writev system_call_fastpath
>> >
>> > where ceph-osd's latency is mostly coming from this fsync of a block
>> > device directly, and not so much from being tied up by btrfs itself.
>> > With 22% CPU being taken up by btrfs-endio-wri we must be doing
>> > something wrong. Can you run perf record -ag when this is going on and
>> > then perf report so we can see what btrfs-endio-wri is doing with the
>> > CPU? You can drill down in perf report to get only what btrfs-endio-wri
>> > is doing, so that would be best. As far as the rest of the latencytop
>> > goes, it doesn't seem like btrfs-endio-wri is doing anything horribly
>> > wrong or introducing a lot of latency. Most of it seems to be when
>> > running the delayed refs and having to read in blocks. I've been
>> > suspecting for a while that the delayed ref stuff ends up doing way more
>> > work than it needs to per task, and it's possible that btrfs-endio-wri
>> > is simply getting screwed by other people doing work.
>> >
>> > At this point it seems like the biggest problem with latency in ceph-osd
>> > is not related to btrfs; the latency seems to all be from the fact that
>> > ceph-osd is fsyncing a block dev for whatever reason. As for
>> > btrfs-endio-wri it seems like it's blowing a lot of CPU time, so perf
>> > record -ag is probably going to be your best bet when it's using lots of
>> > CPU, so we can figure out what it's spinning on.
>>
>> Attached is a perf report. I have included the whole report, so that
>> you can see the difference between the good and the bad
>> btrfs-endio-wri.
>>
>
> We also shouldn't be running run_ordered_operations, man this is screwed
> up, thanks so much for this, I should be able to nail this down pretty
> easily.

Please note that this is with "btrfs snaps disabled" in the ceph conf.
When I enable snaps our problems get worse (the btrfs-cleaner thing),
but I would be glad if this one thing gets solved. I can run debugging
with snaps enabled, if you want, but I would suggest that we do this
afterwards.

Thanks,
Christian
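For context, the call chain in the latencytop trace above (sys_writev ->
... -> generic_write_sync -> blkdev_fsync -> blkdev_issue_flush) is what a
synchronous writev() against a block device produces: when the fd is opened
with O_SYNC/O_DSYNC, every write forces a cache flush to the disk before the
syscall returns. Below is a minimal userspace sketch of that pattern; the
device path, buffer size, and loop count are made-up placeholders and are
not taken from the ceph-osd source, and running it against a real disk would
overwrite data.

    /*
     * Sketch only: synchronous vectored writes to a block device.
     * "/dev/sdX" is a hypothetical placeholder -- writing to a real disk
     * this way is destructive.  Build with: gcc -std=gnu99 -o syncwrite syncwrite.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static char buf[4096];

    int main(void)
    {
            /* O_SYNC makes each write durable before the syscall returns. */
            int fd = open("/dev/sdX", O_WRONLY | O_SYNC);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            memset(buf, 0xab, sizeof(buf));
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

            /*
             * With O_SYNC, each writev() ends in generic_write_sync() ->
             * vfs_fsync_range() -> blkdev_fsync() -> blkdev_issue_flush(),
             * i.e. a full device cache flush per call.  Many small writes
             * in this style line up with the high await/avgqu-sz that
             * iostat shows for sdc above.
             */
            for (int i = 0; i < 100; i++) {
                    if (writev(fd, &iov, 1) < 0) {
                            perror("writev");
                            return 1;
                    }
            }

            close(fd);
            return 0;
    }

Watching iostat -x while something like this runs against a scratch device
should show the same flush-bound pattern (low throughput, large await) that
sdc exhibits in the report above.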