Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

Josef Bacik <josef@xxxxxxxxxx> · Tue, 25 Oct 2011 08:23:20 -0400

On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@xxxxxxxxxx>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
> > is up to.
> 
> Capturing this seems to be not easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> an nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sdd               0.00     0.00    0.00    4.33     0.00    53.33
> 12.31     0.08   19.38  12.23   5.30
> sdc               0.00     1.00    0.00  228.33     0.00  1957.33
> 8.57    74.33  380.76   2.74  62.57
> sdb               0.00     0.00    0.00    1.33     0.00    16.00
> 12.00     0.03   25.00  19.75   2.63
> sda               0.00     0.00    0.00    0.67     0.00     8.00
> 12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd taht is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency, than the other ceph-osd and btrfs-endio-writers.
> 

I'm seeing a lot of this

        [schedule]      1654.6 msec         96.4 %
                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
                generic_write_sync blkdev_aio_write do_sync_readv_writev
                do_readv_writev vfs_writev sys_writev system_call_fastpath

where ceph-osd's latency is mostly coming from this fsync of a block device
directly, and not so much being tied up by btrfs directly.  With 22% CPU being
taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
record -ag when this is going on and then perf report so we can see what
btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to get
only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
horribly wrong or introducing a lot of latency.  Most of it seems to be when
running the dleayed refs and having to read in blocks.  I've been suspecting for
a while that the delayed ref stuff ends up doing way more work than it needs to
be per task, and it's possible that btrfs-endio-wri is simply getting screwed by
other people doing work.

At this point it seems like the biggest problem with latency in ceph-osd is not
related to btrfs, the latency seems to all be from the fact that ceph-osd is
fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems like
its blowing a lot of CPU time, so perf record -ag is probably going to be your
best bet when it's using lots of cpu so we can figure out what it's spinning on.
Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html