2011/10/25 Josef Bacik <josef@xxxxxxxxxx>:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> 2011/10/25 Josef Bacik <josef@xxxxxxxxxx>:
>> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
[...]
>> >>
>> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> >> tries to balance the load over all OSDs, so all filesystems should get
>> >> a nearly equal load. At the moment one filesystem seems to have a
>> >> problem. When running with iostat I see the following:
>> >>
>> >> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> >> sdd        0.00    0.00   0.00    4.33    0.00    53.33    12.31     0.08   19.38  12.23   5.30
>> >> sdc        0.00    1.00   0.00  228.33    0.00  1957.33     8.57    74.33  380.76   2.74  62.57
>> >> sdb        0.00    0.00   0.00    1.33    0.00    16.00    12.00     0.03   25.00  19.75   2.63
>> >> sda        0.00    0.00   0.00    0.67    0.00     8.00    12.00     0.01   19.50  12.50   0.83
>> >>
>> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
>> >> with top I see this process and a btrfs-endio-writer (PID 5447):
>> >>
>> >>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >>  2053 root  20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>> >>  5447 root  20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
>> >>
>> >> In the latencytop output you can see that those processes have a much
>> >> higher latency than the other ceph-osd and btrfs-endio-writers.
>> >>
>> >
>> > I'm seeing a lot of this:
>> >
>> > [schedule]      1654.6 msec         96.4 %
>> >   schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>> >   generic_write_sync blkdev_aio_write do_sync_readv_writev
>> >   do_readv_writev vfs_writev sys_writev system_call_fastpath
>> >
>> > where ceph-osd's latency is mostly coming from this fsync of a block
>> > device directly, and not so much from being tied up by btrfs itself.
>> > With 22% CPU being taken up by btrfs-endio-wri we must be doing
>> > something wrong. Can you run perf record -ag when this is going on and
>> > then perf report so we can see what btrfs-endio-wri is doing with the
>> > CPU? You can drill down in perf report to get only what btrfs-endio-wri
>> > is doing, so that would be best. As far as the rest of the latencytop
>> > goes, it doesn't seem like btrfs-endio-wri is doing anything horribly
>> > wrong or introducing a lot of latency. Most of it seems to be when
>> > running the delayed refs and having to read in blocks. I've been
>> > suspecting for a while that the delayed ref stuff ends up doing way more
>> > work than it needs to per task, and it's possible that btrfs-endio-wri
>> > is simply getting screwed by other people doing work.
>> >
>> > At this point it seems like the biggest problem with latency in ceph-osd
>> > is not related to btrfs; the latency seems to all be from the fact that
>> > ceph-osd is fsyncing a block dev for whatever reason. As for
>> > btrfs-endio-wri it seems like it's blowing a lot of CPU time, so perf
>> > record -ag is probably going to be your best bet when it's using lots of
>> > CPU, so we can figure out what it's spinning on.
>>
>> Attached is a perf report. I have included the whole report, so that
>> you can see the difference between the good and the bad
>> btrfs-endio-wri.
>>
>
> We also shouldn't be running run_ordered_operations, man this is screwed
> up, thanks so much for this, I should be able to nail this down pretty
> easily.

Please note that this is with "btrfs snaps disabled" in the ceph conf.
When I enable snaps our problems get worse (the btrfs-cleaner thing),
but I would be glad if this one thing gets solved. I can run debugging
with snaps enabled, if you want, but I would suggest that we do this
afterwards.

Thanks,
Christian
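For context, the call chain in the latencytop trace above (sys_writev ->
... -> generic_write_sync -> blkdev_fsync -> blkdev_issue_flush) is what a
synchronous writev() against a block device produces: when the fd is opened
with O_SYNC/O_DSYNC, every write forces a cache flush to the disk before the
syscall returns. Below is a minimal userspace sketch of that pattern; the
device path, buffer size, and loop count are made-up placeholders and are
not taken from the ceph-osd source, and running it against a real disk would
overwrite data.

    /*
     * Sketch only: synchronous vectored writes to a block device.
     * "/dev/sdX" is a hypothetical placeholder -- writing to a real disk
     * this way is destructive.  Build with: gcc -std=gnu99 -o syncwrite syncwrite.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static char buf[4096];

    int main(void)
    {
            /* O_SYNC makes each write durable before the syscall returns. */
            int fd = open("/dev/sdX", O_WRONLY | O_SYNC);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            memset(buf, 0xab, sizeof(buf));
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

            /*
             * With O_SYNC, each writev() ends in generic_write_sync() ->
             * vfs_fsync_range() -> blkdev_fsync() -> blkdev_issue_flush(),
             * i.e. a full device cache flush per call.  Many small writes
             * in this style line up with the high await/avgqu-sz that
             * iostat shows for sdc above.
             */
            for (int i = 0; i < 100; i++) {
                    if (writev(fd, &iov, 1) < 0) {
                            perror("writev");
                            return 1;
                    }
            }

            close(fd);
            return 0;
    }

Watching iostat -x while something like this runs against a scratch device
should show the same flush-bound pattern (low throughput, large await) that
sdc exhibits in the report above.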