Hi,

resending without the perf attachment, which can be found here:
http://tuxadero.com/multistorage/perf.report.txt.bz2

Best Regards,
martin

-------- Original Message --------
Subject: Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
Date: Wed, 26 Oct 2011 22:38:47 +0200
From: Martin Mailand <martin@xxxxxxxxxxxx>
Reply-To: martin@xxxxxxxxxxxx
To: Sage Weil <sage@xxxxxxxxxxxx>
CC: Christian Brunner <chb@xxxxxx>, ceph-devel@xxxxxxxxxxxxxxx, linux-btrfs@xxxxxxxxxxxxxxx

Hi,

I have more or less the same setup as Christian and I suffer from the same
problems. But as far as I can see, the output of latencytop and perf differs
from Christian's; both are attached.

I was wondering about the high latency from btrfs-submit.

Process btrfs-submit-0 (970) Total: 2123.5 msec

I also see the high IO rate and high IO wait.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.60    0.00    2.20   82.40    0.00   14.80

Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00    0.00   0.00    8.40    0.00    74.40    17.71     0.03    3.81    0.00    3.81   3.81   3.20
sdb        0.00    7.00   0.00  269.80    0.00  1224.80     9.08   107.19  398.69    0.00  398.69   3.15  85.00

top - 21:57:41 up 8:41, 1 user, load average: 0.65, 0.79, 0.76
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.6%us, 2.4%sy, 0.0%ni, 70.8%id, 25.8%wa, 0.0%hi, 0.3%si, 0.0%st
Mem:  4018276k total, 1577728k used, 2440548k free,   10496k buffers
Swap: 1998844k total,       0k used, 1998844k free, 1316696k cached

  PID USER  PR  NI  VIRT   RES   SHR S %CPU %MEM    TIME+ COMMAND
 1399 root  20   0  548m  103m  3428 S  0.0  2.6  2:01.85 ceph-osd
 1401 root  20   0  548m  103m  3428 S  0.0  2.6  1:51.71 ceph-osd
 1400 root  20   0  548m  103m  3428 S  0.0  2.6  1:50.30 ceph-osd
 1391 root  20   0     0     0     0 S  0.0  0.0  1:18.39 btrfs-endio-wri
  976 root  20   0     0     0     0 S  0.0  0.0  1:18.11 btrfs-endio-wri
 1367 root  20   0     0     0     0 S  0.0  0.0  1:05.60 btrfs-worker-1
  968 root  20   0     0     0     0 S  0.0  0.0  1:05.45 btrfs-worker-0
 1163 root  20   0  141m  1636  1100 S  0.0  0.0  1:00.56 collectd
  970 root  20   0     0     0     0 S  0.0  0.0  0:47.73 btrfs-submit-0
 1402 root  20   0  548m  103m  3428 S  0.0  2.6  0:34.86 ceph-osd
 1392 root  20   0     0     0     0 S  0.0  0.0  0:33.70 btrfs-endio-met
  975 root  20   0     0     0     0 S  0.0  0.0  0:32.70 btrfs-endio-met
 1415 root  20   0  548m  103m  3428 S  0.0  2.6  0:28.29 ceph-osd
 1414 root  20   0  548m  103m  3428 S  0.0  2.6  0:28.24 ceph-osd
 1397 root  20   0  548m  103m  3428 S  0.0  2.6  0:24.60 ceph-osd
 1436 root  20   0  548m  103m  3428 S  0.0  2.6  0:13.31 ceph-osd

Here is my setup.

Kernel v3.1 + Josef

The config for this osd (ceph version 0.37
(commit:a6f3bbb744a6faea95ae48317f0b838edb16a896)) is:

[osd.1]
        host = s-brick-003
        osd journal = /dev/sda7
        btrfs devs = /dev/sdb
        btrfs options = noatime
        filestore_btrfs_snap = false

I hope this helps to pinpoint the problem.

Best Regards,
martin

Sage Weil wrote:
> On Wed, 26 Oct 2011, Christian Brunner wrote:
>> 2011/10/26 Sage Weil <sage@xxxxxxxxxxxx>:
>>> On Wed, 26 Oct 2011, Christian Brunner wrote:
>>>>>>> Christian, have you tweaked those settings in your ceph.conf? It
>>>>>>> would be something like 'journal dio = false'. If not, can you
>>>>>>> verify that directio shows true when the journal is initialized
>>>>>>> from your osd log? E.g.,
>>>>>>>
>>>>>>> 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>>>>>>>
>>>>>>> If directio = 1 for you, something else funky is causing those
>>>>>>> blkdev_fsync's...
>>>>>>
>>>>>> I've looked it up in the logs - directio is 1:
>>>>>>
>>>>>> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096 bytes, directio = 1
>>>>>
>>>>> Do you mind capturing an strace? I'd like to see where that
>>>>> blkdev_fsync is coming from.
>>>>
>>>> Here is an strace. I can see a lot of sync_file_range operations.
>>>
>>> Yeah, these all look like the flusher thread, and shouldn't be hitting
>>> blkdev_fsync. Can you confirm that with
>>>
>>>         filestore flusher = false
>>>         filestore sync flush = false
>>>
>>> you get no sync_file_range at all? I wonder if this is also perf lying
>>> about the call chain.
>>
>> Yes, setting this makes the sync_file_range calls go away.
>
> Okay. That means either sync_file_range on a regular btrfs file is
> triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky
> bug that is mixing up file descriptors, or latencytop is lying. I'm
> guessing the latter, given the other weirdness Josef and Chris were
> seeing. :)
>
>> Is it safe to use these settings with "filestore btrfs snap = 0"?
>
> Yeah. They're purely a performance thing to push as much dirty data to
> disk as quickly as possible to minimize the snapshot create latency.
> You'll notice the write throughput tends to tank when they're off.
>
> sage

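The check Sage asks for in the quoted thread (no sync_file_range at all once the flusher is disabled) can be done by counting syscall names in the captured strace log. The following is only a rough sketch: the sample lines are made up for illustration and are not taken from Christian's strace, and the regex assumes the common strace line layouts (optional pid prefix from -f, optional timestamp from -t/-tt).

```python
import re
from collections import Counter

# Matches the syscall name at the start of an strace line.
# "<... name resumed>" continuation lines are deliberately not matched,
# so a call split by "<unfinished ...>" is counted only once.
SYSCALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"          # optional pid prefix (strace -f)
    r"(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?"    # optional timestamp (-t / -tt)
    r"([a-z_][a-z0-9_]*)\("                   # syscall name followed by "("
)

def count_syscalls(lines):
    """Return a Counter mapping syscall name -> number of calls started."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "1399 sync_file_range(15, 0, 4096, SYNC_FILE_RANGE_WRITE) = 0",
    "[pid 1400] fdatasync(14 <unfinished ...>",
    "[pid 1400] <... fdatasync resumed> ) = 0",
    "21:57:41.000123 sync_file_range(15, 4096, 8192, SYNC_FILE_RANGE_WRITE) = 0",
    'write(14, "...", 4096) = 4096',
]

if __name__ == "__main__":
    for name, n in count_syscalls(SAMPLE).most_common():
        print(f"{name:20s} {n}")
```

Running the same function over a real log (for example `count_syscalls(open(path))`, with `path` being whatever file strace wrote to) before and after changing the flusher settings would show whether the sync_file_range count really drops to zero.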
Attachment:
latencytop.txt.bz2
Description: application/bzip
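
The iostat figures in the message above can be cross-checked against each other. A minimal sketch, assuming the usual reading of the extended iostat columns (await and svctm in milliseconds, avgqu-sz as the mean number of queued requests): Little's law says the mean queue length is roughly the request rate times the mean time a request spends in the system, and utilization is roughly the request rate times the service time. The dictionary keys are names chosen here for illustration; the numbers are copied from the iostat output.

```python
# Values copied from the iostat output in the message; key names are
# illustrative and simply mirror the iostat column headers.
DEVICES = {
    "sda": {"r_s": 0.00, "w_s": 8.40, "avgqu_sz": 0.03,
            "await_ms": 3.81, "svctm_ms": 3.81, "util_pct": 3.20},
    "sdb": {"r_s": 0.00, "w_s": 269.80, "avgqu_sz": 107.19,
            "await_ms": 398.69, "svctm_ms": 3.15, "util_pct": 85.00},
}

def derived(stats):
    """Return (estimated queue length, estimated %util, queue wait in ms)."""
    iops = stats["r_s"] + stats["w_s"]
    # Little's law: requests in flight = arrival rate * time in system.
    queue_estimate = iops * stats["await_ms"] / 1000.0
    # Utilization: fraction of each second the device spends servicing I/O.
    util_estimate = iops * stats["svctm_ms"] / 1000.0 * 100.0
    # Time a request spends waiting in the queue rather than being serviced.
    queue_wait_ms = stats["await_ms"] - stats["svctm_ms"]
    return queue_estimate, util_estimate, queue_wait_ms

if __name__ == "__main__":
    for name, stats in DEVICES.items():
        q, u, w = derived(stats)
        print(f"{name}: queue~{q:.2f} (reported {stats['avgqu_sz']}), "
              f"util~{u:.1f}% (reported {stats['util_pct']}%), "
              f"queue wait~{w:.2f} ms")
```

For sdb the estimates come out at about 107.6 queued requests and about 85% utilization, which agrees with the reported avgqu-sz and %util. Read this way, almost all of the 398.69 ms await on sdb is time spent waiting in the queue rather than in the 3.15 ms of service time.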