Re: Giant + nfs over cephfs hang tasks

Andrei Mikhailovsky <andrei@xxxxxxxxxx> · Mon, 1 Dec 2014 10:39:56 +0000 (GMT)

Ilya, 

I will try that again tonight. This is a production cluster, and when the dds trigger that dmesg error the cluster's I/O becomes very bad and I have to reboot the server to get things back on track. Most of my VMs sit at 70-90% iowait until that server is rebooted.

I actually checked what you asked for the last time I ran the test.

When I run 4 dds concurrently, nothing appears in the dmesg output. No messages at all.

The kern.log file that I sent last time is what I got about a minute after I started 8 dds. I pasted the full output. The 8 dds did actually complete, but they took a rather long time: I was getting about 6MB/s per dd process, compared to around 70MB/s per dd process when 4 dds were running. Do you still want me to run this again, or is the information I've provided enough?

Cheers

Andrei 

From: "Ilya Dryomov" <ilya.dryomov@xxxxxxxxxxx>
To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "Gregory Farnum" <greg@xxxxxxxxxxx>
Sent: Monday, 1 December, 2014 8:22:08 AM
Subject: Re:  Giant + nfs over cephfs hang tasks

On Mon, Dec 1, 2014 at 12:30 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>
> Ilya, further to your email, I have switched back to the 3.18 kernel that
> you sent, and I got dmesg output similar to what I had on the 3.17
> kernel. Please find it attached for your reference. As before, this is the
> command I ran on the client:
>
>
> time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct &
> time dd if=/dev/zero of=4G11 bs=4M count=5K oflag=direct &
> time dd if=/dev/zero of=4G22 bs=4M count=5K oflag=direct &
> time dd if=/dev/zero of=4G33 bs=4M count=5K oflag=direct &
> time dd if=/dev/zero of=4G44 bs=4M count=5K oflag=direct &
> time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct &
> time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct &
> time dd if=/dev/zero of=4G77 bs=4M count=5K oflag=direct &

Can you run that command again, on the 3.18 kernel and to completion,
and paste the following?

- the entire dmesg
- "time" results for each dd

Compare those to your results with four dds (or any other number which
doesn't trigger page allocation failures).

Thanks,

                Ilya
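
For anyone repeating the comparison asked for above (8 dds versus 4 dds, each run to completion, with per-dd timings and the full dmesg kept), a minimal sketch of how it might be scripted is below. Nothing in it comes from the thread itself: the function name run_dds, the DD_BS and DD_OFLAG variables, and the output file names are all hypothetical, and elapsed time is taken with date in whole seconds as a rough stand-in for "time".

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: start N dd writers at once,
# wait for all of them, and keep one timing file per dd plus a dmesg dump.
#   $1 = number of concurrent dds
#   $2 = dd count= value (blocks per dd)
#   $3 = directory for data files and logs (put it on the mount under test)
# DD_BS (default 4M) and DD_OFLAG (default direct) are assumed knobs;
# set DD_OFLAG to an empty string to drop the oflag= argument entirely.
run_dds() {
    n=$1
    count=$2
    outdir=$3
    bs=${DD_BS:-4M}
    oflag=${DD_OFLAG-direct}

    mkdir -p "$outdir" || return 1

    i=0
    while [ "$i" -lt "$n" ]; do
        (
            start=$(date +%s)
            # ${oflag:+...} expands to nothing when oflag is empty
            dd if=/dev/zero of="$outdir/dd.$i.dat" bs="$bs" count="$count" \
                ${oflag:+oflag=$oflag} 2>"$outdir/dd.$i.log"
            rc=$?
            end=$(date +%s)
            echo "dd $i: exit=$rc elapsed=$((end - start))s" >"$outdir/dd.$i.time"
        ) &
        i=$((i + 1))
    done

    # Block until every background dd has finished.
    wait

    # Reading dmesg may need root on some systems; do not fail the run if so.
    dmesg >"$outdir/dmesg.txt" 2>/dev/null ||
        echo "dmesg not readable" >"$outdir/dmesg.txt"

    cat "$outdir"/dd.*.time
}
```

With GNU dd, something like "run_dds 8 5K ./run8" followed by "run_dds 4 5K ./run4" would mirror the command quoted above; note that this writes 20 GiB per dd, and that on a production cluster the same caution about degraded I/O applies. The dd.N.log files hold dd's own throughput summary, and dmesg.txt can be checked for page allocation failures after each run.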
