Re: Btrfs slowdown

Andrej Podzimek <andrej@xxxxxxxxxxxx> · Mon, 25 Jul 2011 10:51:09 +0200

Hello,

I can see something similar on the machines I maintain, mostly single-disk setups with a 2.6.39 kernel:

	1) Heavy and frequent disk thrashing, although less than 20% of RAM is used and no swap usage is reported.
	2) During the disk thrashing, some processors (usually 2 or 3) spend 100% of their time busy waiting, according to htop.
	3) Some userspace applications freeze for tens of seconds during the thrashing and busy waiting, sometimes even htop itself...

The problem has only been observed on 64-bit multiprocessors (Core i7 laptop and Nehalem class server Xeons). A 32-bit multiprocessor (Intel Core Duo) and a 64-bit uniprocessor (Intel Core 2 Duo class Celeron) do not seem to have any issues.

Furthermore, none of the machines had this problem with 2.6.38 and earlier kernels. Btrfs "just worked" before 2.6.39. I'll test 3.0 today to see whether some of these issues disappear.

Neither ceph nor any other remote/distributed filesystem (not even NFS) runs on the machines.

The second problem listed above looks like illegal blocking of a vital spinlock during a long disk operation, which freezes some kernel subsystems for an inordinate amount of time and causes a number of processors to wait actively for tens of seconds. (Needless to say that this is not acceptable on a laptop...)

Web browsers (Firefox and Chromium) seem to trigger this issue slightly more often than other applications, but I have no detailed statistics to prove this. ;-)

Two Core i7 class multiprocessors work 100% flawlessly with ext4, although their kernel configuration is otherwise identical to the machines that use Btrfs.

Andrej

Hi,

we are running a ceph cluster with btrfs as it's base filesystem
(kernel 3.0). At the beginning everything worked very well, but after
a few days (2-3) things are getting very slow.

When I look at the object store servers I see heavy disk-i/o on the
btrfs filesystems (disk utilization is between 60% and 100%). I also
did some tracing on the Cepp-Object-Store-Daemon, but I'm quite
certain, that the majority of the disk I/O is not caused by ceph or
any other userland process.

When reboot the system(s) the problems go away for another 2-3 days,
but after that, it starts again. I'm not sure if the problem is
related to the kernel warning I've reported last week. At least there
is no temporal relationship between the warning and the slowdown.

Any hints on how to trace this would be welcome.

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Attachment:
smime.p7s

Description: Elektronický podpis S/MIME