sporadic virtio_blk errors and "vcpu not ready for apic_round_robin"

Michael Tokarev <mjt@xxxxxxxxxx> · Fri, 06 Feb 2009 11:00:12 +0300

Hello

For quite some time I have been seeing sporadic I/O errors in guests
running on top of virtio_blk devices.  The information I have is
sparse: the guest usually shows something like:

Feb  6 02:47:34 hobbit kernel: end_request: I/O error, dev vda, sector 9786968
Feb  6 02:47:34 hobbit kernel: Buffer I/O error on device vda7, logical block 473367
Feb  6 02:47:34 hobbit kernel: lost page write due to I/O error on vda7
Feb  6 02:47:34 hobbit kernel: Aborting journal on device vda7.
Feb  6 02:47:35 hobbit kernel: ext3_abort called.
Feb  6 02:47:35 hobbit kernel: EXT3-fs error (device vda7): ext3_journal_start_sb: Detected aborted journal
Feb  6 02:47:35 hobbit kernel: Remounting filesystem read-only

After this point the system is still alive, but the corresponding
block device stops working.  I can umount the device, but any
attempt to remount it reports that the device is *busy*, and using,
say, cfdisk on it (just starting it, so it only attempts to READ the
partition table) results in a kernel OOPS after about 2 minutes of
inactivity.  At that time the host displays a series of

  vcpu not ready for apic_round_robin

messages (about 20 of them).

I'm trying to capture the OOPS right now.  But obviously the problem
is elsewhere, since that OOPS comes long after the original issue (the
I/O errors).

It happens sporadically: sometimes the guest runs for a week,
sometimes (as here) it crashes after several hours of uptime.  It
does not seem to relate to system activity either, as far as I can
see -- it happens on both heavily and lightly loaded systems, and may
happen on a mostly idle guest while another heavily loaded guest is
running at the same time.
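
To line up the error times with whatever the host was doing, something
like the following could pull the I/O error lines out of a guest
syslog.  This is only a sketch: the name find_io_errors is made up for
illustration, and the regular expression assumes exactly the syslog
format quoted above.

```python
import re

# Matches guest kernel lines such as:
#   Feb  6 02:47:34 hobbit kernel: end_request: I/O error, dev vda, sector 9786968
# The pattern is an assumption based on the log excerpt in this message.
IO_ERROR_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def find_io_errors(lines):
    """Return (timestamp, hostname, device, sector) for each end_request I/O error line."""
    hits = []
    for line in lines:
        m = IO_ERROR_RE.match(line)
        if m:
            hits.append(
                (m.group("stamp"), m.group("host"), m.group("dev"), int(m.group("sector")))
            )
    return hits
```

Feeding it the lines of the guest's kernel log (wherever syslog writes
them on that guest) gives a list of timestamps that can be compared
against the host's own logs.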

The host is running 2.6.27.10 x86-64 on an AMD Phenom 9750 processor,
AMD 780G/SB700 chipset.  Using stock kvm modules.  Userspace is
32-bit kvm-83.  Guests are linux systems running 2.6.27.10 or .14,
32-bit, uniprocessor.

After seeing this link -- https://bugs.launchpad.net/ubuntu/+source/kvm/+bug/246175 ,
I disabled cpufreq on the host.  But it didn't help.

The issue has persisted for about a month or two (difficult to say, as
the problem is very sporadic).  I *think* kvm-72 (for example) exhibited
the same problem on this host/guest combination, but I'm not sure.

Any pointers on how to debug the problem, or, even better, word that
it's a known issue, are very welcome -- this is a production system and
it is becoming quite... unstable.

Thanks!

/mjt