Bugs item #2933400, was opened at 2010-01-16 15:35
Message generated for change (Comment added) made by masc82
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2933400&group_id=180599

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: MaSc82 (masc82)
>Assigned to: Nobody/Anonymous (nobody)
Summary: virtio-blk io errors / data corruption on raw drives > 1 TB

Initial Comment:
When attaching raw drives > 1 TB, buffer I/O errors will most likely occur and filesystems will get corrupted. Processes (like mkfs.ext4) may freeze completely while a filesystem is being created on the guest.

Here's a typical log excerpt from using mkfs.ext4 on a 1.5 TB drive on an Ubuntu 9.10 guest:

(kern.log)
Jan 15 20:40:44 q kernel: [ 677.076602] Buffer I/O error on device vde, logical block 366283764
Jan 15 20:40:44 q kernel: [ 677.076607] Buffer I/O error on device vde, logical block 366283765
Jan 15 20:40:44 q kernel: [ 677.076611] Buffer I/O error on device vde, logical block 366283766
Jan 15 20:40:44 q kernel: [ 677.076616] Buffer I/O error on device vde, logical block 366283767
Jan 15 20:40:44 q kernel: [ 677.076621] Buffer I/O error on device vde, logical block 366283768
Jan 15 20:40:44 q kernel: [ 677.076626] Buffer I/O error on device vde, logical block 366283769

(messages)
Jan 15 20:40:44 q kernel: [ 677.076534] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076541] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076546] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076599] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076604] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076609] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076613] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076618] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076623] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [ 677.076628] lost page write due to I/O error on vde
Jan 15 20:45:55 q Backgrounding to notify hosts...

(The following entries repeat indefinitely; mkfs.ext4 will not exit and cannot be killed.)
Jan 15 20:49:27 q kernel: [ 1200.520096] mkfs.ext4 D 0000000000000000 0 1839 1709 0x00000000
Jan 15 20:49:27 q kernel: [ 1200.520101] ffff88004e157cb8 0000000000000082 ffff88004e157c58 0000000000015880
Jan 15 20:49:27 q kernel: [ 1200.520115] ffff88004ef6c7c0 0000000000015880 0000000000015880 0000000000015880
Jan 15 20:49:27 q kernel: [ 1200.520118] 0000000000015880 ffff88004ef6c7c0 0000000000015880 0000000000015880
Jan 15 20:49:27 q kernel: [ 1200.520123] Call Trace:
Jan 15 20:49:27 q kernel: [ 1200.520157] [<ffffffff810da0f0>] ? sync_page+0x0/0x50
Jan 15 20:49:27 q kernel: [ 1200.520178] [<ffffffff815255f8>] io_schedule+0x28/0x40
Jan 15 20:49:27 q kernel: [ 1200.520182] [<ffffffff810da12d>] sync_page+0x3d/0x50
Jan 15 20:49:27 q kernel: [ 1200.520185] [<ffffffff81525b17>] __wait_on_bit+0x57/0x80
Jan 15 20:49:27 q kernel: [ 1200.520192] [<ffffffff810da29e>] wait_on_page_bit+0x6e/0x80
Jan 15 20:49:27 q kernel: [ 1200.520205] [<ffffffff81078650>] ? wake_bit_function+0x0/0x40
Jan 15 20:49:27 q kernel: [ 1200.520210] [<ffffffff810e44e0>] ? pagevec_lookup_tag+0x20/0x30
Jan 15 20:49:27 q kernel: [ 1200.520213] [<ffffffff810da745>] wait_on_page_writeback_range+0xf5/0x190
Jan 15 20:49:27 q kernel: [ 1200.520217] [<ffffffff810da807>] filemap_fdatawait+0x27/0x30
Jan 15 20:49:27 q kernel: [ 1200.520220] [<ffffffff810dacb4>] filemap_write_and_wait+0x44/0x50
Jan 15 20:49:27 q kernel: [ 1200.520235] [<ffffffff8114ba9f>] __sync_blockdev+0x1f/0x40
Jan 15 20:49:27 q kernel: [ 1200.520239] [<ffffffff8114bace>] sync_blockdev+0xe/0x10
Jan 15 20:49:27 q kernel: [ 1200.520241] [<ffffffff8114baea>] block_fsync+0x1a/0x20
Jan 15 20:49:27 q kernel: [ 1200.520249] [<ffffffff81142f26>] vfs_fsync+0x86/0xf0
Jan 15 20:49:27 q kernel: [ 1200.520252] [<ffffffff81142fc9>] do_fsync+0x39/0x60
Jan 15 20:49:27 q kernel: [ 1200.520255] [<ffffffff8114301b>] sys_fsync+0xb/0x10
Jan 15 20:49:27 q kernel: [ 1200.520271] [<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b

In my case I had switched to virtio at one point, but the behaviour didn't show until there was more than 1 TB of data on the filesystem. This is very dangerous. Tested using two different SATA controllers, a 1.5 TB lvm/mdraid, a single 1.5 TB drive, and a 2 TB lvm/mdraid. The behaviour does not occur with if=scsi or if=ide.
#2914397 might be related:
https://sourceforge.net/tracker/?func=detail&aid=2914397&group_id=180599&atid=893831
This blog post might also be related:
http://www.neuhalfen.name/2009/08/05/OpenSolaris_KVM_and_large_IDE_drives/

CPU: Intel Xeon E5430
KVM: qemu-kvm-0.12.1.2
Kernel: 2.6.32.2, x86_64
Guest OS: Verified to occur on Ubuntu Linux 9.10 (64-bit) and Gentoo Linux (64-bit) guests

Command line (currently using ide instead of virtio for large drives as a workaround):
qemu-system-x86_64 -S -M pc-0.11 -enable-kvm -m 1500 -smp 2 -name q -uuid 4684a449-0294-6a87-24a0-1f6d7c6e3eba -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/q.monitor,server,nowait -monitor chardev:monitor -boot c -drive cache=writeback,file=/dev/media1/q,if=virtio,index=0,boot=on -drive cache=writeback,file=/dev/media1/q_storage1,if=virtio,index=1 -drive cache=writeback,file=/dev/media2/q_storage2,if=ide,index=0 -drive cache=writeback,file=/dev/media3/q_storage3,if=virtio,index=2 -drive cache=writeback,file=/dev/media4/q_storage4,if=ide,index=1 -drive cache=writeback,file=/dev/media5/q_storage5,if=ide,index=2 -net nic,macaddr=52:54:00:12:34:59,vlan=0,model=virtio,name=virtio.0 -net tap,fd=18,vlan=0,name=tap.0 -serial none -parallel none -usb -usbdevice tablet -vnc 0.0.0.0:2 -k de -vga cirrus

----------------------------------------------------------------------

>Comment By: MaSc82 (masc82)
Date: 2010-04-04 16:26

Message:
Christoph Hellwig's virtio-large-iovecs patch, which fixed #2941282 (http://sourceforge.net/tracker/?func=detail&aid=2941282&group_id=180599&atid=893831), also improves things here. Buffer I/O errors seem to be gone, but corruption on large disks remains.
mkfs.ext4 on a 1.5 TB disk was OK, but soon after starting rsync:

Apr 4 15:49:30 q kernel: [ 975.863441] EXT4-fs error (device vdd): ext4_add_entry: bad entry in directory #86376462: directory entry across blocks - offset=0, inode=1025, rec_len=66576, name_len=0
Apr 4 15:50:03 q kernel: [ 1008.142949] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 10545: 0 blocks in bitmap, 32766 in gd
Apr 4 15:50:32 q kernel: [ 1037.631463] hrtimer: interrupt too slow, forcing clock min delta to 71999847 ns
Apr 4 15:50:54 q kernel: [ 1059.334332] EXT4-fs error (device vdd): ext4_add_entry: bad entry in directory #86376463: directory entry across blocks - offset=0, inode=1025, rec_len=66576, name_len=0
Apr 4 15:52:46 q kernel: [ 1171.534545] EXT4-fs error (device vdd): ext4_add_entry: bad entry in directory #86376463: directory entry across blocks - offset=0, inode=1025, rec_len=66576, name_len=0
Apr 4 15:53:02 q kernel: [ 1187.393262] EXT4-fs error (device vdd): ext4_add_entry: bad entry in directory #86376464: rec_len is smaller than minimal - offset=0, inode=1241866, rec_len=8, name_len=0
Apr 4 15:53:02 q kernel: [ 1187.393559] EXT4-fs error (device vdd): ext4_delete_entry: bad entry in directory #86376464: rec_len is smaller than minimal - offset=0, inode=1241866, rec_len=8, name_len=0
Apr 4 15:53:02 q kernel: [ 1187.393759] EXT4-fs error (device vdd): ext4_add_entry: bad entry in directory #86376464: rec_len is smaller than minimal - offset=0, inode=1241866, rec_len=8, name_len=0

After aborting rsync and running fsck.ext4:

root@q:/var/log# fsck -nf /dev/vdd
fsck from util-linux-ng 2.16
e2fsck 1.41.9 (22-Aug-2009)
Pass 1: Checking inodes, blocks, and sizes
Inode 86376705 has an invalid extent node (blk 1902592, lblk 0)
Clear? no
Inode 86376705, i_blocks is 4193888, should be 0. Fix? no
Inode 86376707 has an invalid extent node (blk 2455552, lblk 0)
Clear? no
Inode 86376707, i_blocks is 1432952, should be 0. Fix? no
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 86376492: 345538560 345538561
Multiply-claimed block(s) in inode 86376691: 345538560 345538561
Error while iterating over blocks in inode 86376705 (pass1b): Corrupt extent header
Error while iterating over blocks in inode 86376707 (pass1b): Corrupt extent header
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 2 inodes containing multiply-claimed blocks.)
File ... (inode #86376492, mod time Wed Feb 17 15:37:40 2010)
  has 2 multiply-claimed block(s), shared with 1 file(s):
    ... (inode #86376691, mod time Sat Jul 11 18:16:10 2009)
Clone multiply-claimed blocks? no
Delete file? no
File ... (inode #86376691, mod time Sat Jul 11 18:16:10 2009)
  has 2 multiply-claimed block(s), shared with 1 file(s):
    ... (inode #86376492, mod time Wed Feb 17 15:37:40 2010)
Clone multiply-claimed blocks? no
Delete file? no
Pass 2: Checking directory structure
Missing '.' in directory inode 86376462. Fix? no
Invalid inode number for '.' in directory inode 86376462. Fix? no
Directory entry for '.' in ... (86376462) is big. Split? no
Directory inode 86376463, block 0, offset 0: directory corrupted
Salvage? no
e2fsck: aborted

The same test using a 400 GB disk didn't show any errors.

----------------------------------------------------------------------

Comment By: Jason (jyellick)
Date: 2010-03-30 07:47

Message:
I too am experiencing this bug. My disks are not very large: most are less than 10 GB, with a few as large as 150 GB. I see the error occur most often during periods of heavy I/O, like system updates, causing the filesystem to remount read-only with some data loss.
Host: Arch Linux, kernel 2.6.33.1 (also observed with 2.6.32), qemu 0.12.3, drbd replicating the VM images
Guest: Arch Linux, kernel 2.6.33.1 (also observed with 2.6.32), rootfs is ext4 on a device-mapper LUKS partition, disk size less than 10 GiB

// Guest kernel log
Mar 20 09:06:57 jhdhr kernel: end_request: I/O error, dev vda, sector 9270528
Mar 20 09:06:57 jhdhr kernel: end_request: I/O error, dev vda, sector 9270536
Mar 20 09:06:57 jhdhr kernel: end_request: I/O error, dev vda, sector 9271520
Mar 20 09:06:57 jhdhr kernel: end_request: I/O error, dev vda, sector 9272488
Mar 20 09:07:04 jhdhr kernel: JBD2: Detected IO errors while flushing file data on dm-0-8
Mar 20 09:07:09 jhdhr kernel: end_request: I/O error, dev vda, sector 9277384
Mar 20 09:07:09 jhdhr kernel: quiet_error: 2640 callbacks suppressed
Mar 20 09:07:09 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159168
Mar 20 09:07:09 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:09 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159169
Mar 20 09:07:09 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:09 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159170
Mar 20 09:07:09 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:09 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159171
Mar 20 09:07:09 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:11 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159172
Mar 20 09:07:11 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:11 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159173
Mar 20 09:07:11 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:11 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159174
Mar 20 09:07:11 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:11 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159175
Mar 20 09:07:11 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:11 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159176
Mar 20 09:07:11 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:11 jhdhr kernel: Buffer I/O error on device dm-0, logical block 1159177
Mar 20 09:07:11 jhdhr kernel: lost page write due to I/O error on dm-0
Mar 20 09:07:11 jhdhr kernel: end_request: I/O error, dev vda, sector 9278392
Mar 20 09:07:11 jhdhr kernel: end_request: I/O error, dev vda, sector 9279264
Mar 20 09:07:11 jhdhr kernel: end_request: I/O error, dev vda, sector 9280272
Mar 20 09:07:11 jhdhr kernel: end_request: I/O error, dev vda, sector 9281280
Mar 20 09:07:11 jhdhr kernel: end_request: I/O error, dev vda, sector 9282288
...
// The blocks of Buffer I/O error, lost page write, and end_request above alternate for a fair portion of the log,
// ultimately ending in the following:
JBD2: Detected IO errors while flushing file data on dm-0-8
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only

// The following is actually from a 2.6.32 dump I had. I have not reproduced this on 2.6.33.1, but thought I would post it for completeness:
Mar 30 00:26:13 jhdhr kernel: end_request: I/O error, dev vda, sector 10000
Mar 30 00:26:15 jhdhr kernel: JBD2: Detected IO errors while flushing file data on dm-0-8
Mar 30 00:26:42 jhdhr kernel: ------------[ cut here ]------------
Mar 30 00:26:42 jhdhr kernel: WARNING: at fs/buffer.c:1159 mark_buffer_dirty+0x74/0x90()
Mar 30 00:26:42 jhdhr kernel: Hardware name: Bochs
Mar 30 00:26:42 jhdhr kernel: Modules linked in: ext4 mbcache jbd2 crc16 cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod sr_mod cdrom floppy ata_piix ata_generic pata_acpi libata scsi_mod virtio_balloon virtio_blk virtio_net virtio_rng virtio_console virtio_pci virtio_ring virtio
Mar 30 00:26:42 jhdhr kernel: Pid: 639, comm: jbd2/dm-0-8 Not tainted 2.6.32-ARCH #1
Mar 30 00:26:42 jhdhr kernel: Call Trace:
Mar 30 00:26:42 jhdhr kernel: [<ffffffff81057b63>] ? warn_slowpath_common+0x73/0xb0
Mar 30 00:26:42 jhdhr kernel: [<ffffffff8113cda4>] ? mark_buffer_dirty+0x74/0x90
Mar 30 00:26:42 jhdhr kernel: [<ffffffffa013b399>] ? __jbd2_journal_unfile_buffer+0x9/0x20 [jbd2]
Mar 30 00:26:42 jhdhr kernel: [<ffffffffa013eb7b>] ? jbd2_journal_commit_transaction+0xcbb/0x1500 [jbd2]
Mar 30 00:26:42 jhdhr kernel: [<ffffffffa0146911>] ? kjournald2+0xb1/0x210 [jbd2]
Mar 30 00:26:42 jhdhr kernel: [<ffffffff810760b0>] ? autoremove_wake_function+0x0/0x30
Mar 30 00:26:42 jhdhr kernel: [<ffffffffa0146860>] ? kjournald2+0x0/0x210 [jbd2]
Mar 30 00:26:42 jhdhr kernel: [<ffffffffa0146860>] ? kjournald2+0x0/0x210 [jbd2]
Mar 30 00:26:42 jhdhr kernel: [<ffffffff81075cee>] ? kthread+0x8e/0xa0
Mar 30 00:26:42 jhdhr kernel: [<ffffffff8101311a>] ? child_rip+0xa/0x20
Mar 30 00:26:42 jhdhr kernel: [<ffffffff81075c60>] ? kthread+0x0/0xa0
Mar 30 00:26:42 jhdhr kernel: [<ffffffff81013110>] ? child_rip+0x0/0x20
Mar 30 00:26:42 jhdhr kernel: ---[ end trace 015f5ef643c6f770 ]---

// After the above error in 2.6.32 I see these slightly different ext4 messages:
JBD2: Detected IO errors while flushing file data on dm-0-8
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs (dm-0): delayed block allocation failed for inode 16756 at logical offset 0 with max blocks 1 with error -30
This should not happen!! Data will be lost
EXT4-fs error (device dm-0) in ext4_da_writepages: Journal has aborted
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only
EXT4-fs (dm-0): ext4_da_writepages: jbd2_start: 323 pages, ino 16756; err -30

----------------------------------------------------------------------

Comment By: MaSc82 (masc82)
Date: 2010-03-06 12:24

Message:
Repeating the test with mkfs.ext4, it now fails as well. It seems I had just been lucky before.

[ 86.803074] end_request: I/O error, dev vdd, sector 2930245608
[ 86.803084] __ratelimit: 3 callbacks suppressed
[ 86.803090] Buffer I/O error on device vdd, logical block 366280701
[ 86.803092] lost page write due to I/O error on vdd
[ 480.380107] INFO: task mkfs.ext4:2470 blocked for more than 120 seconds.
[ 480.380112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.380115] mkfs.ext4 D 0000000000000000 0 2470 2457 0x00000000
[ 480.380120] ffff88003fd85cb8 0000000000000082 ffff88003fd85cb8 0000000000015880
[ 480.380135] ffff880042c61a60 0000000000015880 0000000000015880 0000000000015880
[ 480.380138] 0000000000015880 ffff880042c61a60 0000000000015880 0000000000015880
[ 480.380142] Call Trace:
[ 480.380188] [<ffffffff810da020>] ? sync_page+0x0/0x50
[ 480.380210] [<ffffffff81528978>] io_schedule+0x28/0x40
[ 480.380213] [<ffffffff810da05d>] sync_page+0x3d/0x50
[ 480.380217] [<ffffffff81528e97>] __wait_on_bit+0x57/0x80
[ 480.380220] [<ffffffff810da1ce>] wait_on_page_bit+0x6e/0x80
[ 480.380240] [<ffffffff81078540>] ? wake_bit_function+0x0/0x40
[ 480.380247] [<ffffffff810e4490>] ? pagevec_lookup_tag+0x20/0x30
[ 480.380251] [<ffffffff810da675>] wait_on_page_writeback_range+0xf5/0x190
[ 480.380254] [<ffffffff810da737>] filemap_fdatawait+0x27/0x30
[ 480.380257] [<ffffffff810dabe4>] filemap_write_and_wait+0x44/0x50
[ 480.380273] [<ffffffff8114bd0f>] __sync_blockdev+0x1f/0x40
[ 480.380276] [<ffffffff8114bd3e>] sync_blockdev+0xe/0x10
[ 480.380279] [<ffffffff8114bd5a>] block_fsync+0x1a/0x20
[ 480.380287] [<ffffffff81143196>] vfs_fsync+0x86/0xf0
[ 480.380290] [<ffffffff81143239>] do_fsync+0x39/0x60
[ 480.380293] [<ffffffff8114328b>] sys_fsync+0xb/0x10
[ 480.380306] [<ffffffff81012002>] system_call_fastpath+0x16/0x1b

----------------------------------------------------------------------

Comment By: MaSc82 (masc82)
Date: 2010-03-06 12:13

Message:
I would be very surprised if ext4 still had this kind of severe bug (corrupting the filesystem when it fills up) after having had a stable release for one and a half years.
Just tried to repeat the test with ext3; it fails already at mkfs.ext3:

[ 507.436112] end_request: I/O error, dev vdd, sector 2930245504
[ 507.436119] __ratelimit: 3 callbacks suppressed
[ 507.436125] Buffer I/O error on device vdd, logical block 366280688
[ 507.436127] lost page write due to I/O error on vdd
[ 507.436679] end_request: I/O error, dev vdd, sector 2930245568
[ 507.436682] Buffer I/O error on device vdd, logical block 366280696
[ 507.436683] lost page write due to I/O error on vdd
[ 507.436687] Buffer I/O error on device vdd, logical block 366280697
[ 507.436689] lost page write due to I/O error on vdd
[ 507.436692] Buffer I/O error on device vdd, logical block 366280698
[ 507.436693] lost page write due to I/O error on vdd
[ 507.436696] Buffer I/O error on device vdd, logical block 366280699
[ 507.436697] lost page write due to I/O error on vdd
[ 507.436700] Buffer I/O error on device vdd, logical block 366280700
[ 507.436702] lost page write due to I/O error on vdd
[ 507.436704] Buffer I/O error on device vdd, logical block 366280701
[ 507.436706] lost page write due to I/O error on vdd
[ 507.436708] Buffer I/O error on device vdd, logical block 366280702
[ 507.436710] lost page write due to I/O error on vdd
[ 507.436712] Buffer I/O error on device vdd, logical block 366280703
[ 507.436714] lost page write due to I/O error on vdd
[ 720.380113] INFO: task mkfs.ext3:2645 blocked for more than 120 seconds.
[ 720.380118] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.380121] mkfs.ext3 D 0000000000000000 0 2645 2458 0x00000004
[ 720.380126] ffff88001259b9a8 0000000000000046 0000000000001000 0000000000015880
[ 720.380141] ffff88003862c7c0 0000000000015880 0000000000015880 0000000000015880
[ 720.380145] 0000000000015880 ffff88003862c7c0 0000000000015880 0000000000015880
[ 720.380149] Call Trace:
[ 720.380194] [<ffffffff810da020>] ? sync_page+0x0/0x50
[ 720.380216] [<ffffffff81528978>] io_schedule+0x28/0x40
[ 720.380219] [<ffffffff810da05d>] sync_page+0x3d/0x50
[ 720.380223] [<ffffffff81528e97>] __wait_on_bit+0x57/0x80
[ 720.380226] [<ffffffff810da1ce>] wait_on_page_bit+0x6e/0x80
[ 720.380239] [<ffffffff81078540>] ? wake_bit_function+0x0/0x40
[ 720.380247] [<ffffffff810e4490>] ? pagevec_lookup_tag+0x20/0x30
[ 720.380250] [<ffffffff810da675>] wait_on_page_writeback_range+0xf5/0x190
[ 720.380254] [<ffffffff810da737>] filemap_fdatawait+0x27/0x30
[ 720.380263] [<ffffffff810dabe4>] filemap_write_and_wait+0x44/0x50
[ 720.380279] [<ffffffff8114bd0f>] __sync_blockdev+0x1f/0x40
[ 720.380283] [<ffffffff8114bd3e>] sync_blockdev+0xe/0x10
[ 720.380286] [<ffffffff8114be08>] __blkdev_put+0xa8/0x1a0
[ 720.380289] [<ffffffff811505fb>] ? fsnotify+0xfb/0x140
[ 720.380292] [<ffffffff8114bf0b>] blkdev_put+0xb/0x10
[ 720.380295] [<ffffffff8114bf3e>] blkdev_close+0x2e/0x50
[ 720.380304] [<ffffffff811204ba>] __fput+0xda/0x1f0
[ 720.380307] [<ffffffff811205ed>] fput+0x1d/0x30
[ 720.380315] [<ffffffff8111c6f8>] filp_close+0x58/0x90
[ 720.380329] [<ffffffff810603a7>] put_files_struct+0x77/0xe0
[ 720.380333] [<ffffffff8106045f>] exit_files+0x4f/0x60
[ 720.380339] [<ffffffff810627eb>] do_exit+0x14b/0x360
[ 720.380343] [<ffffffff810dc394>] ? generic_file_aio_write_nolock+0x44/0xb0
[ 720.380346] [<ffffffff81062a49>] do_group_exit+0x49/0xc0
[ 720.380350] [<ffffffff8106f3ab>] get_signal_to_deliver+0x1bb/0x3b0
[ 720.380364] [<ffffffff810119c0>] do_signal+0x70/0x1c0
[ 720.380383] [<ffffffff812f5ef8>] ? tty_write+0x228/0x290
[ 720.380386] [<ffffffff81011b5f>] do_notify_resume+0x4f/0x60
[ 720.380390] [<ffffffff8111f66c>] ? sys_write+0x4c/0x80
[ 720.380393] [<ffffffff810120fb>] sysret_signal+0x7e/0xcf
[ 720.380397] INFO: task blkid:2650 blocked for more than 120 seconds.
[ 720.380399] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.380401] blkid D 0000000000000000 0 2650 1 0x00000000
[ 720.380405] ffff8800082bfc28 0000000000000082 ffff88005842f720 0000000000015880
[ 720.380409] ffff88004b5bde70 0000000000015880 0000000000015880 0000000000015880
[ 720.380412] 0000000000015880 ffff88004b5bde70 0000000000015880 0000000000015880
[ 720.380416] Call Trace:
[ 720.380420] [<ffffffff815294f7>] __mutex_lock_slowpath+0xd7/0x160
[ 720.380423] [<ffffffff8114c890>] ? blkdev_open+0x0/0xc0
[ 720.380426] [<ffffffff81528f6e>] mutex_lock+0x1e/0x40
[ 720.380429] [<ffffffff8114c518>] __blkdev_get+0x38/0x3a0
[ 720.380432] [<ffffffff8114c890>] ? blkdev_open+0x0/0xc0
[ 720.380435] [<ffffffff8114c88b>] blkdev_get+0xb/0x10
[ 720.380438] [<ffffffff8114c8fc>] blkdev_open+0x6c/0xc0
[ 720.380441] [<ffffffff8111cad6>] __dentry_open+0xe6/0x2e0
[ 720.380444] [<ffffffff8111cdd7>] nameidata_to_filp+0x57/0x70
[ 720.380452] [<ffffffff8112c42a>] do_filp_open+0x2ca/0xac0
[ 720.380459] [<ffffffff810f3f49>] ? __do_fault+0x419/0x4e0
[ 720.380470] [<ffffffff81036419>] ? default_spin_lock_flags+0x9/0x10
[ 720.380474] [<ffffffff81136da2>] ? alloc_fd+0x102/0x150
[ 720.380477] [<ffffffff8111c8a4>] do_sys_open+0x64/0x160
[ 720.380480] [<ffffffff8111c9cb>] sys_open+0x1b/0x20
[ 720.380483] [<ffffffff81012002>] system_call_fastpath+0x16/0x1b

----------------------------------------------------------------------

Comment By: Michael Tokarev (mjtsf)
Date: 2010-03-06 11:32

Message:
I too see similar behaviour here when switching from 0.11 to 0.12. In my case the filesystems are not large (a few dozen gigabytes, 128 GB max). I have errors=remount-ro set on all my ext* filesystems, and quite regularly I see one or another filesystem get remounted read-only, accompanied by quite some noise in dmesg like the above: "Buffer I/O error on device vdX, logical block ZZZZ" and "lost page write due to I/O error on vdX". I don't see traces like the ones above; those look like bugs somewhere in the ext4 code.
Other than that (I/O errors on increasingly numbered blocks) there's nothing interesting anywhere, which is why I didn't report it so far: the errors are quite random and difficult to hit, and I can't gather any useful info, neither on the host nor on the guest nor from kvm itself. After noticing that a filesystem has gone read-only, remounting it makes it work again, until the next incident.

----------------------------------------------------------------------

Comment By: MaSc82 (masc82)
Date: 2010-03-06 10:11

Message:
In 0.12.3 the behaviour has improved significantly, but it seems it's not fixed. Tested with a 1.5 TB virtio raw drive: I created an ext4 filesystem directly on the drive and rsynced data from a larger source to fill the drive up completely. Until the drive became full, everything looked good; no more buffer I/O errors or lost page writes as there had been in 0.12.2. I expected rsync to abort at some point when the disk became full, but it surprisingly continued for quite a while, although df -h was already indicating there was no space left. The following messages began to show up in kernel.log:
[45517.836999] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310728
[45517.847357] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310730
[45517.848044] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310731
[45517.853757] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310733
[45517.864262] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310734
[45517.935073] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310736
[45518.364535] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1441857
[45958.938625] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1441857
[45959.559208] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310736
[45959.562137] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310734
[45959.564764] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310733
[45959.568446] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310731
[45959.570541] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310730
[45959.580392] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310728
[46393.982995] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 8465: 28690 blocks in bitmap, 2047 in gd
[46799.823463] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1441857
[46800.532939] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310736
[46800.543741] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310734
[46800.547021] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310733
[46800.548173] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310731
[46800.549303] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310730
[46800.561472] EXT4-fs error (device vdd): ext4_lookup: deleted inode referenced: 1310728
[47518.779879] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 8689: 28368 blocks in bitmap, 0 in gd
[47686.658580] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 10822: 27802 blocks in bitmap, 1596 in gd
[47689.313508] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 9212: 16198 blocks in bitmap, 1158 in gd
[47697.030338] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 9473: 28686 blocks in bitmap, 1021 in gd
[47726.639706] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 6825: 2047 blocks in bitmap, 1023 in gd
[47782.529462] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 252: 358 blocks in bitmap, 102 in gd
[47782.929821] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 378: 778 blocks in bitmap, 266 in gd
[47782.929945] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 382: 1023 blocks in bitmap, 511 in gd
[47783.073044] EXT4-fs error (device vdd): ext4_mb_generate_buddy: EXT4-fs: group 583: 683 blocks in bitmap, 171 in gd
[47786.626901] EXT4-fs error (device vdd): ext4_mb_complex_scan_group: 18 free blocks as per group info. But got 82 blocks
[47788.300039] EXT4-fs error (device vdd): ext4_mb_complex_scan_group: 18 free blocks as per group info. But got 82 blocks
[47788.307045] EXT4-fs error (device vdd): ext4_mb_complex_scan_group: 18 free blocks as per group info. But got 82 blocks
[47790.176599] EXT4-fs error (device vdd): ext4_mb_complex_scan_group: 18 free blocks as per group info. But got 82 blocks

When rsync finally aborted (reporting no space left on device), the filesystem was corrupt and had to be repaired with fsck.ext4, which detected multiply-claimed blocks, especially within the most recently written files.

----------------------------------------------------------------------

Comment By: Sebastian (grisia)
Date: 2010-01-30 15:03

Message:
Same behavior here.
With QEMU-KVM version 0.11.1 the error doesn't appear.

Host (Gentoo):
Vanilla kernel sources 2.6.32.7
QEMU-KVM 0.12.2
Guest image is an LVM2 partition (~30 GB)
Guest is using virtio-blk

Guest (Gentoo):
Vanilla kernel sources 2.6.32.7
Filesystem is ext4

Log after copying a lot of files:
Jan 30 14:06:02 jboss kernel: [ 40.177314] end_request: I/O error, dev vdd, sector 33555759
Jan 30 14:06:02 jboss kernel: [ 40.177314] Buffer I/O error on device vdd1, logical block 4194462
Jan 30 14:06:02 jboss kernel: [ 40.177314] lost page write due to I/O error on vdd1
...snip...
Jan 30 14:06:02 jboss kernel: [ 40.177314] Buffer I/O error on device vdd1, logical block 4194471
Jan 30 14:06:02 jboss kernel: [ 40.177314] lost page write due to I/O error on vdd1
Jan 30 14:06:02 jboss kernel: [ 40.177314] end_request: I/O error, dev vdd, sector 33556767
Jan 30 14:06:02 jboss kernel: [ 40.177314] end_request: I/O error, dev vdd, sector 37748799
Jan 30 14:06:02 jboss kernel: [ 40.177314] end_request: I/O error, dev vdd, sector 37749055
Jan 30 14:06:02 jboss kernel: [ 40.177314] end_request: I/O error, dev vdd, sector 41943103
Jan 30 14:06:02 jboss kernel: [ 40.177314] end_request: I/O error, dev vdd, sector 41943231
...snip...
Jan 30 14:06:02 jboss kernel: [ 40.535373] end_request: I/O error, dev vdd, sector 58734687
Jan 30 14:06:02 jboss kernel: [ 40.535421] end_request: I/O error, dev vdd, sector 58735695
Jan 30 14:06:02 jboss kernel: [ 40.535476] JBD: recovery failed
Jan 30 14:06:02 jboss kernel: [ 40.535478] EXT4-fs (vdd1): error loading journal
Jan 30 14:06:26 jboss kernel: [ 63.811349] EXT4-fs (vdd1): mounted filesystem with ordered data mode
Jan 30 14:06:32 jboss kernel: [ 70.131957] EXT4-fs (vdc5): mounted filesystem with ordered data mode
Jan 30 14:06:48 jboss kernel: [ 85.852773] JBD: barrier-based sync failed on vdd1-8 - disabling barriers
Jan 30 14:07:08 jboss kernel: [ 106.391820] end_request: I/O error, dev vdd, sector 21283503
Jan 30 14:07:08 jboss kernel: [ 106.391823] __ratelimit: 12124 callbacks suppressed
Jan 30 14:07:08 jboss kernel: [ 106.391826] Buffer I/O error on device vdd1, logical block 2660430
Jan 30 14:07:08 jboss kernel: [ 106.391828] lost page write due to I/O error on vdd1
Jan 30 14:07:08 jboss kernel: [ 106.391830] Buffer I/O error on device vdd1, logical block 2660431
Jan 30 14:07:08 jboss kernel: [ 106.391832] lost page write due to I/O error on vdd1
Jan 30 14:07:08 jboss kernel: [ 106.391834] Buffer I/O error on device vdd1, logical block 2660432
Jan 30 14:07:08 jboss kernel: [ 106.391835] lost page write due to I/O error on vdd1
...snip...
Jan 30 14:07:19 jboss kernel: [ 116.855081] end_request: I/O error, dev vdd, sector 38055999
Jan 30 14:07:19 jboss kernel: [ 116.855146] end_request: I/O error, dev vdd, sector 38056983
Jan 30 14:07:19 jboss kernel: [ 116.855207] end_request: I/O error, dev vdd, sector 38057935
Jan 30 14:07:19 jboss kernel: [ 116.855271] end_request: I/O error, dev vdd, sector 38058935
Jan 30 14:07:19 jboss kernel: [ 116.855339] end_request: I/O error, dev vdd, sector 38059943
Jan 30 14:07:19 jboss kernel: [ 116.855418] end_request: I/O error, dev vdd, sector 38060951
Jan 30 14:07:19 jboss kernel: [ 117.064320] JBD2: Detected IO errors while flushing file data on vdd1-8
Jan 30 14:07:19 jboss kernel: [ 117.064591] Aborting journal on device vdd1-8.
Jan 30 14:07:19 jboss kernel: [ 117.064663] EXT4-fs error (device vdd1) in ext4_reserve_inode_write: Journal has aborted
Jan 30 14:07:19 jboss kernel: [ 117.064921] EXT4-fs error (device vdd1) in ext4_dirty_inode: Journal has aborted
Jan 30 14:07:19 jboss kernel: [ 117.065410] EXT4-fs error (device vdd1): ext4_journal_start_sb: Detected aborted journal
Jan 30 14:07:19 jboss kernel: [ 117.065413] EXT4-fs (vdd1): Remounting filesystem read-only

Command line:
qemu -M pc -cpu host -name kvm.jboss -uuid 50259a3c-d6de-4c6c-811d-753f2df9658a -smp 1 -drive file=/dev/lg_virtual_machines/jboss,format=host_device,if=virtio,index=0,boot=on,cache=writeback -drive file=/dev/lg_virtual_machines_ext/jboss_swap,format=host_device,if=virtio,index=1,cache=writeback -drive file=/dev/lg_virtual_machines/yacy,format=host_device,if=virtio,index=2 -drive file=/dev/lg_virtual_machines_ext/yacy_backup,format=host_device,if=virtio,index=3 -net vde,vlan=0,name=jboss,sock=/var/run/vde2/vde_dmz -net nic,vlan=0,macaddr=54:52:02:77:5a:04,model=virtio,name=ifdmzjboss -k de -m 256 -parallel none -serial none -boot c -daemonize -vga std -nographic -balloon virtio -enable-kvm -pidfile /var/run/qemu/kvm.jboss.pid -vnc unix:/var/lib/qemu/vnc_displays/kvm.jboss.socket -monitor unix:/var/lib/qemu/monitors/kvm.jboss.socket,server,nowait

----------------------------------------------------------------------

Comment By: Avi Kivity (avik)
Date: 2010-01-17 10:38

Message:
1 TB = 2^40 B = 2^31 sectors. Overflow?

----------------------------------------------------------------------

Comment By: MaSc82 (masc82)
Date: 2010-01-16 15:59

Message:
Updated to priority 9, since people moving from scsi/ide to virtio with existing filesystems will at some point experience corruption and data loss.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2933400&group_id=180599
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html