Data corruption when using dm-crypt over RAID5

Andrey <andrey@xxxxxxxxx> · Tue, 28 Nov 2006 02:22:23 +0300

Hello, all.

This is about the well-known "dm-crypt is broken and causes massive data
damage" issue. Please read.

I'm trying to chase down data corruption that happens when I use dm-crypt
over RAID5. About 3 years ago I built a storage server for myself that
uses the same setup (a dm-crypt partition over a software RAID5 array) and
have used it ever since without any problems. Last week I built another
server with the same setup; the only differences are that it is x86_64 and
the disks are bigger. I have a repro of this corruption, although it is
random. I have spent several days chasing it, but still have not found the
cause. Here is my setup:

- 64-bit AMD CPU, 1 GB RAM, 2.6.18.3 kernel

/proc/mdstat

md1 : active raid5 hdc3[2] hdb2[1] hda2[0]
     778485504 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

First, the problem can be reproduced with the following steps:
- create a crypt target
- copy about 4 GB of data (pdflush takes about 40% CPU, and all free
memory goes to cache)
- e2fsck -f

Attached is the script I use.

What I have verified so far (in the order I tried):

This is dm-crypt related. Replacing crypt with linear eliminates all
corruption, i.e. "0 18874368 linear /dev/md1 1280" works while
"0 18874368 crypt aes-cbc-plain $KEY128 0 /dev/md1 1280" yields corruption.
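
For reference, the numbers in those table lines (and in the mke2fs call in
the attached script) can be checked with a little arithmetic. The sketch
below is only an illustration; the helper names are made up, and it assumes
the usual device-mapper convention of 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL  /* device-mapper tables count 512-byte sectors */

/* Convert a length or offset given in 512-byte sectors to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE;
}

/* Filesystem blocks per RAID chunk, i.e. the ext2 "stride" value. */
static uint64_t stride_blocks(uint64_t chunk_bytes, uint64_t fs_block_bytes)
{
    assert(fs_block_bytes != 0);
    return chunk_bytes / fs_block_bytes;
}
```

With these, sectors_to_bytes(18874368) is exactly 9 GiB, sectors_to_bytes(1280)
is ten 64k chunks, and stride_blocks(64 * 1024, 4096) gives the stride of 16
passed to mke2fs.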

This is not crypto or IV related. Next, I created another crypto
algorithm, "aesfake", which does nothing but has all the characteristics
of AES (including CPU load). To my surprise, the problem still reproduces.
Below is the relevant routine from the crypto module:
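static void aes_fake_encrypt(struct crypto_tfm *tfm, u8 *dst, const u8 *src)
{
   u8 tmp[AES_BLOCK_SIZE];

   /* do the real AES work to keep the CPU load, but discard the result */
   aes_enc_blk(tfm, tmp, src);
   /* pass the plaintext through unchanged */
   memmove(dst, src, AES_BLOCK_SIZE);
}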

This is not an issue of wrong sector remapping or of data corruption
within dm-crypt itself. I added a huge per-cc allocation in dm-crypt
(about 100 MB per 9 GB volume), where I was tracking 3 things: the CRC of
a sector, the sector number, and the sector number written (i.e. I was
overwriting the first 4 bytes of each sector with the sector number, and
keeping the original data in memory). The "corrupting" and the "check" for
the sector number were done in crypt_convert_scatterlist. In addition, I
added code that calculated and stored a sector checksum on write and
verified it on read. The checksum calculation was in map, on a virgin bio
before dm-crypt did anything to it. Verification was right after
decryption, on the cloned bios. All of this yielded NOTHING, i.e.
everything worked as expected.
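
As a rough illustration of that kind of tracking, here is a minimal
userspace sketch. It is not the actual kernel patch: the struct layout, the
function names and the choice of CRC-32 are all assumptions made for the
example. It only shows the idea of recording a checksum per sector on write
and comparing it on read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* Hypothetical per-sector record; the real patch may have used another layout. */
struct sector_track {
    uint32_t crc;      /* checksum recorded when the sector was written */
    uint8_t  written;  /* nonzero once the sector has been written */
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); slow but dependency-free. */
static uint32_t crc32_buf(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Write path: remember the checksum of the plaintext sector. */
static void track_write(struct sector_track *tbl, uint64_t nsectors,
                        uint64_t sector, const uint8_t *data)
{
    assert(sector < nsectors);
    tbl[sector].crc = crc32_buf(data, SECTOR_SIZE);
    tbl[sector].written = 1;
}

/* Read path: 1 if the data matches (or was never written), 0 on mismatch. */
static int track_verify(const struct sector_track *tbl, uint64_t nsectors,
                        uint64_t sector, const uint8_t *data)
{
    assert(sector < nsectors);
    if (!tbl[sector].written)
        return 1;
    return tbl[sector].crc == crc32_buf(data, SECTOR_SIZE);
}
```

In this model track_write would be called from the map step on the
untouched bio, and track_verify right after decryption; a return value of 0
would correspond to a detected mismatch.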

The problem SEEMS to be in the read path. Since the corruption reproduces
even with a dummy no-op cipher, I started disabling various paths in
dm-crypt. I do this by turning dm-crypt into linear for some bios, by
adding the following code to the top of the map function:
   int bypass = 0;

   // bypass certain requests
   if (bio_data_dir(bio) == WRITE) bypass=1;
//    if (bio_data_dir(bio) == READ) bypass=1;

   if (bypass)
   {
       bio->bi_bdev = cc->dev->bdev;
       bio->bi_sector = cc->start + sector;

       mempool_free(io, cc->io_pool);
       return 1;
   }
It turns out that having the read path go through dm-crypt is a
requirement for trouble. If I bypass writes the problem still persists,
but bypassing reads eliminates the problem completely.

So far my theory is that some reads are just disappearing under heavy
load. This is the only explanation I can think of for why all the
CRC/sector checks pass (they occur in the endio routine) but corruption
still occurs.
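
One possible way to test that theory (purely a sketch, with made-up names,
and not something that has been tried here) would be to count reads as they
are submitted and as their completion runs. A check placed in endio can
never fire for a read that never reaches endio, but a nonzero difference
after the I/O has quiesced would show that some reads went missing. In the
kernel the counters would have to be atomic; this userspace model ignores
that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters; a kernel version would need atomic updates. */
struct read_stats {
    uint64_t submitted;  /* reads handed to the lower device */
    uint64_t completed;  /* reads whose completion callback ran */
};

/* Call when a read is passed down (e.g. from the map step). */
static void read_submitted(struct read_stats *s)
{
    s->submitted++;
}

/* Call from the completion (endio) path. */
static void read_completed(struct read_stats *s)
{
    assert(s->completed < s->submitted);
    s->completed++;
}

/* Reads still outstanding; nonzero after a quiesce means a read was lost. */
static uint64_t reads_outstanding(const struct read_stats *s)
{
    return s->submitted - s->completed;
}
```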

Does this make any sense? Does anyone have any ideas what to check next?
Is there a kernel branch in which this problem has already been fixed?
Any comments are welcome.

       Thanks, Andrey.

#!/bin/sh

#umount if mounted
umount /dev/mapper/v0
dmsetup remove v0  2> /dev/null

echo Dropping cache...
sync
echo 3 > /proc/sys/vm/drop_caches
sync

modprobe dm-crypt

# create a 9 GB crypt mapping with an offset of ten 64k blocks (LUKS emulation)

# we need a random key for e2fsck errors to pop up and to trigger trk crc failures
KEY128=`dd status=noxfer count=16 if=/dev/random bs=1 | md5sum | cut -d \  -f 1`
echo key=$KEY128

#echo "0 18874368 crypt aes-cbc-plain $KEY128 0 /dev/md1 1280" | dmsetup create v0
echo "0 18874368 crypt aesfake-ecb $KEY128 0 /dev/md1 1280" | dmsetup create v0
#echo "0 18874368 linear /dev/md1 1280" | dmsetup create v0

mke2fs -b 4096 -R stride=16 /dev/mapper/v0 1572864
mount /dev/mapper/v0 /mnt/vault
echo Copying files...
time unrar x -inul /home/lelik/tf1.rar /mnt/vault
umount /dev/mapper/v0

echo Dropping cache...
sync
echo 3 > /proc/sys/vm/drop_caches
sync

e2fsck /dev/mapper/v0           # <- this returns "clean"
e2fsck -f /dev/mapper/v0        # <- this reports corrupted inodes

---------------------------------------------------------------------
dm-crypt mailing list - http://www.saout.de/misc/dm-crypt/