Hi,

I have started to test md raid1 with one SSD and one HDD on a 3.7.1 kernel (it has trim/discard support on raid1). The array has the write-behind option enabled and the HDD has the write-mostly option enabled. The original idea of the write-mostly option was "Read requests will only be sent if there is no other option."

My first simple test workload was building the latest stable kernel (3.7.1) using 16 threads. But I saw some reads from the HDD that were independent of the write workload, and I also saw a read await of more than 1000 ms on the HDD while the SSD had an await of about 1 ms. (I only used iostat -x.) I wanted to know why.

I searched the source code and found the read_balance function in raid1.c. If I read and understand this code correctly, it does the following: if a device has the "write mostly" option and we have not yet selected a device for reading (and the is_badblock check does not rule the device out), the code selects this device directly. This direct selection may be a mistake, because it can be overridden only in special cases: if another possible device (without the write-mostly option) is idle, or if the request is part of a sequential read. The standard way read_balance works is to search for the nearest and/or the least used device. Such a device is used only if we do not already have a directly selected device (including one selected in the write-mostly code path).

I think the code sequence best_disk = disk; continue; in the main for loop is not the best way, and that setting best_pending_disk = disk; best_dist_disk = disk; is better because it gives a chance to find a better alternative. In other words, change the direct selection into the worst possible alternative. But I am not sure about all cases.

I made 2 versions of a small patch that does this: it changes the direct selection so that the write-mostly device is recorded only as the most distant and most pending possible device. The safe version is safe and reliable against future changes; the "now" version is minimal for the current code (up to 3.7.5). This patch works well for me.
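To make the problem easier to see, here is a minimal userspace sketch of the selection logic as I understand it. It is not the kernel code: struct mirror, pick_read_disk and the patched flag are made-up names for illustration, and bad-block handling, choose_first and the next_seq_sect check are left out. The patched branch follows the safe version of the patch.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for one mirror leg; not the kernel struct. */
struct mirror {
	bool write_mostly;	/* models the WriteMostly flag */
	bool nonrot;		/* models a non-rotational device (SSD) */
	int pending;		/* models the number of in-flight requests */
	long head_position;	/* last known position, in sectors */
};

/*
 * Simplified model of the read_balance selection loop.
 * patched == false: a write-mostly leg is taken directly as best_disk.
 * patched == true:  a write-mostly leg is only recorded as a fallback
 *                   candidate, so any normal leg seen in the loop replaces it.
 */
static int pick_read_disk(const struct mirror *m, int n, long sector,
			  bool patched)
{
	int best_disk = -1, best_dist_disk = -1, best_pending_disk = -1;
	long best_dist = LONG_MAX;
	int min_pending = INT_MAX;
	bool has_nonrot = false;

	for (int disk = 0; disk < n; disk++) {
		if (m[disk].write_mostly) {
			if (!patched) {
				if (best_disk < 0)
					best_disk = disk;
			} else {
				if (best_dist_disk < 0)
					best_dist_disk = disk;
				if (best_pending_disk < 0)
					best_pending_disk = disk;
			}
			continue;
		}

		has_nonrot |= m[disk].nonrot;
		long dist = labs(sector - m[disk].head_position);

		/* Head already at this sector, or device idle: take it and stop. */
		if (dist == 0 || m[disk].pending == 0) {
			best_disk = disk;
			break;
		}

		/* Otherwise only remember it as a candidate. */
		if (m[disk].pending < min_pending) {
			min_pending = m[disk].pending;
			best_pending_disk = disk;
		}
		if (dist < best_dist) {
			best_dist = dist;
			best_dist_disk = disk;
		}
	}

	/* The candidates are consulted only if nothing was selected directly. */
	if (best_disk < 0)
		best_disk = has_nonrot ? best_pending_disk : best_dist_disk;
	return best_disk;
}
```

With a write-mostly HDD and a busy, non-sequential SSD, the unpatched model returns the HDD regardless of device order, while the patched model returns the SSD. If the write-mostly device is the only one available, both variants still return it.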
I can mark the SSD as failed, remove it from the array and add it back under workload without any trouble or additional kernel log messages.

I attach my patches to this email.

Best regards,
Tomas Hodek
diff -ur linux-3.7.1-old/drivers/md/raid1.c linux-3.7.1-new/drivers/md/raid1.c
--- linux-3.7.1-old/drivers/md/raid1.c	2012-12-17 20:14:54.000000000 +0100
+++ linux-3.7.1-new/drivers/md/raid1.c	2013-01-09 20:57:47.924610501 +0100
@@ -548,7 +548,7 @@
 		if (test_bit(WriteMostly, &rdev->flags)) {
 			/* Don't balance among write-mostly, just
 			 * use the first as a last resort */
-			if (best_disk < 0) {
+			if (best_dist_disk < 0 || best_pending_disk < 0) {
 				if (is_badblock(rdev, this_sector, sectors,
 						&first_bad, &bad_sectors)) {
 					if (first_bad < this_sector)
@@ -557,7 +557,10 @@
 					best_good_sectors = first_bad - this_sector;
 				} else
 					best_good_sectors = sectors;
-				best_disk = disk;
+				if (best_dist_disk < 0)
+					best_dist_disk = disk;
+				if (best_pending_disk < 0)
+					best_pending_disk = disk;
 			}
 			continue;
 		}
diff -ur linux-3.7.1-old/drivers/md/raid1.c linux-3.7.1-new/drivers/md/raid1.c
--- linux-3.7.1-old/drivers/md/raid1.c	2012-12-17 20:14:54.000000000 +0100
+++ linux-3.7.1-new/drivers/md/raid1.c	2013-01-09 20:57:47.924610501 +0100
@@ -548,7 +548,7 @@
 		if (test_bit(WriteMostly, &rdev->flags)) {
 			/* Don't balance among write-mostly, just
 			 * use the first as a last resort */
-			if (best_disk < 0) {
+			if (best_dist_disk < 0) {
 				if (is_badblock(rdev, this_sector, sectors,
 						&first_bad, &bad_sectors)) {
 					if (first_bad < this_sector)
@@ -557,7 +557,8 @@
 					best_good_sectors = first_bad - this_sector;
 				} else
 					best_good_sectors = sectors;
-				best_disk = disk;
+				best_dist_disk = disk;
+				best_pending_disk = disk;
 			}
 			continue;
 		}