Hi Neil,

Thank you for your comments. Yes, apparently in this case md/raid1 was
not the cause. I studied the code more by adding prints and following
the resync flow, but I cannot see any obvious problem.

I did find another case in which raid1 resync can read phantom data,
although this is not what happened to us:

# raid1 has 3 disks A, B, C and is resyncing after an unclean shutdown;
sync_request() always selects disk A as read_disk
# the application reads from a far sector (beyond next_resync), so
read_balance() selects disk A to read from (it is the first one)
# disk A fails
# resync aborts and restarts; now sync_request() reads from B and syncs
into C
# the application reads again from the same far sector, and now
read_balance() selects disk B to read from

So potentially we could get different data from these two reads. In our
case, though, there were no disk failures.

FWIW, the raid1 code I was once responsible for treated this situation
as follows:

# a READ comes from the application
# raid1 sees that it is resyncing, so it locks the relevant area of the
raid1 and syncs it; then it unlocks and proceeds to serve the READ
normally
# the resync thread comes to the appropriate area, locks it, sees that
it has already been synced (the bits are off in the bitmap), and
proceeds further

However, in md/raid1 there is currently no mechanism that can lock a
part of the raid. We only have raise_barrier()/wait_barrier(), which
effectively lock the whole capacity.
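If such a mechanism existed, the READ path could look roughly like the
sketch below. This is pseudo-code only: region_lock(), region_unlock(),
region_needs_sync(), sync_region() and generic_serve_read() are all
invented names, and no such helpers exist in md/raid1 today.

	/* Sketch only: every helper called here is hypothetical. */
	static void handle_read_during_resync(struct r1conf *conf,
					      struct bio *bio)
	{
		sector_t start = bio->bi_sector;
		sector_t len = bio_sectors(bio);

		/* Keep resync (and conflicting I/O) out of just this
		 * sector range, instead of barriering the whole array. */
		region_lock(conf, start, len);

		/* If the bitmap says the range is still dirty, sync it
		 * now; afterwards all mirrors hold identical data for
		 * this range. */
		if (region_needs_sync(conf, start, len))
			sync_region(conf, start, len);

		region_unlock(conf, start, len);

		/* The READ can now be served with normal read-balancing;
		 * when the resync thread later reaches this range, it
		 * finds the bitmap bits already clear and skips it. */
		generic_serve_read(conf, bio);
	}

With something like this, the resync thread can never race with the
READ over that range, and the barrier stays scoped to the sectors that
actually need it.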
Is it, for example, reasonable to READ the data as you normally do,
then to trigger a WRITE with the same data, and only then to complete
the original READ?
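In other words, something along these lines (again pseudo-code only;
apart from read_balance(), every helper name here is invented, and all
error handling is omitted):

	/* Sketch only: serve a READ that races with resync by writing
	 * the data back to the mirrors before completing it. */
	static void read_then_write_back(struct r1conf *conf,
					 struct r1bio *r1_bio)
	{
		int max_sectors;
		int disk;

		/* 1. Read from the disk that read_balance() picks,
		 *    exactly as we do today. */
		disk = read_balance(conf, r1_bio, &max_sectors);
		submit_read(conf, r1_bio, disk);
		wait_for_read_completion(r1_bio);

		/* 2. Write the same data to all in-sync mirrors, so a
		 *    later READ of this range cannot return different
		 *    data from another disk. */
		submit_write_same_data(conf, r1_bio);
		wait_for_write_completion(r1_bio);

		/* 3. Only now complete the original READ towards the
		 *    caller. */
		complete_read(r1_bio);
	}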
There are a lot of inefficiencies here, I know, like re-writing the
same data again on read_disk and syncing this data again later. (I
know, patches are welcome...)

Thanks,
Alex.

On Tue, Jan 7, 2014 at 1:56 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Mon, 6 Jan 2014 18:58:30 +0200 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Hi Neil,
>>
>> Looking again at the resync/rebuild flow in raid1, I see that
>> sync_request() calls
>>     raise_barrier(conf);
>> and only then sets
>>     conf->next_resync = sector_nr;
>>
>> while a READ request calls
>>     wait_barrier(conf);
>> and only later
>>     rdisk = read_balance(conf, r1_bio, &max_sectors);
>> which is the one looking at next_resync.
>>
>> Basically, this guarantees that while a READ is looking at next_resync,
>> there are no sync requests in flight, so the area up to next_resync has
>> for sure already been synced. So my previous analysis seems to be
>> incorrect? Or can the next_resync race still somehow happen?
>
> I know I agreed with you last time when you were wrong, so agreeing with
> you again probably won't sound very convincing, but it seems you are
> right. The RAID1 code doesn't seem to have a race here (which is good
> news to me, of course).
>
> So you have a corruption which could be caused by md/raid1, or by ext4fs,
> or by mysql. Or maybe by your dm layer. And you cannot easily reproduce
> it. That is not going to be easy to fix :-( I'm afraid there is nothing
> I can suggest.
>
> NeilBrown
>
>>
>> Thanks,
>> Alex.
>>
>>
>> On Sun, Dec 29, 2013 at 11:41 PM, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Sun, 29 Dec 2013 13:27:35 +0200 Alexander Lyakas
>> > <alex.bolshoy@xxxxxxxxx> wrote:
>> >
>> >> Hi Neil,
>> >> I see in the code that read_balance() is careful not to do
>> >> read-balancing if the area in question is under resync:
>> >>
>> >>     if (conf->mddev->recovery_cp < MaxSector &&
>> >>         (this_sector + sectors >= conf->next_resync)) {
>> >>             choose_first = 1;
>> >>             start_disk = 0;
>> >>     } else {
>> >>             choose_first = 0;
>> >>             start_disk = conf->last_used;
>> >>     }
>> >>
>> >> However, I see that next_resync is updated in sync_request():
>> >>     conf->next_resync = sector_nr;
>> >> before actually completing the resync of that area, before even
>> >> submitting the read-bios. Is this correct? Should we perhaps update
>> >> next_resync at the point where we update curr_resync_completed in
>> >> md_do_sync()?
>> >
>> > Your analysis appears correct. Thanks!
>> > Can you send a patch? If so, could you check whether raid10 needs a
>> > similar fix?
>> >
>> > Thanks,
>> > NeilBrown
>> >
>> >
>> >> Kernel that I am looking at is Ubuntu-Precise 3.2.0-29.46.
>> >>
>> >> Below is the sequence of events that led us to this question.
>> >>
>> >> We have a 3-disk RAID1, with ext4 mounted on top and mysql running,
>> >> whose files are on that ext4.
>> >>
>> >> # The 3-disk RAID1 was recovering from an unclean shutdown (crash)
>> >> of the machine; it started a resync:
>> >>
>> >> Dec 26 08:05:42 vc kernel: [11576870.520393] md: md1 stopped.
>> >> Dec 26 08:05:42 vc kernel: [11576870.526898] md: bind<dm-2>
>> >> Dec 26 08:05:42 vc kernel: [11576870.534819] md: bind<dm-3>
>> >> Dec 26 08:05:42 vc kernel: [11576870.535908] md: bind<dm-1>
>> >> Dec 26 08:05:42 vc kernel: [11576870.580730] md/raid1:md1: not clean -- starting background reconstruction
>> >> Dec 26 08:05:42 vc kernel: [11576870.580735] md/raid1:md1: active with 3 out of 3 mirrors
>> >> Dec 26 08:05:42 vc kernel: [11576870.581399] created bitmap (1 pages) for device md1
>> >> Dec 26 08:05:42 vc kernel: [11576870.582386] md1: bitmap initialized from disk: read 1/1 pages, set 5 of 5 bits
>> >> Dec 26 08:05:42 vc kernel: [11576870.583335] md1: detected capacity change from 0 to 320773120
>> >> Dec 26 08:05:42 vc kernel: [11576870.585593] md: resync of RAID array md1
>> >> Dec 26 08:05:42 vc kernel: [11576870.585597] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
>> >> Dec 26 08:05:42 vc kernel: [11576870.585601] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
>> >> Dec 26 08:05:42 vc kernel: [11576870.585617] md: using 128k window, over a total of 313255k.
>> >> Dec 26 08:05:42 vc kernel: [11576870.687868] md1: unknown partition table
>> >>
>> >> # Meanwhile, ext4 was mounted on top of it (dm-4 is a devicemapper
>> >> target on top of MD1, which collects some stats but otherwise does
>> >> not interfere with bios in any way). ext4 also realized the unclean
>> >> shutdown:
>> >>
>> >> Dec 26 08:05:42 vc kernel: [11576870.936166] [14831]dm_iostat [dm_iostat_ctr] Successfully set up dm (252:4) iostats on top of /dev/md1 (9:1), flags=0x80
>> >> Dec 26 08:05:43 vc kernel: [11576871.159955] [EXT4 FS bs=1024, gc=39, bpg=8192, ipg=2016, mo=9c02c818, mo2=0000]
>> >> Dec 26 08:05:43 vc kernel: [11576871.160092] EXT4-fs (dm-4): recovery complete
>> >> Dec 26 08:05:43 vc kernel: [11576871.161099] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: debug,data_err=abort
>> >>
>> >> # mysql was started on top of ext4, and also realized the unclean
>> >> shutdown:
>> >>
>> >> InnoDB: Log scan progressed past the checkpoint lsn 18148639
>> >> 131226 8:05:44 InnoDB: Database was not shut down normally!
>> >> InnoDB: Starting crash recovery.
>> >> InnoDB: Reading tablespace information from the .ibd files...
>> >> InnoDB: Restoring possible half-written data pages from the doublewrite
>> >> InnoDB: buffer...
>> >> InnoDB: Doing recovery: scanned up to log sequence number 18154016
>> >> 131226 8:05:44 InnoDB: Starting an apply batch of log records to the database...
>> >> InnoDB: Progress in percents: 11 12 13 14 15 16 17 18 19 20 21 22 23
>> >> 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
>> >> 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
>> >> 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
>> >> 93 94 95 96 97 98 99
>> >> InnoDB: Apply batch completed
>> >> 131226 8:05:44 InnoDB: Waiting for the background threads to start
>> >>
>> >> # 8 seconds later, the raid1 resync completed:
>> >>
>> >> Dec 26 08:05:52 vc kernel: [11576880.161306] md: md1: resync done.
>> >> Dec 26 08:05:52 vc kernel: [11576880.221701] RAID1 conf printout:
>> >> Dec 26 08:05:52 vc kernel: [11576880.221705] --- wd:3 rd:3
>> >> Dec 26 08:05:52 vc kernel: [11576880.221709] disk 0, wo:0, o:1, dev:dm-1
>> >> Dec 26 08:05:52 vc kernel: [11576880.221712] disk 1, wo:0, o:1, dev:dm-2
>> >> Dec 26 08:05:52 vc kernel: [11576880.221714] disk 2, wo:0, o:1, dev:dm-3
>> >>
>> >> # The system ran normally for about 5 minutes, and then was shut
>> >> down cleanly:
>> >>
>> >> # mysql was stopped cleanly
>> >> 131226 8:11:28 [Note] /usr/sbin/mysqld: Normal shutdown
>> >> 131226 8:11:28 [Note] Event Scheduler: Purging the queue. 0 events
>> >> 131226 8:11:30 [Warning] /usr/sbin/mysqld: Forcing close of thread 68 user: 'change_vsa_user'
>> >> 131226 8:11:30 [Warning] /usr/sbin/mysqld: Forcing close of thread 59 user: 'standbyVc0'
>> >> 131226 8:11:30 InnoDB: Starting shutdown...
>> >> 131226 8:11:31 InnoDB: Waiting for 3 pages to be flushed
>> >> 131226 8:11:31 InnoDB: Shutdown completed; log sequence number 18626878
>> >> 131226 8:11:31 [Note] /usr/sbin/mysqld: Shutdown complete
>> >>
>> >> # ext4 was cleanly unmounted (there is no kernel print for that)
>> >> # md1 was stopped cleanly
>> >> Dec 26 08:11:31 vc kernel: [11577219.716151] md1: detected capacity change from 320773120 to 0
>> >> Dec 26 08:11:31 vc kernel: [11577219.716158] md: md1 stopped.
>> >> Dec 26 08:11:31 vc kernel: [11577219.716167] md: unbind<dm-1>
>> >> Dec 26 08:11:31 vc kernel: [11577219.740152] md: export_rdev(dm-1)
>> >> Dec 26 08:11:31 vc kernel: [11577219.740174] md: unbind<dm-3>
>> >> Dec 26 08:11:31 vc kernel: [11577219.772137] md: export_rdev(dm-3)
>> >> Dec 26 08:11:31 vc kernel: [11577219.772158] md: unbind<dm-2>
>> >> Dec 26 08:11:31 vc kernel: [11577219.810269] md: export_rdev(dm-2)
>> >>
>> >> # Then the system was brought up again
>> >> # md1 was assembled with a clean bitmap
>> >> Dec 26 08:11:40 vc kernel: [ 337.918676] md: md1 stopped.
>> >> Dec 26 08:11:40 vc kernel: [ 337.923588] md: bind<dm-2>
>> >> Dec 26 08:11:40 vc kernel: [ 337.924085] md: bind<dm-3>
>> >> Dec 26 08:11:40 vc kernel: [ 337.924511] md: bind<dm-1>
>> >> Dec 26 08:11:40 vc kernel: [ 337.927127] md: raid1 personality registered for level 1
>> >> Dec 26 08:11:40 vc kernel: [ 337.927264] bio: create slab <bio-1> at 1
>> >> Dec 26 08:11:40 vc kernel: [ 337.927435] md/raid1:md1: active with 3 out of 3 mirrors
>> >> Dec 26 08:11:40 vc kernel: [ 337.927826] created bitmap (1 pages) for device md1
>> >> Dec 26 08:11:40 vc kernel: [ 337.928341] md1: bitmap initialized from disk: read 1/1 pages, set 0 of 5 bits
>> >> Dec 26 08:11:40 vc kernel: [ 337.929316] md1: detected capacity change from 0 to 320773120
>> >> Dec 26 08:11:40 vc kernel: [ 337.934610] md1: unknown partition table
>> >>
>> >> # ext4 was mounted cleanly
>> >> Dec 26 08:11:40 vc kernel: [ 338.036300] [6536]dm_iostat [dm_iostat_ctr] Successfully set up dm (252:4) iostats on top of /dev/md1 (9:1), flags=0x80
>> >> Dec 26 08:11:40 vc kernel: [ 338.063921] [EXT4 FS bs=1024, gc=39, bpg=8192, ipg=2016, mo=9c02c818, mo2=0000]
>> >> Dec 26 08:11:40 vc kernel: [ 338.064104] EXT4-fs (dm-4): mounted filesystem with ordered data mode. Opts: debug,data_err=abort
>> >>
>> >> # but then mysql detected a corruption
>> >> 131226 8:11:40 InnoDB: Error: space id and page n:o stored in the page
>> >> InnoDB: read in are 543256175:543236195, should be 0:595!
>> >> InnoDB: Database page corruption on disk or a failed
>> >> InnoDB: file read of page 595.
>> >> InnoDB: You may have to recover from a backup.
>> >> 131226 8:11:40 InnoDB: Page dump in ascii and hex (16384 bytes):
>> >>
>> >> It happened only once, and we do a lot of testing of such unclean
>> >> shutdowns. The only theory that we have is that the raid1 resync
>> >> somehow changed the underlying data of mysql/ext4.
>> >>
>> >> Thanks,
>> >> Alex.