On Tue 19 May 2015 09:23:20 AM Phil Turmel wrote:
> On 05/19/2015 08:50 AM, Thomas Fjellstrom wrote:
> > On Tue 19 May 2015 08:34:55 AM Phil Turmel wrote:
> >> Based on the smart report, this drive is perfectly healthy.  A small
> >> number of uncorrectable read errors is normal in the life of any drive.
> >
> > Is it perfectly normal for the same sector to be reported uncorrectable 5
> > times in a row like it did?
>
> Yes, if you keep trying to read it.  Unreadable sectors stay unreadable,
> generally, until they are re-written.  That's the first opportunity the
> drive has to decide if a relocation is necessary.
>
> > How many UREs are considered "ok"? Tens, hundreds, thousands, tens of
> > thousands?
>
> Depends.  In a properly functioning array that gets scrubbed
> occasionally, or sufficiently heavy use to read the entire contents
> occasionally, the UREs get rewritten by MD right away.  Any UREs then
> only show up once.

I have made sure that it's doing regular scrubs, and regular SMART scans.
This time...

> In a desktop environment, or non-raid, or improperly configured raid,
> the UREs will build up, and get reported on every read attempt.
>
> Most consumer-grade drives claim a URE average below 1 per 1E14 bits
> read.  So by the end of their warranty period, getting one every 12TB
> read wouldn't be unusual.  This sort of thing follows a Poisson
> distribution:
>
> http://marc.info/?l=linux-raid&m=135863964624202&w=2
>
> > These drives have been barely used. Most of their life, they were either
> > off, or not actually being used. (it took a while to collect enough 3TB
> > drives, and then find time to build the array, and set it up as a regular
> > backup of my 11TB nas).
>
> While being off may lengthen their life somewhat, the magnetic domains
> on these things are so small that some degradation will happen just
> sitting there.  Diffusion in the p- and n-doped regions of the
> semiconductors is also happening while sitting unused, degrading the
> electronics.
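
For my own sanity, here is a rough sketch of the arithmetic behind the
1-per-1E14-bits figure quoted above. It assumes UREs arrive as a Poisson
process at exactly that worst-case rate (the spec is an upper bound, so a
healthy drive should do better) and it uses decimal terabytes. The function
names are my own, nothing standard.

```python
import math

# Assumption: decimal terabytes, 1 TB = 1e12 bytes = 8e12 bits.
BITS_PER_TB = 8e12


def expected_ures(tb_read, bits_per_ure=1e14):
    """Mean number of UREs for tb_read terabytes at the quoted rate."""
    return tb_read * BITS_PER_TB / bits_per_ure


def prob_at_least_one_ure(tb_read, bits_per_ure=1e14):
    """Poisson probability of seeing one or more UREs: 1 - exp(-mean)."""
    return 1.0 - math.exp(-expected_ures(tb_read, bits_per_ure))


if __name__ == "__main__":
    # 3 TB is one drive, 11 TB is the NAS being backed up,
    # 12.5 TB is exactly 1E14 bits.
    for tb in (3, 11, 12.5):
        print(tb, "TB:", expected_ures(tb), prob_at_least_one_ure(tb))
```

At that rate 1E14 bits works out to 12.5 TB, so reading that much gives an
expected count of 1 and roughly a 63% chance of hitting at least one URE.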
>
> >> It has no relocations, and no pending sectors.  The latency spikes are
> >> likely due to slow degradation of some sectors that the drive is having
> >> to internally retry to read successfully.  Again, normal.
> >
> > The latency spikes are /very/ regular and there's quite a lot of them.
> > See: http://i.imgur.com/QjTl6o3.png
>
> Interesting.  I suspect that if you wipe that disk with noise, read it
> all back, and wipe it again, you'll have a handful of relocations.

It looks like each one of the blocks in that display is 128KiB, which I
think means those red blocks aren't very far apart. Maybe 80MiB apart? Would
it reallocate all of those? That'd be a lot of reallocated sectors.

> Your latency test will show different numbers then, as the head will
> have to seek to the spare sector and back whenever you read through one
> of those spots.
>
> Or the rewrites will fix them all, and you'll have no further problems.
> Hard to tell.  Bottom line is that drives can't fix any problems they
> have unless they are *written* in previously identified problem areas.
>
> >> I own some "DM001" drives -- they are unsuited to raid duty as they
> >> don't support ERC.  So, out of the box, they are time bombs for any
> >> array you put them in.  That's almost certainly why they were ejected
> >> from your array.
> >>
> >> If you absolutely must use them, you *must* set the *driver* timeout to
> >> 120 seconds or more.
> >
> > I've been planning on looking into the ERC stuff. I now actually have some
> > drives that do support ERC, so it'll be interesting to make sure
> > everything is set up properly.
>
> You have it backwards.  If you have WD Reds, they are correct out of the
> box.  It's when you *don't* have ERC support, or you only have desktop
> ERC, that you need to take special action.

I was under the impression you still had to enable ERC on boot. And I
/thought/ I read that you still want to adjust the timeouts, though not to
the same values as for consumer drives.
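
Something like the following is what I have in mind for a boot-time step;
treat it as an untested sketch, not a recommendation. It tries to enable ERC
with "smartctl -l scterc" and, if that fails, raises the driver timeout
through /sys/block/<dev>/device/timeout instead. Assumptions on my part: the
7.0 second ERC value, the 180 second fallback (anything at or above the 120
seconds mentioned above), that smartctl exits non-zero when the drive rejects
the command, and all of the function names.

```python
import subprocess
from pathlib import Path

ERC_DECISECONDS = 70       # assumption: 7.0 s read/write recovery limit
FALLBACK_TIMEOUT_S = 180   # assumption: my pick, must be 120 s or more


def erc_command(dev):
    """smartctl invocation that asks the drive to enable SCT ERC."""
    value = f"scterc,{ERC_DECISECONDS},{ERC_DECISECONDS}"
    return ["smartctl", "-l", value, f"/dev/{dev}"]


def timeout_path(dev):
    """sysfs file holding the kernel driver's command timeout in seconds."""
    return Path("/sys/block") / dev / "device" / "timeout"


def write_timeout(dev, seconds):
    """Raise the driver timeout for one drive (needs root)."""
    timeout_path(dev).write_text(f"{seconds}\n")


def configure_drive(dev, run=subprocess.run, write=write_timeout):
    """Enable ERC if the drive accepts it, else lengthen the driver timeout.

    Returns "erc" or "timeout" so the caller can log what was done.
    run and write are injectable so this can be exercised without hardware.
    """
    result = run(erc_command(dev), capture_output=True, text=True)
    if result.returncode == 0:
        return "erc"
    write(dev, FALLBACK_TIMEOUT_S)
    return "timeout"


def main():
    # Assumption: every sdX device should be handled the same way.
    for dev in sorted(p.name for p in Path("/sys/block").glob("sd*")):
        print(dev, configure_drive(dev))
```

Calling main() as root from a boot script (or configure_drive() per device
from a udev rule) would be the idea, once I have checked it against the
drives I actually have.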

> If you have consumer grade drives in a raid array, and you don't have
> boot scripts or udev rules to deal with timeout mismatch, your *ss is
> hanging in the wind.  The links in my last msg should help you out.

There was some talk of ERC/TLER and md. I'll still have to find or write a
script to properly set up timeouts and enable TLER on drives capable of it
(that don't come with it enabled by default).

> Also, I noticed that you used "smartctl -a" to post a complete report of
> your drive's status.  It's not complete.  You should get in the habit of
> using "smartctl -x" instead, so you see the ERC status, too.

Good to know. Thanks.

> Phil
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Thomas Fjellstrom
thomas@xxxxxxxxxxxxx