[solved] disk read errors, is it cabling or the disk itself?



Hi all,

Just encountered my first ever Array Health Fail notification. Apparently my disk1 has/had read errors.

This happened a few hours after I upgraded my CPU, which required me to unplug all data and power cables to all my disks. So hopefully (most likely) this is just a cabling issue.

The disk has passed a short and a long SMART test after the read errors. But I've noticed the Raw_Read_Error_Rate counter has increased over the last few days, which I thought indicated an internal disk problem, not cabling?

Diagnostics attached. I have read a fair few older posts with similar issues, but I'm not confident in my ability to parse the syslog and SMART reports correctly. If someone more knowledgeable than me could have a look and suggest next steps, that would be greatly appreciated.

 

Thanks in advance

Jorgen

 

First read error in syslog:

Jul  2 17:23:26 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul  2 17:23:26 Tower kernel: ata6.00: irq_stat 0x40000001
Jul  2 17:23:26 Tower kernel: ata6.00: failed command: READ DMA EXT
Jul  2 17:23:26 Tower kernel: ata6.00: cmd 25/00:08:40:00:48/00:00:6f:01:00/e0 tag 18 dma 4096 in
Jul  2 17:23:26 Tower kernel:         res 51/40:08:40:00:48/00:00:6f:01:00/e0 Emask 0x9 (media error)
Jul  2 17:23:26 Tower kernel: ata6.00: status: { DRDY ERR }
Jul  2 17:23:26 Tower kernel: ata6.00: error: { UNC }
Jul  2 17:23:26 Tower kernel: ata6.00: configured for UDMA/133
Jul  2 17:23:26 Tower kernel: sd 6:0:0:0: [sdf] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jul  2 17:23:26 Tower kernel: sd 6:0:0:0: [sdf] tag#18 Sense Key : 0x3 [current] 
Jul  2 17:23:26 Tower kernel: sd 6:0:0:0: [sdf] tag#18 ASC=0x11 ASCQ=0x4 
Jul  2 17:23:26 Tower kernel: sd 6:0:0:0: [sdf] tag#18 CDB: opcode=0x88 88 00 00 00 00 01 6f 48 00 40 00 00 00 08 00 00
Jul  2 17:23:26 Tower kernel: blk_update_request: I/O error, dev sdf, sector 6161956928
Jul  2 17:23:26 Tower kernel: ata6: EH complete
Jul  2 17:23:26 Tower kernel: md: disk1 read error, sector=6161956864
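
For readers parsing a log like this by hand, here is a minimal sketch of how the numbers in the excerpt relate to each other. It decodes the READ(16) command bytes from the "CDB: opcode=0x88" line and compares the result with the sector reported by md. The function name is made up for illustration. Reading the 64-sector difference as the partition's starting offset on the disk is an assumption about this disk's layout, not something the log states.

```python
def decode_read16_cdb(cdb_hex: str) -> tuple[int, int]:
    """Return (starting LBA, sector count) from a SCSI READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("expected a 16-byte READ(16) CDB (opcode 0x88)")
    lba = int.from_bytes(cdb[2:10], "big")     # bytes 2-9: logical block address
    count = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in sectors
    return lba, count


# CDB bytes copied from the syslog excerpt above.
lba, count = decode_read16_cdb("88 00 00 00 00 01 6f 48 00 40 00 00 00 08 00 00")
md_sector = 6161956864  # from the "md: disk1 read error" line

print(lba)              # 6161956928, same as the blk_update_request sector
print(count)            # 8 sectors, i.e. 8 * 512 = 4096 bytes ("dma 4096")
print(lba - md_sector)  # 64
```

So the kernel's device-level sector and the md sector refer to the same failed read, just counted from different starting points.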

 

 

tower-diagnostics-20170705-1059.zip

Link to comment
It looks more like a disk problem, though SMART is still OK for now. Swap cables with another disk; if there are more errors in the near future, it may be a bad disk.

Thanks, appreciate your advice.

I've reseated the cables for now, but haven't swapped them around yet. Will keep a close eye on things and try your suggestion if it keeps happening.

 

 


Link to comment

Update:

Read errors keep occurring. Raw Read Error Rate = 66 today (raw value), from 48 only 3 days ago.

It looks like different sectors each time, which I take as an indicator of problems outside the disk. On top of that, the short and long SMART tests do not increase the read error count either.
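
One way to check the "different sectors each time" observation without reading the syslog by eye is to pull out the md read-error lines and count how often each sector appears. This is only a sketch: the regular expression is modelled on the single md line quoted earlier in the thread, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on: "md: disk1 read error, sector=6161956864"
MD_READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def read_error_sectors(lines):
    """Count md read errors per (disk, sector) pair found in syslog lines."""
    hits = Counter()
    for line in lines:
        match = MD_READ_ERROR.search(line)
        if match:
            hits[(match.group(1), int(match.group(2)))] += 1
    return hits


sample = [
    "Jul  2 17:23:26 Tower kernel: ata6: EH complete",
    "Jul  2 17:23:26 Tower kernel: md: disk1 read error, sector=6161956864",
]
hits = read_error_sectors(sample)
repeated = {key: n for key, n in hits.items() if n > 1}
print(len(hits), "distinct sectors;", len(repeated), "seen more than once")
```

For a real log, an open file object can be passed instead of the sample list, since the function only iterates over lines. Sectors that keep reappearing would be more suspicious than ones that never repeat.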

 

Opened the case to follow the advice from @johnnie.black and noticed that the motherboard SATA socket that disk1 is plugged into is very loose compared to the other sockets. This is now my prime suspect.

 

I've moved the disk1 SATA cable to another motherboard socket. I did not change the cable itself, but did reseat it on the disk end once more.

Back to monitoring mode.

Fresh diagnostic attached for completeness.

 

tower-diagnostics-20170710-2215.zip

Link to comment
2 minutes ago, Jorgen said:

Raw Read Error Rate = 66 today (raw value)

 

This attribute is okay:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       66

The raw value is not important for this one. The current (normalized) value is 200 and the threshold is 51, so you only need to worry if the value drops to 51 or lower.
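
The comparison described above can be sketched in a few lines. The row is the one quoted in this post; the column positions assume the usual `smartctl -A` table layout shown in the header line, and the function names are made up for illustration.

```python
def parse_attribute_row(row: str) -> dict:
    """Split one row of a `smartctl -A` attribute table into named fields."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),   # normalized VALUE column
        "worst": int(fields[4]),
        "thresh": int(fields[5]),  # THRESH column ("051" parses as 51)
        "raw": fields[9],          # RAW_VALUE column, kept as text
    }


def is_failing(attr: dict) -> bool:
    # smartctl considers an attribute failed when VALUE <= THRESH.
    return attr["value"] <= attr["thresh"]


row = "  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       66"
attr = parse_attribute_row(row)
print(attr["name"], attr["value"], attr["thresh"], is_failing(attr))
```

Note that the raw value (66) plays no part in the check; only the normalized value and the threshold do.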

 

Link to comment
 
 


Yes, agreed, by itself it's not important, but the increases seem to correlate with the read errors in the syslog that trigger scary warning emails.
Link to comment
  • 2 weeks later...
On 11/07/2017 at 0:23 AM, johnnie.black said:

 

Very doubtful.

 

You are of course right. The correlation was only that they both started happening on the same day after 2 years of flawless operation. But I have now seen both types of errors occur independently of each other.

 

Since my last post I've added a new 8TB drive and decided to move all content off disk1 (the one with read errors) onto the new disk4. Mainly because I wanted to convert disk1 to XFS.

During this move, I kept seeing lots of read errors, both in the syslog and in the raw SMART value (that I'm now trying my hardest to ignore). The move of 3TB+ resulted in about 1000 read errors according to the syslog. Diagnostics attached.

I didn't realise this before the data move, but I must have somehow moved disk1 back to the dodgy motherboard socket when I added the new disk4. Sigh. Although to be honest I don't have much choice. I only have 6 sata ports on the motherboard, and they are all occupied now.

 

So in an effort to pinpoint where the read errors are coming from, I'd like to stress test the motherboard port, disk1, and the cable between them, while changing only one thing at a time.

First test is the cable. I've replaced the old cable with a brand new one, but left disk1 connected to the suspect mb port.

 

So here's my question (finally): since there's no data on disk1 now, how do I trigger lots of reads? I need big volumes, because the errors are intermittent. Is there a way to run only the pre-read part of the pre-clear scripts? Or should I just copy big chunks of data to/from the disk?

 

tower-diagnostics-20170719-2027.zip

Link to comment
Did you ever change the cable?   The simplest way to confirm if this is a cable issue is to simply replace the SATA cable with a high-quality cable and be sure it's firmly seated on both ends.
 

Yes, cable is replaced. Just need to trigger lots of reads. Copying 1TB of files over to the disk now, will run extended smart after that followed by the pre-read portion of the pre-clear script.
If I get a read error I will swap to another mb socket and repeat the tests.
If I still get errors, it has to be the disk itself.
I'll report back...


Link to comment
32 minutes ago, Jorgen said:

Would the extended smart test stress the cable and mb socket though? I always thought the smart tests were performed internal to the drive?

 

You are correct. Something like a preclear (preread) would exercise the disk end to end (from computer to the drive including HBA, cabling, backplanes, etc).
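
As a rough illustration of what "exercising the disk end to end" means, the sketch below simply reads a file or device from start to finish through the operating system and counts failed reads. It is not the preclear script and makes no claim to match its behaviour; the function name is made up, and reading a raw disk device this way would normally need root privileges. Data that was written very recently may be served from RAM rather than the disk, so a read-through of a freshly written file can understate what the cable actually carried.

```python
import os
import sys
import tempfile


def read_through(path, chunk_size=1024 * 1024):
    """Read `path` sequentially; return (bytes read successfully, failed reads)."""
    offset = 0
    bytes_ok = 0
    errors = 0
    with open(path, "rb", buffering=0) as src:
        while True:
            try:
                chunk = src.read(chunk_size)
            except OSError as exc:
                # A failure here should also appear in the syslog.
                errors += 1
                print(f"read failed at offset {offset}: {exc}", file=sys.stderr)
                offset += chunk_size  # skip the unreadable chunk and carry on
                src.seek(offset)
                continue
            if not chunk:
                break
            offset += len(chunk)
            bytes_ok += len(chunk)
    return bytes_ok, errors


# Demo on a throwaway file; on a real system `path` would be the disk or a large file on it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 3_000_000)
bytes_ok, errors = read_through(tmp.name)
os.unlink(tmp.name)
print(bytes_ok, errors)
```

Unlike the drive's internal self-test, every byte here travels over the cable and through the controller, which is why this kind of read can expose problems the self-test cannot.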

 

The drive's extended test runs 100% on the drive. If the cabling works well enough to initiate the test, that is all that is required. I expect (though I'm not recommending it) that you could pull the SATA cable and the self-test would continue unaffected.

 

The nice thing about the self-test is that if it fails, all fingers and toes point at the drive. It can't be anything else. And if it runs successfully and there continue to be issues from the OS, you can be confident the problem is not the drive itself, but something like a cabling issue or an HBA compatibility issue.

 

Link to comment

Extended smart test came back clean as a whistle.

 

Moving on to the next step in my testing plan, and I'm struggling.

I'd like to run only the pre-read part of the preclear script, but it doesn't look like the preclear plugin or Joe L's or bjp999's scripts can do this. Or at least I can't work out how to.

The next problem is that disk1 has data on it and is assigned to the array already. And both the plugin and the scripts prevent you from running on a disk assigned to the array.

 

I'm open to moving the data off the disk and removing it from the array, then running a full pre-clear cycle before adding it back.

But I need help with the steps on how to do this without stuffing up my array or losing data. I've never taken a disk out of the array before.

 

If someone could point me in the right direction I'd be very grateful.

Link to comment

What you really need to do is anything that will read all of the data -- and then compare the SMART results before & after the process to see if anything changed.   That's essentially what the pre-read phase does, but as you noted you can't do this for a disk in the array.   Do you have another disk in the array with enough space to copy everything to it?  (or a disk somewhere else on your network)  ... if so, just copy all of the files to that other location (put them in a folder called "TrashforTest" -- which you can simply delete after the copy completes).

 

One thing that will NOT do is read the sectors that don't have data on them -- but from the tests you've done so far it seems reasonably certain that if you can read all of the data then the disk is fine.

 

 

 

Link to comment

Ah, thanks Garycase, I've clearly over-complicated this!

 

I have plenty of free space, so here's the plan:

1. Move all current data (800GB) off disk1.

2. Fill disk1 with copies of data from other disks. These files will go into a "temp test" folder at the root of disk1.

3. Move all the test data to a "temp test 2" folder on another disk.

4. Delete both "temp test" folders

 

Or maybe I could simplify it further:

If deleting or writing data also results in reads, I could remove step 3.

This seemed to happen when I moved the 800GB onto disk1.

Link to comment

Filling disk1 and then copying all that data to another location will definitely cause it to be thoroughly tested -- you'll be writing the entire disk and then reading all of the data you've just written.    I'd make a copy of the SMART results before and after you do this -- and then see if anything changed (other than the obvious -- i.e. it will have more power-on hours).   At this point, I think it's likely that nothing will change and you'll simply confirm that your issue was the cabling.
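
A small sketch of the before/after comparison suggested above, assuming the two SMART reports were saved as text in the usual `smartctl -A` table layout. The sample rows are illustrative only: the raw values 48 and 66 are the ones reported earlier in this thread, and the other columns are copied from the attribute row quoted there. The function names are made up.

```python
def raw_values(smart_table: str) -> dict:
    """Map attribute name -> RAW_VALUE for each row of a `smartctl -A` table."""
    values = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = " ".join(fields[9:])
    return values


def changed(before: str, after: str) -> dict:
    """Return {name: (old, new)} for attributes whose raw value differs."""
    old, new = raw_values(before), raw_values(after)
    return {name: (old.get(name), new[name]) for name in new if old.get(name) != new[name]}


HEADER = "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
# Illustrative rows; only the raw values (48, 66) come from the thread.
before = HEADER + "  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       48"
after = HEADER + "  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       66"

print(changed(before, after))
```

Attributes such as power-on hours will always show up in the difference, so the interesting entries are the ones that are not expected to move.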

 

 

Link to comment

Great, thanks.
I'll look into using sync with verification so I only have to copy it once and then delete it.

And I agree it looks like it was a dodgy cable, which is actually the best possible outcome.
The move has done 7 million reads now and not a single error.
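
For anyone wanting to try the copy-once-and-verify idea, here is a minimal sketch that copies a file and then compares SHA-256 checksums of the source and the copy. This is a stand-in technique chosen for illustration, not necessarily the tool or option meant above, and the function names are made up. As a caveat, a file that was just written may be read back from RAM rather than from the disk, so the verification pass does not guarantee that the data was re-read over the cable.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as src:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then confirm both files hash to the same value."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)


# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "source.bin")
    target = os.path.join(workdir, "copy.bin")
    with open(source, "wb") as fh:
        fh.write(os.urandom(2_000_000))
    ok = copy_and_verify(source, target)

print(ok)
```

A mismatch would mean the data changed somewhere between the two disks, which is worth cross-checking against the syslog for read errors at the same time.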



Link to comment
