[Solved?] 2 Failed Disks


Joseph


I've been chasing issues like this for months and still can't find the root cause. It seems to happen during heavy reads and writes to and from the SSD cache pool.

 

Yesterday, 2 drives got knocked offline and marked as bad during heavy reads and writes to and from the unRAID cache SSDs. The first failure happened in the middle of the night, while the mover was moving data off the cache AND jdownloader2 was writing to the cache. The second happened the next morning, while I was using jdownloader and unzipping files through a VM AND accessing shares over the network. The strange thing is, it doesn't appear that either of these HDDs was being written to or accessed at the time.

 

I've tried swapping data cables and double-checking power connectors, but the problem stays with at least one drive, which is a model designed for large-disk NAS environments. I recently checked SMART on all drives and at the time everything was OK, so I'm beginning to suspect there's something else going on. Other than cables, the power supply and perhaps the controllers, what else could be causing this? I have also seen unusual error messages in the logs that might be related to CPU overheating, but I would think that if the CPU throttles back, there shouldn't be any issues.
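For anyone wanting to check whether the drive drop-outs and the possible overheating messages show up together in the logs, here is a rough sketch of how the syslog could be tallied. The patterns are assumptions based on messages the Linux kernel commonly prints for SATA link resets, ATA exceptions, block I/O errors and thermal throttling; I have not confirmed they match what my own log contains, and the label names are made up for illustration.

```python
import re
from collections import Counter

# ASSUMED patterns: typical Linux kernel wording, not verified against this server's log.
# Labels (dict keys) are hypothetical names chosen for this sketch.
PATTERNS = {
    "ata_link_reset": re.compile(r"ata\d+(?:\.\d+)?: (?:hard|soft) resetting link"),
    "ata_exception": re.compile(r"ata\d+(?:\.\d+)?: exception Emask"),
    "io_error": re.compile(r"I/O error"),
    "cpu_throttled": re.compile(r"temperature above threshold, cpu clock throttled"),
}


def classify_line(line):
    """Return the label of the first pattern that matches, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count how many lines fall under each label; unmatched lines are ignored."""
    counts = Counter()
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] += 1
    return counts
```

Something like `summarize(open("/var/log/syslog", errors="replace"))` would then give the counts (the log path is an assumption). If `cpu_throttled` and the `ata_*` entries cluster around the same timestamps, that would at least hint at a connection; if not, heat is probably a separate issue.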

 

Thoughts and help much appreciated!

 

 

  • 2 weeks later...

I decided to try replacing the entire data cable harness first. It took a while to get in, but I've now replaced it and I'm rebuilding the failed disks from parity. I will keep this post updated.

 

UPDATE:

Both hard drives were rebuilt from parity without any issue, but I am only cautiously optimistic. If memory serves, this is exactly what happened last time: the drives were only knocked offline during a scheduled parity check.

 

UPDATE 2:

Yesterday, a rogue process pegged all cores/threads at 100% utilization. (I suspect it was the Emby transcoder.) I stopped all VMs and Dockers to no avail. When I attempted to stop the array, unRAID was unable to unmount the shares and the webgui stopped responding. I attempted to shut down via the command line, but in the end I had to do a hard shutdown. Upon reboot, unRAID detected the unclean shutdown, so I started a parity check. During that time the CPU usage was nominal, and the check finished overnight with no issues. Again, I'm pretty sure the drives get knocked offline during a scheduled parity check, so I will report back after the next one (the first of the month). TL;DR: Ran a parity check after an unclean shutdown and it completed with no issues.
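Next time this happens I'd like to confirm which process is actually responsible instead of guessing. A minimal sketch, assuming the output of `ps -eo pid,pcpu,comm` has been captured as text; the function name `top_cpu` is just something made up for this example. Keep in mind that the `%CPU` figure from `ps` is averaged over the lifetime of the process, so it is only a rough indicator of what is busy right now.

```python
def top_cpu(ps_output, n=5):
    """Return the n busiest processes as (pcpu, pid, command) tuples.

    Expects text in the layout printed by `ps -eo pid,pcpu,comm`:
    one header line, then PID, %CPU and command name per line.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)  # command names may contain spaces
        if len(parts) < 3:
            continue
        pid, pcpu, comm = parts
        try:
            rows.append((float(pcpu), int(pid), comm.strip()))
        except ValueError:
            continue  # ignore lines that do not parse as numbers
    rows.sort(reverse=True)  # highest CPU share first
    return rows[:n]
```

On the server itself, the text could be captured with `subprocess.run(["ps", "-eo", "pid,pcpu,comm"], capture_output=True, text=True).stdout` and passed straight to `top_cpu`.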

 

FINAL UPDATE:

Parity check finished with 0 errors. Marking this as solved, even though I'm still concerned that the problem occurs during heavy reads and writes to and from the SSD cache pool while a parity check is running.

