Joseph — Posted September 6, 2017

I've been chasing issues like this for months and still can't find the root cause. It seems to happen during heavy reads and writes to and from the SSD cache pool. Yesterday two drives were knocked offline and marked as bad during heavy I/O on the unRAID cache SSDs. The first failure happened in the middle of the night while the mover was migrating data off the cache AND jdownloader2 was writing to it; the second happened the next morning while I was using jdownloader and unzipping files through a VM AND accessing shares over the network. The strange thing is that neither of these HDDs appeared to be written to or accessed at the time.

I've tried swapping data cables and double-checking power connectors, but the problem stays with at least one drive, which is designed for large-disk NAS environments. I recently checked SMART on all drives and everything was OK at the time, so I'm beginning to suspect something else is going on. Other than cables, the power supply, and perhaps the controllers, what else could be causing this? I've also seen unusual log messages that might point to CPU overheating, but I'd think that if the CPU throttles back there shouldn't be any issues. Thoughts and help much appreciated!
JorgeB — Posted September 6, 2017

Nothing jumps out. If the problem stays with one disk, you should try replacing it; healthy SMART does not always equal a healthy disk.
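One way to dig deeper than the overall pass/fail verdict is to inspect individual SMART attributes with smartctl (from smartmontools, which unRAID includes). A minimal sketch; the /dev/sdX device names are examples, and a climbing UDMA_CRC_Error_Count in particular usually points to a bad data cable rather than a dying disk:

```shell
# Print overall health plus the attributes that most often predict trouble.
# Run as root; adjust the glob to match your system's device names.
for d in /dev/sd[a-z]; do
    [ -e "$d" ] || continue                  # skip if the glob didn't match
    echo "=== $d ==="
    smartctl -H "$d"                         # overall health self-assessment
    smartctl -A "$d" | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
done
```

If the CRC count on one drive keeps rising after a cable swap, the port or controller becomes the suspect instead.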
Joseph — Posted September 17, 2017 (Author)

I decided to try replacing the entire data-cable harness first. It took a while to get it in, but it's replaced now and I'm rebuilding the failed disks from parity. Will keep this post updated.

UPDATE: Both hard drives were rebuilt via parity without any issue. However, I'm only cautiously optimistic; if memory serves, this is exactly what happened last time, and the drives were only knocked offline during a scheduled parity check.

UPDATE 2: Yesterday a rogue process hammered the load on all cores/threads to 100% utilization (I suspect it was the Emby transcoder). I stopped all VMs and Dockers to no avail. When I attempted to stop the array, unRAID was unable to unmount the shares and the webGUI stopped responding. I tried to shut down from the command line, but in the end I had to hard-shutdown. On reboot, unRAID detected the unclean shutdown, so I started a parity check. CPU usage was nominal the whole time and the check finished overnight with no issues. Again, I'm fairly sure the drives get knocked offline during a scheduled parity check, so I will report back after the next one (the first of the month).

TL;DR: Ran a parity check after an unclean shutdown and it completed with no issues.

FINAL UPDATE: Parity check finished with 0 errors. Marking this as solved, even though I'm still concerned it happens when doing heavy reads/writes to and from the SSD cache pool while a parity check is running.
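For anyone hitting the same wall: next time a runaway process pegs every core, it can usually be identified and stopped from the unRAID console before resorting to a hard shutdown. This is a generic Linux sketch, not anything unRAID-specific, and the Emby/ffmpeg mention is just an example of what the culprit might be:

```shell
# List the five biggest CPU consumers, sorted descending by %CPU.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6

# If the top entry is a transcoder (e.g. ffmpeg spawned by Emby),
# take its PID from the listing above and stop it gracefully first:
#   kill <PID>         # SIGTERM: ask it to exit
#   kill -9 <PID>      # SIGKILL: only if it ignores SIGTERM
```

Killing the single offending process is far gentler on the array than a hard power-off, which forces the unclean-shutdown parity check described above.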