Joseph — Posted September 6, 2017

I've been chasing issues like this for months and still can't find the root cause. It seems to happen during heavy reads and writes to and from the SSD cache pool. Yesterday two drives were knocked offline and marked as bad during heavy I/O on the unRAID cache SSDs. The first failure happened in the middle of the night while the mover was migrating data off the cache AND jdownloader2 was writing to it; the second happened the next morning while I was using jdownloader and unzipping files through a VM AND accessing shares over the network. The strange thing is that neither of these HDDs appeared to be written to or accessed at the time.

I've tried swapping data cables and double-checking power connectors, but the problem stays with at least one drive, which is designed for large-disk NAS environments. I recently checked SMART on all drives and everything was OK at the time, so I'm beginning to suspect something else is going on. Other than cables, the power supply, and perhaps the controllers, what else could be causing this? I've also seen unusual log messages that might point to CPU overheating, but I'd think that if the CPU throttles back there shouldn't be any issues. Thoughts and help much appreciated!
JorgeB — Posted September 6, 2017

Nothing jumps out. If the problem stays with one disk, you should try replacing it; healthy SMART does not always equal a healthy disk.
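One way to dig deeper than the overall pass/fail verdict is to inspect individual SMART attributes with smartctl (from smartmontools, which unRAID includes). A minimal sketch; the /dev/sdX device names are examples, and a climbing UDMA_CRC_Error_Count in particular usually points to a bad data cable rather than a dying disk:

```shell
# Print overall health plus the attributes that most often predict trouble.
# Run as root; adjust the glob to match your system's device names.
for d in /dev/sd[a-z]; do
    [ -e "$d" ] || continue                  # skip if the glob didn't match
    echo "=== $d ==="
    smartctl -H "$d"                         # overall health self-assessment
    smartctl -A "$d" | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'
done
```

If the CRC count on one drive keeps rising after a cable swap, the port or controller becomes the suspect instead.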
Joseph — Posted September 17, 2017 (Author)

I decided to try replacing the entire data-cable harness first. It took a while to get it in, but it's replaced now and I'm rebuilding the failed disks from parity. Will keep this post updated.

UPDATE: Both hard drives were rebuilt via parity without any issue. However, I'm only cautiously optimistic; if memory serves, this is exactly what happened last time, and the drives were only knocked offline during a scheduled parity check.

UPDATE 2: Yesterday a rogue process hammered the load on all cores/threads to 100% utilization (I suspect it was the Emby transcoder). I stopped all VMs and Dockers to no avail. When I attempted to stop the array, unRAID was unable to unmount the shares and the webGUI stopped responding. I tried to shut down from the command line, but in the end I had to hard-shutdown. On reboot, unRAID detected the unclean shutdown, so I started a parity check. CPU usage was nominal the whole time and the check finished overnight with no issues. Again, I'm fairly sure the drives get knocked offline during a scheduled parity check, so I will report back after the next one (the first of the month).

TL;DR: Ran a parity check after an unclean shutdown and it completed with no issues.

FINAL UPDATE: Parity check finished with 0 errors. Marking this as solved, even though I'm still concerned it happens when doing heavy reads/writes to and from the SSD cache pool while a parity check is running.
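For anyone hitting the same wall: next time a runaway process pegs every core, it can usually be identified and stopped from the unRAID console before resorting to a hard shutdown. This is a generic Linux sketch, not anything unRAID-specific, and the Emby/ffmpeg mention is just an example of what the culprit might be:

```shell
# List the five biggest CPU consumers, sorted descending by %CPU.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6

# If the top entry is a transcoder (e.g. ffmpeg spawned by Emby),
# take its PID from the listing above and stop it gracefully first:
#   kill <PID>         # SIGTERM: ask it to exit
#   kill -9 <PID>      # SIGKILL: only if it ignores SIGTERM
```

Killing the single offending process is far gentler on the array than a hard power-off, which forces the unclean-shutdown parity check described above.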