[Feature Request] Perform a clean shutdown if disk reaches critical temp



That is indeed a scenario which may happen if the disk is actively in use. It will not spin the disk down again, though, unless the temperature starts to climb.

You'd be looking at a spin down followed by an immediate spin up (or a slight delay, depending upon when the next read / write actually takes place).  Odds are that the drive would never actually spin down long enough to let the temp drop, and repeated spin ups / downs / ups / downs can't be good for the drive at all.

 

The temperature doesn't need to drop: a subsequent spin down command is only issued when the temperature has gone up, though this may lead to a never-ending story ...
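To make that rule concrete, here is a minimal sketch of "only spin down again when the temperature has gone up". The class name, the 60 C critical value and the reset once the disk drops below critical are all assumptions for illustration, not how unRAID behaves.

```python
class SpinDownTrigger:
    """Hypothetical helper: fire a spin down only on a new temperature high."""

    def __init__(self, critical_c):
        self.critical_c = critical_c
        self.last_trigger_c = None  # temperature at the last spin down command

    def should_spin_down(self, temp_c):
        if temp_c < self.critical_c:
            # Assumption: forget the previous high once the disk is below critical.
            self.last_trigger_c = None
            return False
        if self.last_trigger_c is None or temp_c > self.last_trigger_c:
            self.last_trigger_c = temp_c
            return True
        # Still hot, but no hotter than last time: do not repeat the command.
        return False


trigger = SpinDownTrigger(critical_c=60)
decisions = [trigger.should_spin_down(t) for t in (58, 61, 61, 60, 62)]
```

With readings of 58, 61, 61, 60 and 62, only the 61 and the 62 produce a spin down command, which shows both the benefit (no repeats at a steady temperature) and the "never-ending story" if the disk keeps creeping up.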

Link to comment

I think we need to be clear that we are only catering for environmental problems and not inherent system design issues, i.e. "it's too hot a day" and not "if I spin up 10 disks on any day, the system overheats".

 

Once we accept that, it is obvious that once something overheats it will likely be hours before the room cools again. Since we don't know the air temperature universally or reliably, we can't use this factor, so we have to play it safe and spin down indefinitely until someone intervenes manually.

Link to comment

To me, the easiest solution is to provide one more setting on the Display Settings page, immediately after the current "Critical disk temperature threshold", for "Shutdown disk temperature threshold".  You really don't need a delay or any other option, because if it reaches the shutdown temp, then you have already received the Warning temp notification, and then the Critical temp notification.  If a temp reaches the Shutdown temp, powerdown is called.  I would default it to a high number like 80, so it does not surprise anyone, and I think we can all agree a temp of 80 should shut the server down.  Most would set the number considerably lower.  This should be easy to implement.

 

You mean something like this?

 

That's a nice enhancement for the Critical Temp, but I would still prefer an additional temp threshold for Shutdown Temp.  It's easier to understand I think, and should be set high enough that there's no question about what should happen, high enough that you don't care about what might be going on or might be interfered with, you just want the system off.  Then drop the Shutdown option from the Critical temp choices, and add a choice to Stop Array, to Critical Temp actions.  All threshold triggers would send notifications, including the action that will be taken.  For spin downs, you might add a 60 second delay, which should spread them out if repeated.

 

If a user wants, they can always set Shutdown Temp to be only one or two degrees hotter than Critical Temp.  Just my opinion, but I would probably set the 3 temps to 45, 60, and 70, with the Critical Temp action set to Stop Array.  If you have a way to wait for the array to be fully stopped, then send a Spin All Down, that would be a nice bonus!
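For illustration only, the three-threshold idea could be sketched like this. The `Action` names and the `action_for` function are made up for the example; the 45 / 60 / 70 values and the Stop Array action at Critical Temp are the ones suggested above, and nothing here reflects actual unRAID code.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    WARN = "send Warning notification"
    STOP_ARRAY = "send Critical notification, stop array"
    SHUTDOWN = "send Shutdown notification, power down"


def action_for(temp_c, warning=45, critical=60, shutdown=70):
    """Return the action for the highest threshold the temperature has reached."""
    if temp_c >= shutdown:
        return Action.SHUTDOWN
    if temp_c >= critical:
        return Action.STOP_ARRAY
    if temp_c >= warning:
        return Action.WARN
    return Action.NONE


for temp in (40, 47, 62, 71):
    print(temp, action_for(temp).value)
```

Checking the highest threshold first means a disk that jumps straight past Critical still gets the Shutdown action rather than the milder one.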

Link to comment

One special case has to be considered here. If a drive is reporting a high temperature due to a sensor problem or something similar, you could create a scenario in which unRaid would shut down immediately after the reboot and give the user no opportunity to diagnose the problem.

Link to comment

How likely is that scenario compared to, say, someone really having a hot disk and it failing? To me this is just a problem of getting the logging right.

 

Could also occur if someone fat fingers the value and saves it. Poof, server shutdown and won't come back up! :)

Link to comment


Maybe this "feature" could be disabled if one entered safe mode?

Link to comment

This time of year, my home office reaches high 80s late afternoon. Especially on parity check day (like today), I get a temp warning from one disc (&^% 5TB tosh). Reminds me to put a floor fan down in front of it until it finishes. I'd hate to have it shut down if I wasn't here to be reminded. :P

 

These big 6TB discs have to churn for nearly 20 hours straight once a month. They get kinda warm.

Link to comment

One special case has to be considered here. If a drive is reporting a high temperature due to a sensor problem or something similar, you could create a scenario in which unRaid would shut down immediately after the reboot and give the user no opportunity to diagnose the problem.

Do you have any ideas to avoid that?  The user would have received notifications about the exact drive, probably multiple notifications.  I do know it's possible, as I have one drive right now where the reported temp can bounce from 62 to 94, currently reporting 61 but seems about the same temp (mid 30's) as the others.  The drive has issues, but I still use it for low value old videos, in a Windows station.

 

Could also occur if someone fat fingers the value and saves it. Poof, server shutdown and won't come back up! :)

I don't think this is a problem, because the temp settings could be checked for validity before being accepted (e.g. Critical temp must be at least one degree above Warning temp, and Shutdown temp must be at least one degree above Critical temp, or the settings aren't accepted/saved).  You would have had to fat finger all of them.
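A rough sketch of that validity check, assuming the three settings arrive as plain integers in Celsius (the function name and messages are invented for the example):

```python
def validate_thresholds(warning, critical, shutdown):
    """Return a list of problems; an empty list means the settings may be saved."""
    problems = []
    if critical < warning + 1:
        problems.append("Critical temp must be at least one degree above Warning temp")
    if shutdown < critical + 1:
        problems.append("Shutdown temp must be at least one degree above Critical temp")
    return problems


print(validate_thresholds(45, 60, 70))  # sensible values, nothing to report
print(validate_thresholds(45, 60, 7))   # fat-fingered Shutdown temp is rejected
```

A mistyped Shutdown temp of 7 instead of 70 fails the second rule, so it never gets saved and cannot trigger a shutdown loop.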

 

This time of year, my home office reaches high 80s late afternoon.

Since drive temps are always reported in Celsius, we tend to always do so too.  Your high 80's would be low 30's to the drives, much lower than the 40's to 70's we are talking about. My suggested default of 70 is about 158 F.
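For anyone who wants to check the conversions, the standard formulas are easy to run (the function names are just for this example):

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def f_to_c(fahrenheit):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


print(c_to_f(70))            # a 70 C shutdown threshold in Fahrenheit
print(round(f_to_c(88), 1))  # an 88 F room in Celsius
```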

 

Disabling an overtemp shutdown might be a good idea in Safe Mode.  And before shutting down the first time, it could disable auto-start of the array.

Link to comment

Do you have any ideas to avoid that?  The user would have received notifications about the exact drive, probably multiple notifications.  I do know it's possible, as I have one drive right now where the reported temp can bounce from 62 to 94, currently reporting 61 but seems about the same temp (mid 30's) as the others.  The drive has issues, but I still use it for low value old videos, in a Windows station.

 

Safe mode is an interesting idea, but entering safe mode most likely has nothing to do with temperature, and I'm not sure we'd want this feature disabled for safe mode.

 

One simple idea is to document how to manually enter a setting file to change the critical temperature or turn off the feature.

 

Another idea is to disable the feature for some period of time from boot (say 10 mins) to give the user the opportunity to change their configuration before it shuts down the server.
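The boot grace period could look something like the sketch below. It assumes a Linux system where `/proc/uptime` holds the uptime in seconds as its first field; the function names and the way the check would be wired into a shutdown decision are hypothetical.

```python
GRACE_SECONDS = 10 * 60  # the "say 10 mins" window suggested above


def read_uptime_seconds(path="/proc/uptime"):
    """Read seconds since boot from the first field of /proc/uptime (Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])


def shutdown_allowed(uptime_seconds, grace_seconds=GRACE_SECONDS):
    """Allow an over-temperature shutdown only once the grace period has passed."""
    return uptime_seconds >= grace_seconds


print(shutdown_allowed(30))    # half a minute after boot: still in the grace period
print(shutdown_allowed(3600))  # an hour after boot: shutdown may go ahead
```

Notifications would presumably still be sent during the grace period; only the shutdown itself is held back so the user has time to fix the setting.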

Link to comment
  • 5 months later...
  • 1 month later...

I honestly think this feature would be wonderful to have, especially for those of us who do not expose our systems to the net so we can shut them down remotely. I understand there are some logistics to this, as not all systems are the same. However, this would be a nice form of insurance if a fan goes bad and no owner is around to correct the problem before it becomes severe!

Link to comment
  • 1 year later...
  • 2 weeks later...
On 25-6-2015 at 10:12 AM, Fireball3 said:

How about removing the disk load?

That means pausing/stopping the load generating process.

Idle disks shouldn't run hot even if cooling has failed.

 

Interesting... 

 

It does not sound too difficult to GLOBALLY exclude a disk the moment it shows some kind of misery...

 

Global exclusion could be triggered by temp, but also by SMART values, a read error, etc...

 

Does not sound like a bad idea actually and the basic functionality is already there..

 

(Of course this would only work if people use user shares... disk shares could still be written to.)

Edited by Helmonder
Link to comment
  • 3 years later...

I would be interested in seeing this come to fruition for scenarios where cooling is normally adequate but a sudden failure leads to skyrocketing temps.

 

I had a power outage where my main server is hosted. The server stayed up on battery, but when the power came back the AC was not switched back on and a parity check kicked off.

 

This led to my disks running at 60-62C for the whole night until I woke up and saw the 50+ alerts from UnRAID and shut the server down. Every single one of my disks reached 60C at one time or another.

 

I'm thinking about stronger fans that can move more air as well, but in a locked DC closet with limited airflow and no AC, I think a shutdown would always be the safer option.

Edited by weirdcrap
Link to comment
16 minutes ago, JorgeB said:

That only helps if it's doing a parity check.

Well in this case that would have helped me a bit. The drives were toasty but not overheating before the parity check kicked off. 

 

I'll look into that plugin as I apparently can't trust people to remember to turn on the flippin air conditioner after a power outage.

 

It would have saved me from several of my drives toasting themselves out of warranty coverage (the older Reds have a max op temp of 60C).

 

Does anyone have experience with RMA'ing drives that have overheated? Is that something WD normally checks? I'm just trying to educate myself for the future on how likely I am to be screwed by this little incident if I end up trying to RMA some of these down the road.

Edited by weirdcrap
Link to comment
4 minutes ago, JorgeB said:

For this last case yes, my issue was without a check running, a fan in a 5in3 cage just stopped.

Yeah I would be interested in both your original use case (failed fan) and my fringe case (failed AC and parity check kick off).

 

I have 4x 5in3 drive cages with separate fans in my second server and would definitely be interested in having the ability to stop the array or shut the system down if one of those failed and my drives started heating up real bad.

 

To put my mind at ease a bit, when you had this happen to you @JorgeB did you notice the "cooked" drives failed at a higher rate than the others?

 

I'm waiting for someone to get into the DC and check the AC before my system gets powered up, so right now I'm just doing a lot of reading on overheated drives and possible issues I may encounter.

Edited by weirdcrap
Link to comment

It would be easy enough to write a plugin that detects whether drives are overheating.   Whether that should just send a notification or force an automated shutdown is an interesting philosophical point.   I guess it could be configurable.
 

I could easily enhance the parity tuning plugin to send a notification on drives overheating even if no parity check is running.   Not sure if this would be the right way to handle this - a separate plugin seems a better solution.  I might look into putting together such a plugin if sufficient interest is shown as I should already have all the bits required to assemble it.  Time to start thinking of a good name for such a plugin :)
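Purely as an illustration of the detection part (this is not the plugin's code), a check could read each drive's temperature and compare it with a limit. The sketch assumes a smartmontools version with JSON output (`smartctl --json -A /dev/sdX`) whose report carries a `temperature.current` field; the function names, the sample reports and the 60 C limit are assumptions for the example.

```python
import json


def current_temp_c(smartctl_json_text):
    """Pull the current temperature out of a smartctl JSON report, or None."""
    report = json.loads(smartctl_json_text)
    return report.get("temperature", {}).get("current")


def overheating(temps_by_disk, limit_c=60):
    """Return the sorted names of disks at or above the limit."""
    return sorted(
        disk for disk, temp in temps_by_disk.items()
        if temp is not None and temp >= limit_c
    )


# Hand-written sample reports standing in for real smartctl output.
samples = {
    "sdb": '{"temperature": {"current": 61}}',
    "sdc": '{"temperature": {"current": 38}}',
    "sdd": '{}',  # a drive that reports no temperature at all
}
temps = {disk: current_temp_c(text) for disk, text in samples.items()}
print(overheating(temps))
```

Whether the result feeds a notification or a shutdown is exactly the configurable part discussed above, so the sketch stops at returning the list of hot disks.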

Link to comment
