btlupin

Sudden random reboots UnRaid 4.5


Hi,

 

I am on 4.5 using the original ASUS motherboard. I am suddenly getting random reboots at different times of the day. I have tried to capture the syslog between reboots, but this has not worked well so far. In my go script I have some code from Purko that I am using to copy the syslog to the flash:

 

mv -f /boot/flashlog.current /boot/flashlog.last 2>/dev/null

cat /var/log/syslog > /boot/flashlog.current

echo "*.debug  /boot/flashlog.current" >> /etc/syslog.conf

/etc/rc.d/rc.syslog restart

 

which seems to work, but I think that multiple reboots cause flashlog.last to be overwritten. By the time I notice that the system has rebooted (a parity check is running), neither the last log nor the current log shows anything that points to a problem. Both logs look like a normal syslog after a reboot. I have a UPS, so I don't think that the server is losing power.

 

Is there a way to put a timestamp in the syslog file name so I can track the reboots better?

 

The system has been very stable otherwise.

 

Thanks,

 

Roland
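
On the timestamp question: one way to keep a separate log per boot would be to put the boot time in the file name. This is only a sketch built on the go-script lines quoted above. The flashlog_name helper and the YYYYMMDD-HHMMSS name format are assumptions of mine, and it has not been run on unRAID 4.5. The format avoids colons because the FAT-formatted flash drive does not allow them in file names.

```shell
# Hypothetical helper: build a per-boot log path such as /boot/flashlog.20100223-204827
# $1 = directory on the flash drive
flashlog_name() {
    echo "$1/flashlog.$(date +%Y%m%d-%H%M%S)"
}

# In the go script, in place of the mv/cat/echo lines from the original snippet:
#   LOG=$(flashlog_name /boot)
#   cat /var/log/syslog > "$LOG"
#   echo "*.debug  $LOG" >> /etc/syslog.conf
#   /etc/rc.d/rc.syslog restart
```

Each boot would then leave its own file on the flash, so a second reboot no longer overwrites the log from the first. Old files would have to be cleaned up by hand from time to time.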


This is a classic sign of a memory or mobo problem.

 

Even though it has been stable, a memory cell could have gone bad, some corrosion could have affected a contact, etc.

 

1) Pull the ram, swap the slots (if more than one) and reseat it well.

 

2) Run memtest for several passes.

 

3) Pull and reseat all expansion cards.


Memtest ran for 19 passes without any errors or reboots. If it were a hardware problem, I would expect the computer to reboot regardless of whether memtest or unRAID is running. I am stumped.



Not always true...

If it is a problem with a piece of hardware not involved with the memory test it could easily not show up when testing only the memory. 

 

In the same way, if a power supply was marginal, or if a memory strip was sensitive to voltage variations, it might not act up until you put a different load on the power supply, such as spinning up a set of disk drives.

 

It sounds as if your memory is working OK, but to be absolutely certain, verify that its voltage setting, timings, and clock speed are within the ranges recommended by the manufacturer for your specific brand and model of memory.

 

Then, proceed as instructed, power down and re-seat the boards...  then, run a

tail -f /var/log/syslog

in a console window until it crashes once more.  It could still be almost anything, including running out of RAM.


Unfortunately, the output from tail -f /var/log/syslog didn't give much info:

 

Feb 23 18:15:06 Nas ntpd[1544]: synchronized to 192.121.13.59, stratum 2

Feb 23 19:03:46 Nas kernel: mdcmd (443): spindown 0

Feb 23 19:03:48 Nas kernel: mdcmd (444): spindown 3

Feb 23 19:24:03 Nas kernel: mdcmd (585): spindown 4

Feb 23 19:24:14 Nas kernel: mdcmd (587): spindown 8

Feb 23 19:27:05 Nas kernel: mdcmd (608): spindown 1

Feb 23 19:27:06 Nas kernel: mdcmd (609): spindown 2

Feb 23 19:27:07 Nas kernel: mdcmd (610): spindown 6

Feb 23 19:27:08 Nas kernel: mdcmd (611): spindown 7

Feb 23 20:48:27 Nas kernel: mdcmd (1183): spindown 5

 

and then the server rebooted without any other additional messages. I have attached the last syslog, though I don't see any obvious error messages.

unraid_last_syslog.zip


Wow, I don't see anything unusual either.

 

I'd check power supply connections first, especially if the power supply is shutting off.  I'd also check airflow and temperatures, in case something is shutting down in self-defense. Perhaps a fan has stopped spinning?

 

Joe L.


Just because it passes a memory test does not mean the hardware is in perfect working order.

 

I'll repeat:

 

Look at the power supply and motherboard for signs of capacitor plague.

 


I looked at the capacitors on the motherboard and they look OK (no bulging or leaking), but I am not sure about the PSU. I would have to remove it and open it, which I am hesitant to do until I have tried everything else.

 

Roland


I moved the server off the UPS and into the room where I work so I could monitor the reboots better, and since moving it there have been no reboots. We haven't had any power outages, and the UPS/powerdown script should take care of that type of problem anyway, so I don't know what to think.

 

Roland



I would think you have an intermittent connection somewhere, and moving the server has (temporarily) made it less sensitive to heat/voltage/vibration than it was in its prior location.

 

Did you follow the advice given earlier by BubbaQ to power down and re-seat all expansion boards, memory strips, cables?


Yes, I did check the connections and re-seated the SATA card, but the reboots continued until moving the server off the ups and into a different room. I think that I will try running the server without the ups at its original location to see if the reboots come back or not.


Ah ha.  What make/model of UPS and what make/model of PSU?


The PSU is an Antec Neo 380 W and the UPS is an APC Back-UPS CS 650VA (230 V). I have been using it for about 9 months now. We don't have a lot of power outages, but during the one that we have had, the UPS worked as it should and shut down the server correctly thanks to the powerdown script. So far only the unRAID server is hooked up to the UPS.



How many hard-disks in your server?  Did you recently add one?

 

I ask because I looked up the power supply and it seems to have three 12 volt rails.

Typically, one is dedicated to the motherboard connectors, one to expansion cards, and the remaining one to the molex connectors for disks.

 

Not sure how you have yours connected, or how you have split the load, but it might be marginal.

 

Joe L.


That PSU is not very good, and it doesn't like the stepped sine wave of the UPS.

 

Joe is right... get a PSU with a single 12 V rail.


I have had the same setup for about 3 years: 8 data drives, 1 parity drive and 1 cache drive. I have no idea how the load is split across the rails, but it has worked up to now. Any recommendations for a good PSU that can handle 10 to 12 drives?

 

Roland


A Corsair 400 W, 450 W, or 550 W should do you nicely.  Consult a PSU calculator to figure out exactly which one you need.

 

Or, if you want to save some money, the Antec Earthwatts series is also quite good.  It comes in 380 W, 430 W, and 500 W versions.  Note that all of these have two +12 V rails, so the Corsairs are better.

 

For reference, I'm currently running 8 hdds (some green, some not) on a 380 W Antec Earthwatts.
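
Before reaching for a PSU calculator, a rough sanity check of the 12 V budget is to multiply the number of drives by the per-drive spin-up current. The sketch below only does that arithmetic. The spinup_ma helper is a made-up name, and the 2000 mA per drive in the example is an assumed placeholder, not a measured figure, so take the real value from the drive datasheets.

```shell
# Estimate peak 12 V current in milliamps when all drives spin up at once.
# $1 = number of drives
# $2 = spin-up current per drive in mA (assumption; check the drive datasheet)
spinup_ma() {
    echo $(( $1 * $2 ))
}

# Example with the 10 drives mentioned above and an assumed 2000 mA each:
#   spinup_ma 10 2000
```

The result can then be compared with the 12 V rating printed on the PSU label, keeping in mind that on a multi-rail supply the drives may all hang off a single rail.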


I ordered a Corsair VX550W. The server still hasn't rebooted, and it's been running all day. This type of sudden problem, after years of working and without any syslog info, is difficult to analyze. Hopefully the new PSU is money well spent and solves the problem.

 

Roland


Did you recently add the UPS or has it been in the loop the whole time?


I have been using it for the last 9 months. I don't mind getting a new psu, since I was always worried that it was on the limit of what it could drive.



Another possibility, although remote, is that the voltage at the power outlet where you had the server is lower than where you have it now, or that it shares the circuit with some power-hungry appliance.

Not necessarily... the UPS can be set to a very wide tolerance of under/over voltages to pass through.


Inexpensive UPSes do not compensate for minor voltage fluctuations.

 

For example, based on the "Status" report from my APC Back-UPS-750 UPS, my current line voltage is 122.0 volts.

According to the status report, it will transfer from line supply to the UPS only if the line voltage goes below 91.0 volts or above 139.0 volts.

 

If I had a power supply in my server that could not handle brown-out conditions, it is likely it too would go out of regulation if the line voltage dropped to 95 volts.   (The UPS would not yet kick in at that voltage)

 

More expensive, high-end commercial UPSes will have voltage regulation built in, but very few marketed for home use will.

For most others, there will always be a range of voltages which they will simply pass through to the equipment connected to them.

 

Joe L.
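
To make the pass-through window concrete, here is a small sketch using the 91.0 V and 139.0 V transfer thresholds from the status report quoted above. The ups_action helper is a hypothetical name, and voltages are given in tenths of a volt so the shell can stay with integer comparisons. Other UPS models and settings will have different thresholds.

```shell
# Transfer thresholds from the status report above, in tenths of a volt
LOW=910     # 91.0 V
HIGH=1390   # 139.0 V

# Print what the UPS does for a given line voltage (tenths of a volt).
ups_action() {
    if [ "$1" -lt "$LOW" ] || [ "$1" -gt "$HIGH" ]; then
        echo "battery"
    else
        echo "pass-through"
    fi
}

# Example calls:
#   ups_action 1220   # normal line voltage, 122.0 V
#   ups_action 950    # brown-out at 95.0 V
```

With these thresholds a 95 V brown-out is still inside the window, so the UPS hands it straight to the PSU, which is the case described in the post.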

 


Copyright © 2005-2017 Lime Technology, Inc. unRAID® is a registered trademark of Lime Technology, Inc.