Server freezing/locking up completely



Some backstory: I was running 5.0.5 and had about 200 days of uptime when I noticed my server had dropped off the network during a file write. Telnet and HTTP were both dead. I connected over IPMI and the screen was showing the typical Unraid login, but was totally unresponsive. I rebooted, ran a parity sync that showed no sync mismatches, and ran an md5 check of all my files, which a week later showed that everything hashed correctly. I figured it was an anomaly.

 

I went ahead and upgraded to 6.0.0 since the final had just come out. Everything was going well, and at about 30 days of uptime I decided to start converting drives from reiserfs to xfs, since reiserfs was doing the annoying "drives are almost full so I don't want to write large files to fill them totally up" thing. About halfway through converting my disks, again during a write, Unraid went unresponsive. Telnet and HTTP were both dead. I connected over IPMI and the screen was again showing the typical Unraid login, but was totally unresponsive. I rebooted, ran a parity sync, and got a bunch of sync errors, and now it's going to take another week to md5 hash everything. Annoying. This is obviously becoming a worrisome trend.

 

I ran memtest and everything was OK there. I run Unraid without any plugins or virtualization, totally stock. I'm currently running tail -f /var/log/syslog and waiting. Is waiting for another freeze all I can do? Every crash takes a solid week to hash check everything, so I'd love some ideas to be a little bit more, uh, proactive. The common thread seems to be freezes during writes to the array. Any ideas? Thanks so much!
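One thing that might help in the meantime: a plain tail -f only lives in the terminal, and /var/log/syslog is gone after the forced reboot. Below is a minimal Python sketch that copies new syslog lines to a file on persistent storage and fsyncs after every batch, so whatever was logged just before a freeze has a better chance of surviving. The function names and the destination path in the usage note are hypothetical, it does not handle log rotation, and a hard lockup may well log nothing at all.

```python
import os
import time


def drain(src, dst):
    """Copy everything currently unread in src to dst and force it to disk.

    Returns the number of characters copied (0 if nothing new was logged).
    """
    chunk = src.read()
    if chunk:
        dst.write(chunk)
        dst.flush()              # push Python's buffer to the OS
        os.fsync(dst.fileno())   # ask the OS to commit it to the device
    return len(chunk)


def follow(src_path, dst_path, poll_seconds=1.0):
    """Roughly `tail -f src_path >> dst_path`, but fsyncing every batch."""
    with open(src_path, "r", errors="replace") as src, open(dst_path, "a") as dst:
        src.seek(0, os.SEEK_END)  # start at the current end of the log
        while True:
            if drain(src, dst) == 0:
                time.sleep(poll_seconds)  # nothing new yet, wait and retry
```

Usage would be something like follow("/var/log/syslog", "/boot/logs/syslog-persist.txt"), assuming the flash drive is mounted at /boot and that directory exists. Every batch is a write to the flash drive, so this is probably best kept running only while chasing the problem.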

 

Specs:

CPU: Intel G3220

Motherboard: Supermicro X10SL7-F

RAM: 4 GB ECC

Power Supply: Corsair RM650


When there is no response at all (no GUI, no telnet) the first suspect is the network connection.

 

Do you have a fixed IP address or do you use DHCP? In the latter case it may lose or change its IP address over time.

 

Did you check the status of the Ethernet port itself when using IPMI? Is the port up, with proper settings?

 


Unraid has a fixed IP, as does IPMI, and they share the same port/cable. I'm able to connect to IPMI when the server freezes, so I don't think it's a network thing. The server runs headless, but I can see Unraid's screen using the Java console redirect, and it's frozen: neither my keyboard nor the virtual keyboard does anything.

 

Syslog attachment is incoming, but it obviously doesn't show what happened before the forced reboot.


Saw this in my Supermicro event log. Perhaps the freezing is a RAM issue after all?

 

Event Log: 12 event entries

Event ID  Time Stamp           Sensor Name  Sensor Type  Description
1         2015/01/13 11:40:54  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
2         2015/01/13 11:40:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
3         2015/05/22 03:47:07  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
4         2015/05/22 03:47:07  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
5         2015/06/06 00:32:30  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
6         2015/06/06 00:32:30  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
7         2015/07/16 19:16:52  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
8         2015/07/16 19:16:52  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
9         2015/07/19 19:24:55  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
10        2015/07/19 19:24:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
11        2015/07/24 14:06:29  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
12        2015/07/24 14:06:29  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted

 


Locked up hard again about 30 GB into a copy. So it's happening more often now (only a few days of uptime this time). There is one more event in the Supermicro log, but it's from yesterday, so not from when it froze up.

 

Event Log: 14 event entries

Event ID  Time Stamp           Sensor Name  Sensor Type  Description
1         2015/01/13 11:40:54  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
2         2015/01/13 11:40:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
3         2015/05/22 03:47:07  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
4         2015/05/22 03:47:07  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
5         2015/06/06 00:32:30  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
6         2015/06/06 00:32:30  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
7         2015/07/16 19:16:52  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
8         2015/07/16 19:16:52  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
9         2015/07/19 19:24:55  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
10        2015/07/19 19:24:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
11        2015/07/24 14:06:29  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
12        2015/07/24 14:06:29  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
13        2015/07/30 1:30:44   OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
14        2015/07/30 1:30:44   OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted

 

I was running a tail in my browser and console, but it failed to capture anything unusual (I've attached it anyway), so I think that points to a hardware problem, correct? I'm at a complete loss as to what to do.

 

Basically, since I've owned the server it's been 200 days uptime > freeze, 30 days uptime > freeze, 7 days uptime > freeze, and this last time 4 days uptime > freeze. So no issues until the last month and a half, and since then it's really had issues. Ouch.
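To put numbers on that trend from the BMC side, here is a rough Python sketch that parses the event log as pasted in this thread and prints the number of days between consecutive uncorrectable ECC events. The column layout (ID, timestamp, sensor name, sensor type, description) is an assumption taken from the paste, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Event log rows as pasted from the Supermicro IPMI page in this thread.
EVENT_LOG = """\
1  2015/01/13 11:40:54  OEM  Memory  Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
2  2015/01/13 11:40:55  OEM  Memory  Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
3  2015/05/22 03:47:07  OEM  Memory  Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
4  2015/05/22 03:47:07  OEM  Memory  Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
5  2015/06/06 00:32:30  OEM  Memory  Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
6  2015/06/06 00:32:30  OEM  Memory  Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
7  2015/07/16 19:16:52  OEM  Memory  Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
8  2015/07/16 19:16:52  OEM  Memory  Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
9  2015/07/19 19:24:55  OEM  Memory  Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
10  2015/07/19 19:24:55  OEM  Memory  Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
11  2015/07/24 14:06:29  OEM  Memory  Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
12  2015/07/24 14:06:29  OEM  Memory  Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
13  2015/07/30 1:30:44  OEM  Memory  Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
14  2015/07/30 1:30:44  OEM  Memory  Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
"""

# Assumed layout: ID, "YYYY/MM/DD H:MM:SS", sensor name, sensor type, description.
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d{4}/\d{2}/\d{2} \d{1,2}:\d{2}:\d{2})\s+\S+\s+\S+\s+(.*)$"
)


def parse_events(text):
    """Return a list of (timestamp, description) tuples, one per log row."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            when = datetime.strptime(match.group(2), "%Y/%m/%d %H:%M:%S")
            events.append((when, match.group(3).strip()))
    return events


def days_between_incidents(events, keyword="Uncorrectable"):
    """Days between consecutive distinct dates whose description has keyword."""
    dates = sorted({when.date() for when, desc in events if keyword in desc})
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


print(days_between_incidents(parse_events(EVENT_LOG)))
```

The gaps shrink from months to a few days, which lines up with the freezes getting closer together. All it shows is timing, though; it doesn't prove the ECC events are what locks the box up.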

 

Edit: I was just running memtest and another memory event appeared in the log. Does ECC RAM generate errors in memtest? I also thought ECC RAM was supposed to prevent crashes due to memory errors?
