unRAID doesn't always detect my 10G NIC



Sometimes when I restart unRAID, it only detects my 1G network. On the first reboot, it never detects the 10G NIC. If I then hit "reboot" again, it usually detects it, about 90% of the time. If that fails, I reboot once more, and that has never failed to detect it. Once it is detected, I can go months without any issues on the 10G network. I'm on the latest unRAID.

 

This makes me believe there's some sort of timeout issue, where the 10G NIC can't be detected unless some amount of time has passed. The 10G NIC is a Mellanox ConnectX-2, directly connected to a computer with another Mellanox ConnectX-2.
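One way to narrow down a "not detected" boot is to check whether the card shows up on the PCI bus at all, or whether it is enumerated and only the network interface is missing. The following is a minimal sketch, assuming Python 3 is installed on the server and that the card reports the Mellanox PCI vendor ID (0x15b3). The function name `find_pci_devices` is made up for illustration.

```python
from pathlib import Path

# PCI vendor ID used by Mellanox cards (assumption: the ConnectX-2 reports this ID).
MELLANOX_VENDOR_ID = "0x15b3"


def find_pci_devices(vendor_id=MELLANOX_VENDOR_ID, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses of all devices whose sysfs 'vendor' file matches vendor_id."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    found = []
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            # Entry without a readable vendor file; skip it.
            continue
        if vendor == vendor_id.lower():
            found.append(dev.name)
    return found


# Example call on the server after a boot where the 10G NIC is missing:
#   print(find_pci_devices())
# An empty list would suggest the card was not enumerated on the bus;
# a non-empty list would point more toward the driver or interface setup.
```

If the list is empty on a bad boot, the problem is likely below the operating system (slot, firmware, power-up timing); if the card is listed but there is no interface, the syslog lines from the mlx4 driver are the next place to look.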

 

Here's a fresh reboot where it did not detect it:

syslog.txt

Edited by tyrindor

Here is something interesting, I experienced a similar problem with the same model card recently. I have 2 of them, used them both for 6 months, connected to a LB4M. In one server, about a week ago, the card disappeared from it while running. The server then complained that eth0 was missing (as it is set to that.) I rebooted and it was not in the device list. I figured the card died. But as a check, I moved it to another slot and it worked. After needing to upgrade a video card a few days later, I cycled power again, and again it disappeared. A few days later, and after some data transfers via the backup networking, I cycled the power again, and it again came back.

 

On a nearly identical machine, that card has had no problems ever.

 

So, maybe it is a flaky card? Not sure. Time will tell.

Link to comment

So I reformatted my Windows 10 PC this week, and now the same thing is happening on Windows with an entirely different ConnectX card. I tried swapping the cable, no luck. There's no way it's two faulty cards, and it's a direct connection between the two cards.

 

On Windows, the connection is either instant or won't come up for 30-40 seconds after a restart/boot. Then it works flawlessly. On unRAID, when the NIC isn't detected, I have to reboot for it to connect.

 

Any clues? I'm close to ditching my 10G network, because when this happens my mapped drives hang my entire Windows system on boot for about 30 seconds.

Edited by tyrindor

I'm looking through my event log in Windows and I see these errors:

 

The File Transfer (SMB) performance may be affected as Network Direct functionality is not supported in ConnectX-2 firmware version.

 

Mellanox ConnectX-2 Ethernet Adapter device reports that the "QOS (ETS) capability is missing". The current firmware does not support the QOS (ETS) capability. Please burn the latest firmware and restart your machine. (The issue is reported in Function SetHardwareAssistAttributes).

 

SingleFunc_4_0_0: RoCE v2.0 mode was requested, but it is not supported. The NIC starts in RoCE v1.0 mode. 
 NOTE: If your environment contains mix of different NIC types, you need to make sure that the whole environment is configured to use the same RoCE mode, 
 otherwise the traffic between the different NICs does not work.

 

The following boot-start or system-start driver(s) did not load: 
dam

 

Does this mean anything to you? Last I checked, I had the latest firmware.

Edited by tyrindor
  • 2 weeks later...

This is definitely caused by something in my unRAID server, but I am not sure why I am the only one experiencing it. I had a friend stop by who's a network administrator and troubleshoots stuff like this for businesses every day. He was here for 6 hours and was completely baffled.

 

During the first 60-120 seconds after a fresh Windows boot:

- You can ping the 10G IP even when you can't access the 10G server via IP in Windows.

- You can access the WebUI via the 10G IP when you can't access the 10G server via IP in Windows.

- Clicking on the mapped drives or browsing to the IP directly either results in a timeout or asks for a username/password. My password doesn't work, and it still does this even if my shares are set to public.

- After 60-120 seconds or so of both systems being up, the mapped drives start working and you can browse to the 10G IP with no password needed.

- Logging off and back on at the Windows computer doesn't trigger the problem, because the NIC connection isn't broken. The Windows computer must be restarted or shut off, which breaks the 10G link and causes this issue.

- If I disconnect my 1G network completely from both systems, it connects faster, bringing the 60-120 second delay down to about 5-10 seconds. This makes no sense because it's a completely different network.
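The observations above separate "the host is reachable" (ping, WebUI) from "SMB works". A small probe run from the Windows PC right after boot could put numbers on that gap. This is only a sketch, assuming Python 3 is installed on the PC, that the WebUI listens on TCP port 80 and SMB on TCP port 445, and using the eth1 address 192.168.1.101 that appears in the log below. The function names are made up for illustration.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host, ports, deadline_s=180.0, interval_s=1.0, probe=port_open):
    """Poll each port until it accepts a connection or the deadline passes.

    Returns {port: seconds until the first successful connect, or None}.
    """
    start = time.monotonic()
    first_seen = {p: None for p in ports}
    while True:
        elapsed = time.monotonic() - start
        for p in ports:
            if first_seen[p] is None and probe(host, p):
                first_seen[p] = round(time.monotonic() - start, 1)
        if all(v is not None for v in first_seen.values()) or elapsed >= deadline_s:
            return first_seen
        time.sleep(interval_s)


# Example call right after the Windows PC boots:
#   print(wait_for_ports("192.168.1.101", [80, 445]))
```

Note that a successful connect to port 445 only shows the SMB port is reachable. If 445 opens immediately but the shares still prompt for a password, the delay is more likely in name resolution or session setup than in the link itself.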

 

This is the log during some of this testing:

Jun  3 23:08:37 UNRAID kernel: mlx4_en: eth1: Link Down
Jun  3 23:08:39 UNRAID ntpd[1693]: Deleting interface #3 eth1, 192.168.1.101#123, interface stats: received=0, sent=0, dropped=0, active_time=97 secs
Jun  3 23:09:02 UNRAID kernel: mlx4_en: eth1: Link Up
Jun  3 23:09:02 UNRAID kernel: mlx4_en: eth1: Link Down
Jun  3 23:09:02 UNRAID kernel: mlx4_en: eth1: Link Up
Jun  3 23:09:04 UNRAID ntpd[1693]: Listen normally on 4 eth1 192.168.1.101:123
Jun  3 23:09:04 UNRAID ntpd[1693]: new interface(s) found: waking up resolver
Jun  3 23:29:50 UNRAID kernel: mlx4_en: eth1: Link Down
Jun  3 23:29:51 UNRAID ntpd[1693]: Deleting interface #4 eth1, 192.168.1.101#123, interface stats: received=0, sent=0, dropped=0, active_time=1247 secs
Jun  3 23:30:09 UNRAID kernel: mlx4_en: eth1: Link Up
Jun  3 23:30:11 UNRAID ntpd[1693]: Listen normally on 5 eth1 192.168.1.101:123
Jun  3 23:30:11 UNRAID ntpd[1693]: new interface(s) found: waking up resolv
Jun  4 01:43:02 UNRAID kernel: mlx4_en: eth1: Link Down
Jun  4 01:43:03 UNRAID ntpd[1693]: Deleting interface #5 eth1, 192.168.1.101#123, interface stats: received=0, sent=0, dropped=0, active_time=7972 secs
Jun  4 02:27:09 UNRAID kernel: mdcmd (68): spindown 1
Jun  4 08:41:21 UNRAID kernel: mlx4_en: eth1: Link Up
Jun  4 08:41:23 UNRAID ntpd[1693]: Listen normally on 6 eth1 192.168.1.101:123
Jun  4 08:41:23 UNRAID ntpd[1693]: new interface(s) found: waking up resolver
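To see how long the 10G link was actually down each time, the Link Down / Link Up pairs in a log like this can be turned into durations. A minimal sketch, assuming the syslog line format shown above. Syslog timestamps carry no year, so the `year` argument is just a placeholder, and the function name `link_outages` is made up for illustration.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
#   Jun  3 23:08:37 UNRAID kernel: mlx4_en: eth1: Link Down
LINK_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) \S+ kernel: mlx4_en: (\w+): Link (Up|Down)"
)


def link_outages(lines, iface="eth1", year=2000):
    """Return (down_time, up_time, seconds) for each Link Down followed by a Link Up."""
    outages = []
    down_at = None
    for line in lines:
        m = LINK_RE.match(line)
        if not m or m.group(2) != iface:
            continue
        # Collapse the padded day ("Jun  3") and add the placeholder year before parsing.
        stamp = " ".join(m.group(1).split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if m.group(3) == "Down":
            down_at = when
        elif down_at is not None:
            outages.append((down_at, when, int((when - down_at).total_seconds())))
            down_at = None
    return outages


# Example call on a saved copy of the log:
#   with open("syslog.txt") as f:
#       for down, up, secs in link_outages(f):
#           print(down, "->", up, f"{secs} s")
```

On the excerpt above this gives outages of 25 s, 0 s, 19 s and 25099 s (just under 7 hours). In every case ntpd re-adds eth1 about 2 seconds after the link returns, so the server side seems to bring the interface back quickly once the link is up.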

 

-----

 

His first theory was that unRAID wasn't the local master, but he was surprised to find that it was. We tried two Windows PCs with the same NICs and had no problems, so the cards aren't faulty. unRAID seems to be deleting the 10G interface when it doesn't detect a computer on the other side; when it does detect one, it tries to re-establish the interface, but it takes a while for SMB to start working. We tried all 4 PCI slots on the server motherboard.

 

I'm clueless why this worked flawlessly for over a year and then started happening as soon as I reformatted Windows 10 on the PC connected to the server. To me it still seems like some type of local master issue.

 

I am really out of ideas other than dropping $700-$800 on a 10G switch and hoping it fixes it.

 

Edited by tyrindor
