tyrindor Posted May 22, 2017

Sometimes when I restart unRAID, it only detects my 1G network. On the first reboot it never detects the 10G NIC. If I hit "reboot" again, it usually detects it, about 90% of the time. If that second reboot fails, one more reboot has always worked. Once it's detected, I can go months without any issues on the 10G network. Latest unRAID. This makes me believe there's some sort of timeout issue, where my 10G NIC can't be detected unless a certain amount of time has passed. The 10G NIC is a Mellanox ConnectX-2, directly connected to a computer with another Mellanox ConnectX-2. Here's a fresh reboot where it did not detect it: syslog.txt
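When the card goes missing like this, a quick way to tell whether it vanished from the PCI bus entirely (pointing at BIOS/slot/hardware) or was enumerated but failed driver init (pointing at mlx4) is to check lspci. A minimal sketch, assuming a Linux host and that the card identifies itself as a Mellanox device in lspci output:

```shell
#!/bin/sh
# Sketch: check whether the ConnectX-2 is visible on the PCI bus after
# boot. If it is absent here, the mlx4 driver never had a chance to load.
nic_detected() {
  # $1: lspci output; succeeds if a Mellanox device is listed
  printf '%s\n' "$1" | grep -qi 'mellanox'
}

# Only probe real hardware when lspci is available on this machine
if command -v lspci >/dev/null 2>&1; then
  if nic_detected "$(lspci)"; then
    echo "ConnectX-2 detected on PCI bus"
  else
    echo "NIC missing from PCI bus -- check 'dmesg | grep -i mlx4' next"
  fi
fi
```

If the device shows up in lspci but the interface is still missing, the dmesg output for mlx4 is the next place to look.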
JorgeB Posted May 22, 2017

Not much help, but the problem is not unRAID. I have the same NIC on 4 of my servers and it always works; in your case Linux is not detecting it. Try a different slot, look for a BIOS upgrade, etc.
tyrindor Posted May 22, 2017

The mobo is a Supermicro X9SCM. All 4 PCIe slots are populated with either this NIC or a SAS2LP card. Latest BIOS. I'll try swapping the SAS2LP and the NIC between slots.
1812 Posted May 23, 2017

Here is something interesting: I experienced a similar problem with the same model card recently. I have 2 of them, used them both for 6 months, connected to an LB4M. In one server, about a week ago, the card disappeared while running. The server then complained that eth0 was missing (as the card is assigned to eth0). I rebooted and it was not in the device list. I figured the card had died, but as a check I moved it to another slot and it worked. After needing to upgrade a video card a few days later, I cycled power again, and again it disappeared. A few days later, after some data transfers via the backup networking, I cycled the power again and it came back. On a nearly identical machine, that card has had no problems ever. So, maybe it is a flaky card? Not sure. Time will tell.
saarg Posted May 23, 2017

Have you tried changing the PCIe generation in the BIOS for the slot the card is in?
JorgeB Posted May 23, 2017

Quoting tyrindor: "The mobo is a Supermicro X9SCM." One of mine is also on an X9SCM, so maybe a flaky card.
tyrindor Posted May 26, 2017

So I reformatted my Windows 10 PC this week, and now the same thing is happening on Windows with an entirely different ConnectX card. I tried swapping the cable, no luck. No way it's two faulty cards... and it's a direct connection between the two cards. On Windows, the link either comes up instantly or won't connect for 30-40 seconds after a restart/boot; after that it works flawlessly. On unRAID, when it doesn't detect the card, I have to reboot for it to connect. Any clues? I'm close to ditching my 10G network, because when this happens my mapped drives hang my entire Windows system on boot for about 30 seconds.
JorgeB Posted May 26, 2017

They work great for me. The only thing is that if I restart an unRAID server while my Windows desktop is on, the link will stay down; I just have to right-click the NIC, disable it, and re-enable it to bring it back up.
tyrindor Posted May 26, 2017

I'm looking through my event log in Windows and I see these errors:

- The File Transfer (SMB) performance may be affected as Network Direct functionality is not supported in ConnectX-2 firmware version.
- Mellanox ConnectX-2 Ethernet Adapter device reports that the "QOS (ETS) capability is missing". The current firmware does not support the QOS (ETS) capability. Please burn the latest firmware and restart your machine. (The issue is reported in Function SetHardwareAssistAttributes).
- SingleFunc_4_0_0: RoCE v2.0 mode was requested, but it is not supported. The NIC starts in RoCE v1.0 mode. NOTE: If your environment contains mix of different NIC types, you need to make sure that the whole environment is configured to use the same RoCE mode, otherwise the traffic between the different NICs does not work.
- The following boot-start or system-start driver(s) did not load: dam

Does this mean anything to you? Last I checked, I had the latest firmware.
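Those event-log messages all point at the card's firmware level, so it's worth confirming what the card actually reports. On Linux, mstflint (Mellanox's open-source firmware tool) can query the card, typically with something like `mstflint -d 02:00.0 query`. This sketch only parses the version out of query-style output; the exact "FW Version" field label is an assumption about the tool's output format, so verify it against your version of mstflint:

```shell
#!/bin/sh
# Sketch: extract the firmware version from mstflint query-style output
# so it can be compared against the latest release for the ConnectX-2.
# The "FW Version" field name is an assumption about mstflint's output.
fw_version() {
  # $1: mstflint query output; prints just the version string
  printf '%s\n' "$1" | sed -n 's/^FW Version:[[:space:]]*//p'
}
```

Usage would be something like `fw_version "$(mstflint -d 02:00.0 query)"`, substituting your card's PCI address.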
tyrindor Posted May 26, 2017

Well, I am out of ideas, unless these cards are both dying or something; it happens on both Windows and unRAID, with two different cables. Here are my Windows/unRAID settings for the direct connection... does everything seem right?
JorgeB Posted May 26, 2017

You should enable jumbo frames for better performance, but that won't have anything to do with your issues.
tyrindor Posted May 26, 2017

I actually just disabled them, to test whether that was why it was happening.
tyrindor Posted June 4, 2017

This is definitely caused by something in my unRAID server, but I am not sure why I am the only one experiencing it. I had a friend stop by who's a network administrator and troubleshoots stuff like this for businesses every day. He was here for 6 hours and was completely baffled. During the first 60-120 seconds after a fresh Windows boot:

- You can ping the 10G IP even when you can't access the 10G server via that IP in Windows.
- You can access the WebUI via the 10G IP even when you can't access the 10G server via that IP in Windows.
- Clicking on the mapped drives or browsing to the IP directly either times out or asks for a username/password. My password doesn't work, and it still does this even when my shares are set to public.
- After 60-120 seconds or so of both systems being up, the mapped drives start working and you can browse to the 10G IP with no password needed.
- If I only log off and back on the Windows computer, the problem doesn't happen, because the NIC link isn't broken. The Windows computer must restart or be shut off, which breaks the 10G link and causes this issue.
- If I disconnect the 1G network completely from both systems, it connects faster, bringing the 60-120 second delay down to about 5-10 seconds. That makes no sense, because it's a completely different network.
This is the log during some of this testing:

Jun 3 23:08:37 UNRAID kernel: mlx4_en: eth1: Link Down
Jun 3 23:08:39 UNRAID ntpd[1693]: Deleting interface #3 eth1, 192.168.1.101#123, interface stats: received=0, sent=0, dropped=0, active_time=97 secs
Jun 3 23:09:02 UNRAID kernel: mlx4_en: eth1: Link Up
Jun 3 23:09:02 UNRAID kernel: mlx4_en: eth1: Link Down
Jun 3 23:09:02 UNRAID kernel: mlx4_en: eth1: Link Up
Jun 3 23:09:04 UNRAID ntpd[1693]: Listen normally on 4 eth1 192.168.1.101:123
Jun 3 23:09:04 UNRAID ntpd[1693]: new interface(s) found: waking up resolver
Jun 3 23:29:50 UNRAID kernel: mlx4_en: eth1: Link Down
Jun 3 23:29:51 UNRAID ntpd[1693]: Deleting interface #4 eth1, 192.168.1.101#123, interface stats: received=0, sent=0, dropped=0, active_time=1247 secs
Jun 3 23:30:09 UNRAID kernel: mlx4_en: eth1: Link Up
Jun 3 23:30:11 UNRAID ntpd[1693]: Listen normally on 5 eth1 192.168.1.101:123
Jun 3 23:30:11 UNRAID ntpd[1693]: new interface(s) found: waking up resolv
Jun 4 01:43:02 UNRAID kernel: mlx4_en: eth1: Link Down
Jun 4 01:43:03 UNRAID ntpd[1693]: Deleting interface #5 eth1, 192.168.1.101#123, interface stats: received=0, sent=0, dropped=0, active_time=7972 secs
Jun 4 02:27:09 UNRAID kernel: mdcmd (68): spindown 1
Jun 4 08:41:21 UNRAID kernel: mlx4_en: eth1: Link Up
Jun 4 08:41:23 UNRAID ntpd[1693]: Listen normally on 6 eth1 192.168.1.101:123
Jun 4 08:41:23 UNRAID ntpd[1693]: new interface(s) found: waking up resolver

His first theory was that unRAID wasn't the local master, but he was surprised to find it was. We tried two Windows PCs with the same NIC cards and had no problems (so the cards aren't faulty). unRAID seems to be deleting the 10G NIC's interface when it doesn't detect a computer on the other side; when it does detect one, it re-establishes the interface, but it takes a while for SMB to start working. We tried all 4 PCIe slots on the server motherboard.
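The repeated Link Down/Link Up pairs in a log like the one above can be counted to quantify how often the link actually bounces. A small sketch that reads syslog text on stdin; the `eth1` interface name is taken from the log above and may differ on another setup:

```shell
#!/bin/sh
# Sketch: count mlx4 link drops in a syslog excerpt read from stdin.
# "eth1" matches the interface in the log above; adjust if yours differs.
link_flaps() {
  grep -c 'mlx4_en: eth1: Link Down'
}
```

Typical usage: `link_flaps < /var/log/syslog`, or pipe in the saved syslog.txt from the first post.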
Clueless why this was working flawlessly for over a year, yet it started happening as soon as I reformatted Windows 10 on the PC connected to the server. To me it still seems like some type of local master issue. I am really out of ideas, other than dropping $700-$800 on a 10G switch and hoping it fixes it.
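To put a number on that 60-120 second SMB delay, one option is to time how long after boot the server's SMB port (445) starts answering. A sketch that assumes bash is available for its /dev/tcp feature (run it from any Linux box on the 10G link, or from the Windows side under WSL); the 192.168.1.101 address is the server IP seen in the log above:

```shell
#!/bin/bash
# Sketch: measure seconds until a host's SMB port (445) starts accepting
# TCP connections, to quantify the post-boot delay described above.
port_open() {
  # $1: host, $2: port; uses bash's /dev/tcp, so no external tools needed
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

wait_for_smb() {
  # $1: host; prints seconds elapsed until TCP/445 answers, max 180 s
  t=0
  while ! port_open "$1" 445; do
    t=$((t + 1))
    if [ "$t" -ge 180 ]; then echo "timeout"; return 1; fi
    sleep 1
  done
  echo "$t"
}
```

Running `wait_for_smb 192.168.1.101` immediately after a Windows restart would show whether the delay tracks the link renegotiation in the syslog or something at the SMB layer.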
Albahttiti Posted March 31, 2018

Hey, I seem to have the same issue. Did you find a solution? @tyrindor