Unreliable and Unresponsive


russell


Unraid seems to be the perfect tool for our business. We need nice, reliable bunch-of-disks storage and some VMs. We purchased the product after a trial period, during which we had some issues. We were so committed to this that we purchased a new motherboard and CPU: a Gigabyte GA-X170-WS with a Xeon CPU and 16GB RAM. We have 10TB of storage.

We have had the same problems with both the first build and the second build.

 

We are a computer repairs and retail business with a high degree of hardware skills and solid networking skills. Our network is first class in its build and extremely stable on all of our Windows computers.

 

We have tried:

  • New MB and CPU (total of 2 completely different MB/CPU configs)
  • 3 different NIC chipsets (currently on the Intel NIC built into the motherboard)
  • 2 different RAM configurations
  • 2 different USB boot sticks with fresh installs of Unraid
  • Different network switches
  • Different patch points on the network and different cables
  • Connecting directly to our DrayTek router, or connecting directly to one of our gigabit switches
  • Single NICs and bridged NICs (bound together)

 

Frustratingly, we still get exactly the same symptoms.

 

Symptoms

The RAID is randomly not accessible to browse on our network, either by IP address or computer name.

The GUI is mostly not accessible from any browser on any machine in our network.

 

A cold reboot fixes these issues for about 4 hours, sometimes more, often less.

 

Finally...

We have paid for this as a sign of our commitment to getting this working, and surely it must work! We urgently need serious help from the developers to get this up and going. Otherwise we waste the many hours and $$ we have put in so far, and our business is currently (trying) to rely on our new server.

 

Diagnostics now attached below...

max-diagnostics-20171120-1316.zip

Edited by russell
Link to comment

When unRAID is started up, you can log in from the console and type

diagnostics

The resulting zip file will be in the logs directory on the flash drive. If you can't get it via the flash share, you can shut down unRAID and pop the USB stick into another PC to get the diagnostics file.

I don't think anybody can give you help without any more details.
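If the stick ends up in another PC, a small script can pick out the newest diagnostics archive and show what is in it before it gets attached to a post. This is only a sketch: the `*-diagnostics-*.zip` pattern is an assumption based on the file names attached in this thread, and `newest_diagnostics` and `list_contents` are made-up helper names, not unRAID tools.

```python
import zipfile
from pathlib import Path


def newest_diagnostics(logs_dir):
    """Return the most recently modified *-diagnostics-*.zip in logs_dir, or None."""
    # Assumption: archives are named like <server>-diagnostics-<date>-<time>.zip
    candidates = sorted(
        Path(logs_dir).glob("*-diagnostics-*.zip"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def list_contents(zip_path):
    """Return the file names stored inside the diagnostics archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()
```

Point `newest_diagnostics` at the logs directory of the mounted stick (the drive letter or mount point depends on the PC), then pass the result to `list_contents`.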

 

Link to comment

You have several disks with SMART issues, such as reallocated sectors, end-to-end errors etc.

At least two drives have - or have had - transfer issues between drive and controller card.

 

In the system log, one of the drives had transfer errors and had the connection reset. Serial number Z4Z8GE7V, so at least that drive needs to have its cable checked.

Nov 19 13:35:46 Max kernel: ata5.00: ATA-9: ST2000DM006-2DM164,             Z4Z8GE7V, CC26, max UDMA/133
Nov 19 13:35:46 Max kernel: ata5.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA

Nov 19 13:41:18 Max kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Nov 19 13:41:18 Max kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Nov 19 13:41:18 Max kernel: ata5: SError: { UnrecovData 10B8B BadCRC }
Nov 19 13:41:18 Max kernel: ata5.00: failed command: READ DMA EXT
Nov 19 13:41:18 Max kernel: ata5.00: cmd 25/00:40:e0:a8:7f/00:05:00:00:00/e0 tag 10 dma 688128 in
Nov 19 13:41:18 Max kernel:         res 50/00:00:df:a8:7f/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Nov 19 13:41:18 Max kernel: ata5.00: status: { DRDY }
Nov 19 13:41:18 Max kernel: ata5: hard resetting link
Nov 19 13:41:19 Max kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Nov 19 13:41:24 Max kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Nov 19 13:41:24 Max kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Nov 19 13:41:24 Max kernel: ata5: SError: { UnrecovData 10B8B BadCRC }
Nov 19 13:41:24 Max kernel: ata5.00: failed command: READ DMA EXT
Nov 19 13:41:24 Max kernel: ata5.00: cmd 25/00:40:d0:60:85/00:05:00:00:00/e0 tag 13 dma 688128 in
Nov 19 13:41:24 Max kernel:         res 50/00:00:cf:60:85/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
Nov 19 13:41:24 Max kernel: ata5.00: status: { DRDY }
Nov 19 13:41:24 Max kernel: ata5: hard resetting link
Nov 19 13:41:24 Max kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

...
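For anyone wanting to check their own syslog for the same pattern, here is a rough sketch that maps each ataN port to the drive model and serial it reported, then counts SError flags and link resets per port. The regular expressions are assumptions fitted to the log lines quoted above, and `summarize` is a made-up name; other kernels may format these messages differently.

```python
import re
from collections import Counter

# Assumed formats, modelled on the lines quoted above:
#   "... kernel: ata5.00: ATA-9: ST2000DM006-2DM164,   Z4Z8GE7V, CC26, max UDMA/133"
IDENT = re.compile(r"kernel: (ata\d+)\.\d+: ATA-\d+: ([^,]+),\s*([^,\s]+),")
#   "... kernel: ata5: SError: { UnrecovData 10B8B BadCRC }"
SERROR = re.compile(r"kernel: (ata\d+): SError: \{ ([^}]*)\}")
#   "... kernel: ata5: hard resetting link"
RESET = re.compile(r"kernel: (ata\d+): hard resetting link")


def summarize(lines):
    """Return (drives, flags, resets) gathered from syslog lines.

    drives: {port: (model, serial)}
    flags:  Counter keyed by (port, SError flag)
    resets: Counter keyed by port
    """
    drives, flags, resets = {}, Counter(), Counter()
    for line in lines:
        m = IDENT.search(line)
        if m:
            drives[m.group(1)] = (m.group(2).strip(), m.group(3))
            continue
        m = SERROR.search(line)
        if m:
            for flag in m.group(2).split():
                flags[(m.group(1), flag)] += 1
            continue
        m = RESET.search(line)
        if m:
            resets[m.group(1)] += 1
    return drives, flags, resets
```

Feed it the lines of the syslog from the diagnostics zip. A port that keeps collecting BadCRC flags and hard resets is the one whose cable and connectors deserve a look first.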

 

Some hours later, lots of strange things seem to happen. Possibly because of hardware. Possibly because of BIOS settings compared to the requirements for forwarding hardware to Docker. But lots of different programs start to fail:
 

Nov 19 18:59:35 Max kernel: traps: notify[6576] general protection ip:2adab837c249 sp:7ffe719445a0 error:0
Nov 19 18:59:35 Max kernel: in libcrypto.so.1.0.0[2adab8261000+227000]

Nov 20 02:16:13 Max kernel: notify[12247]: segfault at 1000f25038 ip 0000000000622354 sp 00007ffd11ee7f60 error 4 in php[400000+724000]
Nov 20 02:17:18 Max kernel: traps: php[12725] general protection ip:605221 sp:7ffc74a513e0 error:0
Nov 20 02:17:18 Max kernel: in php[400000+724000]
Nov 20 02:17:18 Max kernel: traps: php[12727] general protection ip:607460 sp:7ffdcc5f0058 error:0
Nov 20 02:17:18 Max kernel: in php[400000+724000]
Nov 20 02:17:20 Max kernel: traps: php[12737] general protection ip:6291ff sp:7ffd6aca2dc0 error:0
Nov 20 02:17:20 Max kernel: in php[400000+724000]
Nov 20 02:17:20 Max kernel: traps: php[12745] general protection ip:62659f sp:7ffda5b51cc0 error:0
Nov 20 02:17:20 Max kernel: in php[400000+724000]
Nov 20 02:17:23 Max kernel: traps: php[12757] general protection ip:2b70f14b5cc0 sp:7fffd5d76738 error:0
Nov 20 02:17:23 Max kernel: in libcurl.so.4.4.0[2b70f14ab000+6d000]
Nov 20 02:17:23 Max kernel: traps: php[12758] general protection ip:60b44d sp:7ffff7e530e0 error:0
Nov 20 02:17:23 Max kernel: in php[400000+724000]
Nov 20 02:17:23 Max kernel: php[12766]: segfault at 1000ebe1a8 ip 00000000006221c2 sp 00007ffd88c43e00 error 4 in php[400000+724000]
Nov 20 02:17:23 Max kernel: traps: php[12768] general protection ip:62659f sp:7ffeff58c460 error:0
Nov 20 02:17:23 Max kernel: in php[400000+724000]
Nov 20 02:17:26 Max kernel: php[12779]: segfault at 0 ip 00000000005b8c51 sp 00007fffd2046610 error 4 in php[400000+724000]
Nov 20 02:17:27 Max kernel: traps: php[12787] general protection ip:605221 sp:7fff37162d10 error:0
Nov 20 02:17:27 Max kernel: in php[400000+724000]
Nov 20 03:38:43 Max kernel: traps: smartctl[2171] general protection ip:2b2ff6c5c68d sp:7ffc4a718940 error:0
Nov 20 03:38:43 Max kernel: in libc-2.24.so[2b2ff6b75000+1bd000]
Nov 20 03:38:56 Max kernel: traps: cpuload[1848] general protection ip:464c60 sp:7ffd593a0fc0 error:0
Nov 20 03:38:56 Max kernel: in bash[400000+ff000]
Nov 20 04:23:50 Max kernel: traps: modprobe[2713] general protection ip:401f90 sp:7fff9d5fdb28 error:0
Nov 20 04:23:50 Max kernel: in kmod[400000+21000]
Nov 20 04:28:55 Max kernel: traps: modprobe[2800] general protection ip:401fd0 sp:7ffdf7ba1b78 error:0
Nov 20 04:28:55 Max kernel: in kmod[400000+21000]
Nov 20 04:38:51 Max kernel: traps: smartctl[2890] general protection ip:2b0cd7e7d592 sp:7ffd699a6300 error:0
Nov 20 04:38:51 Max kernel: in libc-2.24.so[2b0cd7df7000+1bd000]
Nov 20 05:08:55 Max kernel: traps: smartctl[3185] general protection ip:2af168f6094b sp:7ffcc3a94390 error:0
Nov 20 05:08:55 Max kernel: in libstdc++.so.6.0.21[2af168e94000+16b000]
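To see at a glance how widely the crashes are spread, something like the following can tally them per program. It is a sketch only: the two patterns are assumptions based on the "traps: ... general protection" and "segfault at" lines quoted above, and `count_crashes` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed formats, modelled on the lines quoted above:
#   "... kernel: traps: php[12725] general protection ip:605221 ..."
TRAP = re.compile(r"kernel: traps: (\S+?)\[\d+\] general protection")
#   "... kernel: notify[12247]: segfault at 1000f25038 ip ..."
SEGV = re.compile(r"kernel: (\S+?)\[\d+\]: segfault at")


def count_crashes(lines):
    """Count general-protection faults and segfaults per program name."""
    crashes = Counter()
    for line in lines:
        m = TRAP.search(line) or SEGV.search(line)
        if m:
            crashes[m.group(1)] += 1
    return crashes
```

If the result shows several unrelated programs (php, smartctl, modprobe and so on) rather than one repeat offender, that points more towards hardware such as memory than towards a bug in any single program.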

 

Alas, it is late in my time zone now, so no time to spend more time digging through logs.

Link to comment

Have re-cabled the SATA backplane on the 5 main hard disks, with new SATA and power cables.

Rebooted into GUI mode: RAID up, interface working on the Unraid machine itself, however access to shares and the interface via the network is still not functioning.

Latest diagnostics attached here...

max-diagnostics-20171120-1729.zip

 

NOW running a boot time memory test

Edited by russell
Link to comment

First step is to make sure the disks work well and, as recommended, do a longer memtest.

 

As long as you don't need to pass through hardware to docker containers you shouldn't normally need to worry about optimizing BIOS settings, unless you have a motherboard where the manufacturer defaults to try to slightly overclock the hardware.

Link to comment

I notice that the machine is configured to be local master for the Windows workgroup.

 

Are all other machines happy with this, or might you have multiple machines configured as master?

 

Having machines fight for which one is master can make it hard to find other machines and their shares on the local network.

Link to comment

I have now tried setting master to NO, and tested. It seemed better in terms of getting to the shares from other Windows computers. However today (the day after the previous posts) we have 2 or 3 computers that cannot see the shares, and when they can, the shares drop off quickly. I have done a fresh diagnostics, which is attached.

 

The memory test showed faulty memory. We have replaced it and tested again. This also seems to have made the web GUI more reliable, but not 100% available. On the old memory it was available about 20% of the time; now on the new memory it's around 70% of the time.

 

max-diagnostics-20171121-1551.zip

Link to comment

Just a suggestion, but not real smart to start a topic in CAPITAL LETTERS saying 'unreliable and unstable' in a pre-sales support thread and calling yourself "a computer repairs and retail business with a high degree of hardware skills, and solid networking skills. Our network is first class in its build and extremely stable on all of our windows computers" while building a computer with known faulty parts and not testing the new parts. Running a Windows network without understanding what a local browser is and what workgroup names are, DNS forwarding to an Australian DNS while the time zone is set to Los Angeles. All over the map and a mess.

Link to comment

I had a lot of words for you two here, but not worth it. This is a support forum: try being supportive, or don't comment at all.

 

In that spirit, I am still having stability issues with the GUI and the shares. I have changed the RAM, and it tested all fine with the memory test on boot. Hmmm, new RAM, still GUI issues.

 

Oh, can someone tell me if stability is affected by the time zone? We are testing, not in production yet; the time zone will get there eventually, but do the disks only work when the sun comes up???

Link to comment
3 minutes ago, russell said:

can someone tell me if stability is affected by the time zone

no

3 minutes ago, russell said:

I am still having stability issues

You should ideally update to 6.3.5.  Beyond that, due to the nature of how the webUI operates on 6.3, it is not as responsive, and under certain circumstances can appear to hang.  6.4 solves all of that, and IMHO everyone should upgrade to it.

 

4 minutes ago, russell said:

and the Shares

Windows networking sucks. The problem with shares dropping isn't limited to unRaid, but is a pervasive problem across the spectrum (and this happens even on Windows boxes talking to other Windows boxes).

 

Many solutions are posted here (and on many other forums), talking about Local Masters, registry hacks, etc., but when push comes to shove, the rock solid solution is to:

- Assign every computer on your network a static IP (ideally via the router)

- Completely forget about hostnames, and refer to everything via the IP address

- Add a Quick Access shortcut to your server.  You'll never look back.

Link to comment

Thanks SQUID. Great suggestions. We will be happy to try updating to that version. We initially selected what was available as a stable build, rather than the latest.

As a repair business, we have a ton of files and resources stored for use on computers that we repair. Up to 15 computers at a time are on the workshop bench being repaired. Since these are other people's computers, setting manual networking settings is not very practical. We used to use a Windows box for this, and didn't have the share problem nearly as much as we do now. I will try upgrading to the latest version.

 

Is there an upgrade tool / exe to upgrade our USB stick, or should we just start up a new trial on a new key? What do you think is best?

Link to comment

That is exactly how we have been navigating to it: IP address 192.168.8.86, a fixed IP outside of our DHCP scope, in a reserved area of the router for fixed networking assets. This is the same method we used on the Windows storage box, and it is simple to navigate to, but the shares often just do not respond, or they browse for a minute and disappear halfway through copying files from the shares. Most annoying, as when it works it hammers along!
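One way to pin down when the server stops answering by IP is to leave a small probe running on a workshop PC and log every change. The sketch below assumes the web GUI listens on TCP port 80 and the shares on TCP port 445 (the usual defaults, not confirmed anywhere in this thread); `port_open`, `probe` and `watch` are made-up names.

```python
import socket
import time
from datetime import datetime


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host, ports):
    """Return {port: reachable} for each port in ports."""
    return {port: port_open(host, port) for port in ports}


def watch(host, ports=(80, 445), interval=60, rounds=None):
    """Print a timestamped line whenever the reachability of a port changes.

    ports and interval are assumptions; rounds=None means run until stopped.
    """
    last = None
    count = 0
    while rounds is None or count < rounds:
        state = probe(host, ports)
        if state != last:
            print(datetime.now().isoformat(timespec="seconds"), host, state)
            last = state
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval)
```

Running `watch("192.168.8.86")` gives a timeline to compare against the server's syslog. If both ports drop at the same moment, the whole box or its network link has gone quiet; if only one drops, the matching service is the more likely culprit.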

Link to comment
  • russell changed the title to Unreliable and Unresponsive

Personally I’d boot into Safe mode with no plugins and run with no VMs or anything complicated, to make sure you’re running as basic a system as possible. Something is tripping up your system, and I’m a firm believer in starting from scratch and going step by step. 

 

Problems I’ve had in the past: bad cables, a faulty power supply, a faulty motherboard and a bad drive or 2. 

 

I’ve never had any browsing issues from a Windows machine when my unRAID machine is set to a static IP. My Mac doesn’t seem to browse as well, but honestly I don’t browse the network on my Mac so I never pursued it. lol 

Link to comment

Thanks for the ideas, most helpful. I have tried safe mode, and I have just made a new USB boot stick on the latest version, removed all my disks, and am starting fresh. I have already tried re-cabling; this time around I'm cabling again without the SATA hot-swap bay (an Icy Box internal hot-swap bay was used). I'm going to cable just 2 hard disks to the new install: 2 new 2TB SATA drives from the original server that were identified as having no errors.

So a virgin install, new cables, new HDDs, new RAID, tested RAM... here's hoping it works!

Link to comment

Good luck! I personally don't have any issues with networking, although occasionally DNS does not work perfectly and I scratch my head and do a workaround. For example, my Windows VM cannot access the server by its hostname, so I put an entry in the "hosts" file, which takes care of that. I have my always-on unRAID server set as local master. Vanilla settings and IP address range. And no holes in my firewall for remote access (I use Teamviewer, which does not require any).

 

You might have held back the info about being computer repair guys, as it opens you up to a bit of ribbing :)

 

@HellDiverUK and @unevent - behave yourselves!

 

But I'm sure you'll figure this out quickly and be able to help some of our users who often have hardware issues and trouble isolating them.

 

Let us know how things go! We are a mostly helpful group and enjoy helping our fellow unRAIDers out.

 

-SSD

 

Link to comment

@SSD, thanks for the helpful advice. Yes, in hindsight we should have left off our background. We thought it relevant to help rule out the basic obvious stuff; we are pretty full bottle on Windows, but babies in Un-Raid. I am currently on a fresh USB, a separate network, and only 2 hard drives, all re-cabled, and, fingers crossed, it seems stable. It is currently on a standalone network, so it is not interfacing with the workshop yet. If this stays stable, the problem will most likely be a networking fight going on within our network.

 

I have always thought that might be the issue, but needed to try and isolate all possibilities. UN-RAID is fantastic in its features, and we certainly were not setting out to blame it. I bet you have many thousands of happy customers.

 

So for now, we will keep testing this very basic virgin installation, and if it is stable for a few days, we will try to re-introduce it to the rest of the network.

 

Thanks again for the helpful advice, and of course we will be more than happy to offer hardware advice if ever needed.

 

 

 

Link to comment

Good luck. You’ll get it licked in no time. It’s a pretty cool OS that honestly, once you figure out the little how-tos, you’ll never go to something else. 

 

Just remember we are all here to help. Occasionally we give each other grief, but typically it’s all in good fun. Glad I never mentioned I used to work on Nuclear Missiles when I was on a Sub in the Navy. xD I’m sure I’d never hear the end of it either. 

Link to comment
