coppit

Community Developer
  • Posts

    496

  • Gender
    Undisclosed

coppit's Achievements

Enthusiast (6/14)

21 Reputation

Community Answers

  1. Just to follow up on this, in case anyone else has similar problems. mprime would warn about hardware errors within 5 seconds of starting. (The memtest ran a complete test without any errors; there's a sketch of how I run mprime after this list.) I did two things that helped: 1) I saw a video where Intel CPUs can become unstable if the cooler is tightened too much on an ASRock Taichi motherboard. I removed my cooler and reinstalled it, tightening just until I felt resistance. That stopped mprime from failing quickly, but after about 5 minutes I still got another error. The parity check after doing this found about 60 errors... 2) I increased my CPU voltage by two steps (0.10 V, I think). That seems to have removed the last of the instability.
  2. Okay, finally a breakthrough. The server hung on the BIOS boot screen. My theory is that there is a problem with my hyperthreaded CPU cores 0-15; cores 16-31 are allocated to my VMs. The lower cores are used by Docker and Unraid. Disabling Docker basically idles those cores, since Unraid itself doesn't use much CPU, which is why things are stable in that configuration. (There's a sketch of how to map the hyperthread sibling pairs after this list.) I'm going to try disabling hyperthreading to see if that helps. I might have to replace the CPU. Or maybe the motherboard. Ugh. Since my CPU temps never go above 41°C even under load, I don't think it's overheating.
  3. It's been another 5 days without a crash. This time I changed my NPM to use bridge networking instead of br0. Now I'm going to enable PiHole, which uses br0. If it crashes again then we'll know it's related to br0.
  4. I woke up to a hung server. It seems Docker related. Next I'll try enabling specific containers, starting with Plex due to popular demand. Any other suggestions on how to debug this would be welcome.
  5. Oops. I missed that. Sorry. I disabled the Docker service, and the server ran for 5 days without any crashes. Before this, the longest it had run was about 2.5 days before a crash. Now I'm going to re-enable Docker, but only use containers with bridge or host networking. I suspect it's related to br0, so I've disabled NGINX Proxy Manager and two PiHole containers. I'll let this soak for another 5 days. One reason I suspect br0 is that I've had trouble with NPM routing requests to my bridge-mode containers. When I hopped into one of the containers, it couldn't ping the server (there's a sketch of that test after this list). I figured I would debug this later, but I mention it in case it might be relevant.
  6. Okay, I planned to run it in safe mode for several days, but after 53 hours it crashed. Syslog attached. I noticed that Docker was running. Could it be related to that? I'm getting intermittent warnings about errors on one of my pooled cache drives (there's a SMART-check sketch after this list). Could that cause kernel crashes? I replaced my power supply, increased all my fans to max, and passed Memtest86. CPU temp is 29°C, and the motherboard is 37°C. Should I try replacing the CPU? The motherboard? Is there a way to isolate the cause? syslog-20240117-145920.txt
  7. Here's another from last night. Instead of a null pointer dereference, this time there was a page fault for an invalid address. syslog-20240110-045742.txt
  8. I'm looking for some help with my server hanging. When it happens while I'm using it, things start failing one by one. Maybe the internet fails, but I can still SSH into the server from my VM. Then the VM hangs. Then I can't reach the web UI from my phone. At that point nothing responds, and I have to power cycle the server. More often, though, I come to the server and it's simply frozen. This was happening 1-2 times per day. My memory passes Memtest86+, and I replaced the power supply with no effect. I installed the driver for my RTL8125 NIC (even though my crashes don't seem exactly the same), and it had no effect; there's a sketch after this list showing how to confirm which driver is in use. I upgraded to 6.12.6 pretty much as soon as it came out. The crashes started happening in mid-December. I don't recall making any server changes around that time, but perhaps I did. I changed my VMs from br0 to virbr0 as an experiment, and my crashes now seem to happen only once every 2-3 days. (But now the VMs can't use PiHole DNS because they can't reach Docker.) So maybe the crashes are still related to my NIC? Attached is a syslog with a couple of call traces in it, as well as my system diagnostics. Any ideas would be appreciated. storage-diagnostics-20240109-1527.zip syslog-20231227-232940.txt
  9. I spent some time with SpaceInvaderOne this morning on this. We assigned the server a different IP and everything started working again. Then, for fun, we assigned the old IP to an iPad, and it exhibited the same behavior, so it's definitely related to my TP-Link XE75 router and not the server. (There's a sketch of a quick per-IP ARP check after this list.) Ed's only recommendation was to factory reset the router and set it up again. So this can be closed as not a bug. Separately, I also had an issue where the VMs couldn't be reached but could reach out. (That said, I also intermittently had issues with the VMs reaching the server.) We switched to bridge networking and that issue cleared up as well. I guess I was too eager to follow the release notes. There might be some issue with not using bridge networking, but at this point I don't want to delve into it.
  10. It was fun decoding the .plg file to manually download and install the files with my laptop. (CA doesn't work without internet.) Based on the output of "modinfo r8125", it looks like the driver installed okay after I rebooted; the driver-check sketch after this list shows the commands. I don't see any "network is unreachable" errors in the syslog, but that could be chance. Pinging the gateway still fails.
  11. After rebooting into normal mode, sshd is failing to stay up. But later in the log, ntpd comes up fine. What the heck? System diagnostics attached. Based on the "Network is unreachable" errors, this thread suggested that I disable C-states, but that didn't help. I manually ran "/etc/rc.d/rc.sshd start" after the server was up to get sshd running again (there's a sketch of that check-and-restart after this list). Thank god for the GUI console feature. I feel like my server is falling apart before my eyes... storage-diagnostics-no-sshd-20231107-0006.zip
  12. I stopped the Docker service entirely. Oddly, my Windows VM that I was using lost connectivity to the Unraid host at that point. Weird. I hopped on my laptop and disabled the VM service as well. Then I rebooted into safe mode and started the array. The server still can't ping the gateway or 8.8.8.8 (the basic connectivity triage I ran is sketched after this list). Safe mode diagnostics attached. This is with wg0, Docker, and KVM all disabled. storage-diagnostics-safe-mode-20231106-2338.zip
  13. There's probably a lot of flotsam and jetsam in there dating back to the 5.x days. Disabled the tunnel and its autostart, and rebooted. No change. Oops. I forgot to try the other things. I'll reply in a few minutes with those results as well.
  14. Thanks for the assistance. I see that it's defaulting to 1500. Did you see something where the MTU was different? I got no warning from FCP. I stopped VMs and Docker, and manually set it to 1500. Then I re-enabled VMs and Docker, and rebooted for good measure. Still no luck. (A command-line MTU check is sketched just below.)
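
A few command-line sketches for the steps mentioned above. For item 1, this is roughly how the mprime (Prime95) torture test can be run from a shell; the install path is just an example, and on a marginal setup the errors usually show up as "FATAL ERROR" / "Hardware failure detected" lines within seconds to minutes:

    # Run the torture test in the foreground (example path):
    cd /boot/tools/mprime && ./mprime -t

    # Errors are also appended to results.txt in the same directory:
    tail -f results.txt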
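
For item 2, a sketch of mapping which logical CPUs are hyperthread siblings before deciding what to pin to VMs or disable; the exact pairing (e.g. 0/16, 1/17, ...) depends on how the CPU enumerates its threads:

    # One row per logical CPU with its physical core and socket:
    lscpu -e=CPU,CORE,SOCKET,ONLINE

    # Sibling list per logical CPU, e.g. "0,16" means CPUs 0 and 16
    # share one physical core:
    for c in /sys/devices/system/cpu/cpu[0-9]*; do
        echo "$(basename "$c"): $(cat "$c"/topology/thread_siblings_list)"
    done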
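
For item 5, the quick test for whether a container can reach the host; the container name and server IP are examples, and the container needs ping installed. Note that with br0 (macvlan) networking, host-to-container traffic is normally blocked by design, so this check mainly applies to bridge-mode containers:

    # From the host, ping the server's LAN IP from inside a container:
    docker exec -it npm ping -c 3 192.168.1.10

    # See which Docker network each running container is attached to:
    docker ps --format '{{.Names}}' | xargs -n1 docker inspect \
        --format '{{.Name}}: {{range $k, $v := .NetworkSettings.Networks}}{{$k}} {{end}}'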
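
For item 6, a sketch of checking the cache drive that's logging intermittent errors; the device names are examples, so substitute the ones shown in the Unraid UI:

    # SATA/SAS pool member (example device):
    smartctl -a /dev/sdb

    # NVMe pool member (example device):
    smartctl -a /dev/nvme0

    # Recent kernel messages about the device:
    dmesg | grep -iE 'sdb|nvme0'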
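
For items 8 and 10, how to confirm which driver the RTL8125 NIC is actually bound to after installing the plugin (the interface name is an example):

    # Driver name and version for the interface:
    ethtool -i eth0

    # Kernel module in use for the PCI device:
    lspci -nnk | grep -A3 -i ethernet

    # Is the r8125 module present and loaded?
    modinfo r8125 | head -n 5
    lsmod | grep r8125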
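
For item 9, a sketch of checking whether another device (or a stale entry) is answering for a given IP, which is one way to narrow down this kind of per-IP weirdness; run it from another Linux machine on the LAN, with the interface and address as examples:

    # Duplicate address detection: exit 0 means nothing answered for
    # that IP, non-zero means some other device is claiming it.
    arping -D -I eth0 -c 3 192.168.1.10 && echo "IP looks free" || echo "Conflict"

    # What the local ARP cache currently thinks about that address:
    ip neigh show | grep 192.168.1.10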
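
For item 11, a sketch of the sshd check-and-restart using Unraid's Slackware-style rc script (the same path mentioned in the post):

    # Is sshd running at all?
    pgrep -ax sshd || echo "sshd is not running"

    # Start it via the rc script:
    /etc/rc.d/rc.sshd start

    # Look for hints about why it exited:
    grep -i sshd /var/log/syslog | tail -n 20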
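
For item 12, the basic connectivity triage in safe mode with Docker and VMs disabled; the gateway address is an example:

    ip -br addr              # which interfaces are up, and their IPs
    ip route                 # is there a default route, and via what?
    ping -c 4 192.168.1.1    # the gateway (example address)
    ping -c 4 8.8.8.8        # something past the gateway
    cat /etc/resolv.conf     # which DNS servers are configured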
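
For item 14, one way to check and force the MTU from the command line; interface names are examples, and on a stock Unraid setup the bridge is typically br0:

    # Current MTU on the NIC and the bridge:
    ip link show eth0 | grep -o 'mtu [0-9]*'
    ip link show br0  | grep -o 'mtu [0-9]*'

    # Force the standard 1500 if something changed it:
    ip link set dev eth0 mtu 1500
    ip link set dev br0  mtu 1500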