Soup Posted October 23, 2017 (edited)

Hi all, my unRAID system has been hanging randomly for the past few weeks, and the only way to recover is a hard reset. I've captured the logs from one of these events via remote syslog. What other information would be helpful for identifying the issue? The hang isn't total; it looks more like resource starvation that blocks every kind of connectivity, although the box still responds to pings. Any assistance would be greatly appreciated. Thanks.

Edit: I'm seeing a lot of this in the syslog I captured, in case it's helpful:

Oct 19 14:31:50 Tower kernel: Call Trace:
Oct 19 14:31:50 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Oct 19 14:31:50 Tower kernel: [<ffffffff810cb5b1>] warn_alloc+0x102/0x116
Oct 19 14:31:50 Tower kernel: [<ffffffff810d7980>] ? try_to_free_pages+0x9e/0xa5
Oct 19 14:31:50 Tower kernel: [<ffffffff810cbb67>] __alloc_pages_nodemask+0x541/0xc71
Oct 19 14:31:50 Tower kernel: [<ffffffff810d133c>] ? __page_cache_release+0x1d0/0x1df
Oct 19 14:31:50 Tower kernel: [<ffffffff810e95c2>] ? wp_page_copy+0x560/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff81103997>] alloc_pages_vma+0x183/0x1f5
Oct 19 14:31:50 Tower kernel: [<ffffffff810e90f7>] wp_page_copy+0x95/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff810ebdc0>] ? alloc_set_pte+0x322/0x490
Oct 19 14:31:50 Tower kernel: [<ffffffff810ea3e3>] do_wp_page+0x17a/0x5c8
Oct 19 14:31:50 Tower kernel: [<ffffffff810ee516>] handle_mm_fault+0xc72/0xf96
Oct 19 14:31:50 Tower kernel: [<ffffffff81042252>] __do_page_fault+0x24a/0x3ed
Oct 19 14:31:50 Tower kernel: [<ffffffff81042438>] do_page_fault+0x22/0x27
Oct 19 14:31:50 Tower kernel: [<ffffffff81680f18>] page_fault+0x28/0x30

Oct 19 15:13:59 Tower kernel: [<ffffffff81117aef>] ? get_mem_cgroup_from_mm+0x9c/0xa4
Oct 19 15:13:59 Tower kernel: [<ffffffff81102d82>] alloc_pages_current+0xbe/0xe8
Oct 19 15:13:59 Tower kernel: [<ffffffff810c92d4>] __get_free_pages+0x9/0x37
Oct 19 15:13:59 Tower kernel: [<ffffffff81046693>] pgd_alloc+0x16/0xf8
Oct 19 15:13:59 Tower kernel: [<ffffffff8104a40b>] mm_init+0x15f/0x1bc
Oct 19 15:13:59 Tower kernel: [<ffffffff8104b98f>] copy_process.part.4+0xc1d/0x1822
Oct 19 15:13:59 Tower kernel: [<ffffffff81122f1b>] ? get_empty_filp+0x4e/0x162
Oct 19 15:13:59 Tower kernel: [<ffffffff8110b189>] ? __slab_alloc.isra.15+0x26/0x39
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c72f>] _do_fork+0xb7/0x2af
Oct 19 15:13:59 Tower kernel: [<ffffffff81123047>] ? alloc_file+0x18/0x95
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c999>] SyS_clone+0x14/0x16
Oct 19 15:13:59 Tower kernel: [<ffffffff81002dbb>] do_syscall_64+0x157/0x1c7
Oct 19 15:13:59 Tower kernel: [<ffffffff8113984e>] ? fd_install+0x20/0x22
Oct 19 15:13:59 Tower kernel: [<ffffffff81573404>] ? SyS_socketpair+0x148/0x1a0
Oct 19 15:13:59 Tower kernel: [<ffffffff8167f5eb>] entry_SYSCALL64_slow_path+0x25/0x25

Edited October 23, 2017 by Soup
Soup Posted October 23, 2017 (author)

So it looks like this happens when the mover kicks in to migrate files from the cache drive to the array. I noticed rsync was using a large percentage of the CPU and found three processes all attempting to move the same file to the array. I killed them and the system returned to normal. Any thoughts?
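For anyone hitting the same symptom, here is a rough sketch of the shell commands for spotting and clearing duplicate mover rsync processes. This is generic Linux usage rather than an unRAID-specific procedure, and the PID shown is a placeholder.

# Show every running rsync with its full command line; several
# entries listing the same source and destination file means the
# mover has effectively been started more than once
pgrep -af rsync

# Kill an individual duplicate by its PID (first column above)
kill <PID>

# Or stop every rsync at once if the box is already starving
killall rsync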
trurl Posted October 23, 2017

Tools > Diagnostics. Post the complete zip.
Soup Posted October 23, 2017 (author)

Attached: tower-diagnostics-20171023-1104.zip
Soup Posted October 23, 2017 (author)

It doesn't look like the right syslog was included in there, so here it is: syslog-2017-10-23.tgz
blak0137 Posted October 24, 2017

The same thing with the same errors has been happening to me. The mover triggers (or I run it manually), rsync goes haywire and uses all, or nearly all, of the CPU on the box. Then the OOM reaper starts killing off processes, and eventually all shares disappear, the WebUI slows to a crawl, and everything connected to the array gets I/O errors.

I caught it in the act today and attempted to run diagnostics while it was happening, but the diagnostics script kept getting killed by the system. After running 'killall rsync' everything became responsive again, and I was able to complete the diagnostics, attached for inspection.

I'm just getting started with unRAID, having moved from FreeNAS for the flexibility, but this is going to make me move back in short order. I have disabled the cache completely, so hopefully the mover won't get me again.

newnas-diagnostics-20171024-0938.zip
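A quick way to confirm that the OOM killer is what's reaping processes is to search the kernel log for its signature messages. A sketch, assuming the stock syslog location:

# Recent kernel messages from the OOM killer or failed allocations
dmesg | grep -iE 'oom|out of memory|page allocation failure'

# The same search against the persisted log, which reaches further back
grep -iE 'oom|out of memory|warn_alloc' /var/log/syslog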
JorgeB Posted October 24, 2017

This should help with OOM errors when running the mover with v6.3.5: (linked post)
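The linked post itself isn't preserved in this thread. The fix commonly cited for mover-related OOM on 6.3.x systems with lots of RAM is lowering the kernel's dirty-page writeback thresholds, so here is a minimal sketch on the assumption that this is what the link describes; the values are illustrative, not prescriptive.

# Check the current thresholds first
sysctl vm.dirty_background_ratio vm.dirty_ratio

# Lower them so dirty pages get flushed to disk sooner, keeping the
# mover's large copies from filling most of RAM before writeback
sysctl vm.dirty_background_ratio=1
sysctl vm.dirty_ratio=2

The ratios are percentages of total memory, which is why the defaults (10 and 20) bite harder the more RAM a box has.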
blak0137 Posted October 24, 2017

I will give that a try and see if I can cause it to happen again.
Soup Posted October 29, 2017 (author)

I disabled the cache drive on the shares where I was experiencing the issue, and it hasn't happened since. I'll re-enable the cache on one of them, adjust the settings mentioned in the linked post, and see if it happens again. (There is 32G of RAM in this system; the other post mentions the issue is common on systems with more than 8G.)
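One caveat if the sysctl tweak works: unRAID runs from RAM, so sysctl changes reset on every boot. A hedged sketch of one common way to persist them, assuming the stock /boot/config/go startup script and the illustrative values above:

# /boot/config/go is executed on every boot, so appending the
# sysctl commands there reapplies the tweak automatically
echo 'sysctl vm.dirty_background_ratio=1' >> /boot/config/go
echo 'sysctl vm.dirty_ratio=2' >> /boot/config/go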