System Hanging - Need assistance identifying cause


Soup


Hi All,

 

My unRAID system has been randomly hanging for the past few weeks, and the only resolution is a hard reset. I've captured the logs from one of the events via remote syslog. What other information would be helpful for identifying the issue?

 

It isn't a complete hang; it seems more like resource starvation that prevents any type of connectivity (I'm still able to ping the box).

 

Any assistance would be greatly appreciated.

 

Thanks

 

Edit: I'm seeing a lot of this in the syslog I've captured, if that's helpful at all:


Oct 19 14:31:50 Tower kernel: Call Trace:
Oct 19 14:31:50 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Oct 19 14:31:50 Tower kernel: [<ffffffff810cb5b1>] warn_alloc+0x102/0x116
Oct 19 14:31:50 Tower kernel: [<ffffffff810d7980>] ? try_to_free_pages+0x9e/0xa5
Oct 19 14:31:50 Tower kernel: [<ffffffff810cbb67>] __alloc_pages_nodemask+0x541/0xc71
Oct 19 14:31:50 Tower kernel: [<ffffffff810d133c>] ? __page_cache_release+0x1d0/0x1df
Oct 19 14:31:50 Tower kernel: [<ffffffff810e95c2>] ? wp_page_copy+0x560/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff81103997>] alloc_pages_vma+0x183/0x1f5
Oct 19 14:31:50 Tower kernel: [<ffffffff810e90f7>] wp_page_copy+0x95/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff810ebdc0>] ? alloc_set_pte+0x322/0x490
Oct 19 14:31:50 Tower kernel: [<ffffffff810ea3e3>] do_wp_page+0x17a/0x5c8
Oct 19 14:31:50 Tower kernel: [<ffffffff810ee516>] handle_mm_fault+0xc72/0xf96
Oct 19 14:31:50 Tower kernel: [<ffffffff81042252>] __do_page_fault+0x24a/0x3ed
Oct 19 14:31:50 Tower kernel: [<ffffffff81042438>] do_page_fault+0x22/0x27
Oct 19 14:31:50 Tower kernel: [<ffffffff81680f18>] page_fault+0x28/0x30
Oct 19 15:13:59 Tower kernel: [<ffffffff81117aef>] ? get_mem_cgroup_from_mm+0x9c/0xa4
Oct 19 15:13:59 Tower kernel: [<ffffffff81102d82>] alloc_pages_current+0xbe/0xe8
Oct 19 15:13:59 Tower kernel: [<ffffffff810c92d4>] __get_free_pages+0x9/0x37
Oct 19 15:13:59 Tower kernel: [<ffffffff81046693>] pgd_alloc+0x16/0xf8
Oct 19 15:13:59 Tower kernel: [<ffffffff8104a40b>] mm_init+0x15f/0x1bc
Oct 19 15:13:59 Tower kernel: [<ffffffff8104b98f>] copy_process.part.4+0xc1d/0x1822
Oct 19 15:13:59 Tower kernel: [<ffffffff81122f1b>] ? get_empty_filp+0x4e/0x162
Oct 19 15:13:59 Tower kernel: [<ffffffff8110b189>] ? __slab_alloc.isra.15+0x26/0x39
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c72f>] _do_fork+0xb7/0x2af
Oct 19 15:13:59 Tower kernel: [<ffffffff81123047>] ? alloc_file+0x18/0x95
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c999>] SyS_clone+0x14/0x16
Oct 19 15:13:59 Tower kernel: [<ffffffff81002dbb>] do_syscall_64+0x157/0x1c7
Oct 19 15:13:59 Tower kernel: [<ffffffff8113984e>] ? fd_install+0x20/0x22
Oct 19 15:13:59 Tower kernel: [<ffffffff81573404>] ? SyS_socketpair+0x148/0x1a0
Oct 19 15:13:59 Tower kernel: [<ffffffff8167f5eb>] entry_SYSCALL64_slow_path+0x25/0x25
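In case anyone wants to check their own logs for the same pattern, something along these lines should pull the allocation warnings out of a remote syslog capture (the file path here is just an example; adjust it for wherever your syslog server writes):

# count how many allocation warnings the kernel has logged
grep -c 'warn_alloc' /var/log/remote/tower-syslog.log

# list the timestamp of each occurrence to see when the hangs start
grep 'warn_alloc' /var/log/remote/tower-syslog.log | awk '{print $1, $2, $3}'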

 


So it looks like this happens when the mover kicks in to migrate files from the cache drive to the array.

 

I identified that rsync was using a large percentage of the CPU and found three processes attempting to move the same file to the array. I killed them and the system returned to normal; a rough sketch of how I checked is below.
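This is roughly what I ran to spot and kill the runaway processes (the ps output format may differ slightly on your system, and the PID is obviously a placeholder):

# list rsync processes sorted by CPU usage; the cmd column shows which file each one is moving
ps -eo pid,pcpu,etime,cmd --sort=-pcpu | grep '[r]sync'

# kill a specific stuck process by PID
kill <PID>

# or force it if it won't exit
kill -9 <PID>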

 

Any thoughts?


The same thing, with the same errors, has been happening to me. The mover triggers (or I run it manually), rsync goes haywire and uses all, or nearly all, of the CPU on the box. Then the OOM reaper starts killing off threads, and eventually all shares disappear, the WebUI slows to a crawl, and everything connected to the array gets I/O errors.

I caught it in the act today and attempted to run diagnostics while it was happening, but the diagnostics script kept getting killed by the system. I was able to run 'killall rsync', after which everything became responsive again and I could complete the diagnostics I've attached for inspection.

I'm just getting started with unRAID, having moved from FreeNAS for the flexibility, but this is going to make me move back in short order. I have disabled the cache completely, so the mover shouldn't get me again, hopefully.
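For anyone else caught mid-hang, this is roughly the recovery sequence that worked for me (assuming you can still get a shell over SSH or the local console; the last step is unRAID's built-in diagnostics command, which generates the zip):

# kill every rsync instance the mover spawned
killall rsync

# confirm nothing rsync-related is still running
ps -ef | grep '[r]sync'

# once the box is responsive again, grab diagnostics to attach here
diagnostics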

newnas-diagnostics-20171024-0938.zip


I disabled the cache drive on the shares where I was experiencing the issue, and it hasn't happened since.

 

I'll re-enable the cache drive on one of them, adjust the settings mentioned in the linked post, and see if it happens again.

 

(There's 32 GB of RAM in this system; the other post mentions the issue is common on systems with more than 8 GB.)
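For anyone following along, the settings in question are the kernel's dirty-page writeback thresholds. I'm assuming the linked post recommends something along these lines; the exact values are just a starting point, not gospel:

# check the current values
sysctl vm.dirty_background_ratio vm.dirty_ratio

# with lots of RAM, the defaults (10/20) let gigabytes of dirty pages pile up
# before writeback starts; lowering them makes the mover flush to the array sooner
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=2

# to make this survive a reboot, append the same commands to the go file
# (/boot/config/go is where I'd put it; adjust if your setup differs)
echo 'sysctl -w vm.dirty_background_ratio=1' >> /boot/config/go
echo 'sysctl -w vm.dirty_ratio=2' >> /boot/config/go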

