Regular out of memory problems



I am suddenly getting regular out of memory issues on my server.  This has not been a problem until the last week or so and it is now happening every day or two as reported by Fix Common Problems.

 

I have 16GB of RAM and the Dashboard shows 45% used but Dynamix System Stats plugin shows <1GB free and 4GB cached so I don't know why they don't agree.
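
One possible reason the two readouts disagree is that they treat page cache differently: cached memory is not "free", but much of it can usually be reclaimed when programs need it. Below is a minimal Python sketch (not taken from the Dashboard or the plugin) that reads the standard Linux /proc/meminfo file and prints both views side by side. The function names are made up for illustration, and how either tool actually computes its figures is an assumption.

```python
# Sketch: compare "free" memory with "used excluding cache" from /proc/meminfo.
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in KiB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Return (total, free, cache, used_excluding_cache), all in KiB."""
    total = fields["MemTotal"]
    free = fields["MemFree"]
    cache = fields.get("Buffers", 0) + fields.get("Cached", 0)
    used = total - free - cache
    return total, free, cache, used


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # only present on Linux
        total, free, cache, used = summarize(parse_meminfo(meminfo.read_text()))
        print(f"total:               {total / 1024:.0f} MiB")
        print(f"free:                {free / 1024:.0f} MiB")
        print(f"buffers + cached:    {cache / 1024:.0f} MiB")
        print(f"used (excl. cache):  {used / 1024:.0f} MiB ({100 * used / total:.0f}%)")
```

A low "free" figure next to a moderate "used" percentage would be consistent with most of the difference sitting in cache.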

 

I don't have any VMs running but I do have a handful of Docker containers (the biggest one being Crashplan, using ~4GB RAM as reported by Docker).

 

Attached is my diagnostic report that shows the Out of Memory and it killing Java.

aeris-diagnostics-20170719-1107.zip

Link to comment

I had this problem a few months back.  I fixed it by installing the "Tips and Tweaks" plugin and making the following two changes on its 'Tweaks' tab.

 

Disk Cache 'vm.dirty_background_ratio' (%):    to 2%

Disk Cache 'vm.dirty_ratio' (%):                           to 4%

 

I have not had the problem since then.  (These two variables limit how much RAM the OS may fill with 'dirty' data, that is, writes it has accepted but not yet flushed to the disk(s).  The OS delays those writes to improve responsiveness for interactive user applications.  Keep in mind that the defaults were set back when 1GB was a lot of RAM.)

 

EDIT: (September 15, 2017) These settings may not be optimal for servers with large amounts of RAM (say more than 8GB).  There are a lot of very knowledgeable folks who have set them as follows without any issues:

 

Disk Cache 'vm.dirty_background_ratio' (%):    to 1%

Disk Cache 'vm.dirty_ratio' (%):                           to 2%

 

There is no magic in the actual numbers themselves.  However, the Disk Cache 'vm.dirty_ratio' should be twice the value of the Disk Cache 'vm.dirty_background_ratio'.  If you are interested in the what and why of this Disk Caching scheme, read the next post.  With the speed and multi-core threading of modern CPUs, I wonder if there are any tangible benefits from the Disk Caching today for most users.

 

 

Edited by Frank1940
Link to comment

Have you checked in the support thread for Crashplan?  (I seem to recall that some plugin had issues with maintaining or building a db in memory, or something like that, which could trigger out of memory errors, but I can't remember which one it was.)

 

I delayed posting this reply because I was hoping someone else would jump in.  I downloaded your diagnostics file and the syslog is filled with OOM events.  I am not enough of a guru to understand exactly what is going on, and I was hoping that someone with a lot more knowledge would jump in.

 

I have the feeling that they won't cause a problem until they kill some process that can't be restarted.  (unRAID does have at least one of those...)  But it is always good to figure out what is causing these issues and see if they can be prevented in some way. 

Link to comment
  • 2 weeks later...
On 7/19/2017 at 1:08 PM, Frank1940 said:

I had this problem a few months back.  I fixed it by installing the "Tips and Tweaks" plugin and making the following two changes on its 'Tweaks' tab.

 

Disk Cache 'vm.dirty_background_ratio' (%):    to 2%

Disk Cache 'vm.dirty_ratio' (%):                           to 4%

 

 

Personally I used 1% and 2% and it seems to have fixed my OOM errors. I was getting OOM when Mover ran.

Link to comment
18 hours ago, coolspot said:

 

Personally I used 1% and 2% and it seems to have fixed my OOM errors. I was getting OOM when Mover ran.

May I ask how much memory you have in total?  I also got the OOM while Mover was running, with lots of Transmission jobs.

Edited by georgez
Link to comment

Remember, this is the amount of RAM that is reserved for 'delayed' writes to the hard disks.  All OSes do this.  The reason for this 'feature' can be easily explained.  Back in the day when the 20MHz 80386 was king of the hill, when a program like a word processor did an auto-save of a document while you were typing, the letters would suddenly stop appearing on the screen.  When the save was complete, they would suddenly be there (hopefully).  Needless to say, this was very disconcerting to most fast touch typists.  To reduce the impact of this problem, this delayed-write scheme was devised.  The OS would wait until there was a pause in the user's activity (i.e., typing) and then do the writes.

Remember also that 1 or 2MB of RAM was all most computers had!  So 20% of that memory was approximately 200-400KB.  If you have a system that has 8GB of RAM, you are allotting 1.6GB!  And your processor is running about 200 times faster.  Don't be afraid to try low values; just keep one twice the size of the other.  You can even make the argument that it isn't needed at all on a server.  (This 'feature' is the reason that you have to eject all USB drives: to guarantee that everything in that buffer is written to the drive before its removal.)
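
The percentage arithmetic above can be sketched in a few lines of Python. This is only an illustration of the post's back-of-the-envelope numbers: the function name is hypothetical, decimal gigabytes are used to match the post, and the kernel applies these percentages to available (free plus reclaimable) memory rather than to total installed RAM, so real thresholds will be somewhat lower.

```python
# Sketch: rough upper bound on delayed-write ("dirty") data for a given ratio.
GB = 1000**3  # decimal units, as used in the post
MB = 1000**2


def dirty_limit_bytes(ram_bytes, ratio_percent):
    """Approximate bytes allowed for dirty data at the given percentage of RAM."""
    return ram_bytes * ratio_percent // 100


if __name__ == "__main__":
    ram = 8 * GB
    for ratio in (20, 4, 2, 1):
        limit = dirty_limit_bytes(ram, ratio)
        print(f"{ratio:>2}% of {ram // GB}GB -> {limit / MB:.0f} MB")
```

Changing `ram` to your own installed memory gives a quick feel for how much the suggested 1% to 4% values shrink the buffer compared with 20%.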

 

EDIT:  I can remember that a lot of us actually turned this feature off in the early MS Windows systems because they were so unstable.  You wanted to make sure that these auto-save writes were completed in case the d@mn thing did one of its three expected daily crashes before the file was completely written to the disk.

Edited by Frank1940
Link to comment
On 8/6/2017 at 8:20 PM, georgez said:

May I ask how much memory in total you have ? I also got the OOM while mover's running, with lots of Transmission jobs.

 

I have 8GB of RAM.

 

I run Transmission as a docker - with maximum of 5 downloads.

 

Ever since I made the tweak I haven't received an OOM error.

 

Link to comment

I'm also getting OOM errors when the mover starts.  I just tried to alter those two settings:

Disk Cache 'vm.dirty_background_ratio' (%):    to 2%

Disk Cache 'vm.dirty_ratio' (%):                           to 4%

 

My question is...  Are they changed on the fly or do I have to reboot the system?

 

Jim

Link to comment
1 hour ago, jbuszkie said:

My question is...  Are they changed on the fly or do I have to reboot the system?

 

 

Great question.  Unfortunately, I don't have the answer, but I tend to suspect that they are not changed on the fly.  It only takes a few minutes to reboot the system, and that is what I do whenever I make a change of this type.
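
Whether the plugin applies its settings immediately is not confirmed anywhere in this thread, but the Linux kernel exposes both values as files under /proc/sys/vm, so one way to see what is in effect right now, without rebooting, is to read them back. The Python sketch below does that and compares the live values with the ones you intended to set; the helper names and the expected values in the example are assumptions for illustration.

```python
# Sketch: read the live vm.dirty_* values and compare them with what was set.
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/vm")


def read_vm_setting(name, base=SYSCTL_DIR):
    """Return the integer value of a vm.* setting, or None if it cannot be read."""
    try:
        return int((Path(base) / name).read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


def check_settings(expected, base=SYSCTL_DIR):
    """Map each setting name to (live value, wanted value, whether they match)."""
    result = {}
    for name, wanted in expected.items():
        live = read_vm_setting(name, base)
        result[name] = (live, wanted, live == wanted)
    return result


if __name__ == "__main__":
    wanted = {"dirty_background_ratio": 2, "dirty_ratio": 4}
    for name, (live, want, ok) in check_settings(wanted).items():
        status = "matches" if ok else "differs"
        print(f"vm.{name}: live={live} wanted={want} ({status})")
```

If the live values already match right after applying the change in the plugin, a reboot should not be needed for them to take effect.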

Link to comment
  • 1 month later...
  • 7 months later...
On 8/6/2017 at 10:15 PM, Frank1940 said:

You can even make the argument that it isn't needed at all on a server. 

 

Frank, I'm running into OOM errors as well, since an update to UR about a year ago.  I've been dealing with a braindead server after 1-3 days.  I've applied the tweak (1%/2% on a 16GB server), so we'll see if we get past a couple of days.  You mentioned in the above post "You can even make the argument that it isn't needed at all"...SO...is there a way to turn it off?  If so, consequences?  Thanks in advance...

Edited by jeffreywhunter
Link to comment
20 minutes ago, jeffreywhunter said:

SO...is there a way to turn it off?

 

Possibly, but why would you?  With your current settings, you are now down to 160MB of RAM.  That is a rather small block (I believe it is a contiguous block, which is one of the problems, since it is allocated on bootup).  If you are having a problem with that setting, you have something else going on.  I would start with the logs to see if there is something there, like thousands of duplicate lines.

Link to comment
41 minutes ago, Frank1940 said:

 

Possibly, but why would you?  With your current settings, you are now down to 160MB of RAM.  That is a rather small block (I believe it is a contiguous block, which is one of the problems, since it is allocated on bootup).  If you are having a problem with that setting, you have something else going on.  I would start with the logs to see if there is something there, like thousands of duplicate lines.

 

So after I made the tweak, I rebooted and started some backups.  Server crashed within a couple of hours.  Interesting warning about 50 sec before it crashed. 

 

May 7 12:24:54 HunterNAS php-fpm[7897]: [WARNING] [pool www] server reached max_children setting (20), consider raising it

 

Tail of Syslog attached.

 

Hunternas Log Tail 20180507.txt

Link to comment
4 minutes ago, jeffreywhunter said:

 

So after I made the tweak, I rebooted and started some backups.  Server crashed within a couple of hours.  Interesting warning about 50 sec before it crashed. 

 

May 7 12:24:54 HunterNAS php-fpm[7897]: [WARNING] [pool www] server reached max_children setting (20), consider raising it

 

Tail of Syslog attached.

 

Hunternas Log Tail 20180507.txt

Did you look at that file?

 

Post your complete diagnostics.

Link to comment

You rebooted before getting these. We need the diagnostics from when you are having the problem.

 

Not sure if you know this now but apparently you didn't at one time based on some other posts you had on another thread.

 

Rootfs isn't the flash drive. The flash drive is /boot. Rootfs is RAM. Your FTP issue from that log tail you posted makes me wonder if you aren't filling up RAM with something using an incorrect path somewhere.

Link to comment
29 minutes ago, trurl said:

You rebooted before getting these. We need the diagnostics from when you are having the problem.

 

Not sure if you know this now but apparently you didn't at one time based on some other posts you had on another thread.

 

Rootfs isn't the flash drive. The flash drive is /boot. Rootfs is RAM. Your FTP issue from that log tail you posted makes me wonder if you aren't filling up RAM with something using an incorrect path somewhere.

 

The FTP thing could be an issue.  I'm looking at the FTP config and I turned on the transfer log using the path "TransferLog /mnt/user/My Backups/proftpxferlog", which is incorrect: I'm missing the \ and it should be "TransferLog /mnt/user/My\ Backups/proftpxferlog".

 

I've made that change and rebooted.  I'm agreeing with your diagnosis.

 

Is there a command or log somewhere that would indicate the RAM usage/issue the path could create?
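
Since the root filesystem lives in RAM, one generic way to look for something filling it is to total file sizes under / while skipping anything that sits on a different mount (the flash drive, the array, /proc and so on). The Python sketch below is one way to do that; it is not an unRAID tool, the function names are hypothetical, and it reports apparent file sizes, which can overstate real usage for sparse files.

```python
# Sketch: per-entry size totals under a directory, staying on one filesystem.
import os
import sys


def _same_dev(path, dev):
    """True if path lives on the filesystem identified by device number dev."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def _tree_size(top, dev):
    """Sum apparent file sizes under top without crossing mount points."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        # Prune subdirectories that belong to other filesystems.
        dirnames[:] = [d for d in dirnames if _same_dev(os.path.join(dirpath, d), dev)]
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable
    return total


def usage_by_top_level(root):
    """Return {path: bytes} for each entry directly under root on root's filesystem."""
    root_dev = os.lstat(root).st_dev
    totals = {}
    with os.scandir(root) as entries:
        for entry in entries:
            try:
                st = entry.stat(follow_symlinks=False)
            except OSError:
                continue
            if st.st_dev != root_dev:
                continue  # a separate mount, not part of this filesystem
            if entry.is_dir(follow_symlinks=False):
                totals[entry.path] = _tree_size(entry.path, root_dev)
            else:
                totals[entry.path] = st.st_size
    return totals


if __name__ == "__main__" and len(sys.argv) > 1:
    sizes = usage_by_top_level(sys.argv[1])
    for path, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{size / 1024**2:10.1f} MiB  {path}")
```

Run with / as the argument, the ten largest entries are listed first; a directory that keeps growing between runs would be the place to look for a misdirected log or download path.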

 

As to the diagnostics, once the server crashes, I don't have access.  That said, sometimes after the GUI locks up, I can still access the command line from the server.  Is there a way to pull diagnostics from that?  Certainly could pull the syslog and other files.  If it happens again, I'll try that.

 

Someone also shared a script to capture to the flash drive, but I've not had a couple hours to spend learning how that works (I'm just a hack...).

 

Hopefully the error in the path is the source.  Time will tell.  Thanks for the ideas.

Link to comment
36 minutes ago, jeffreywhunter said:

sometimes after the GUI locks up, I can still access the command line from the server.  Is there a way to pull diagnostics from that?

 

See the Need Help "sticky" near the top of this same subforum and also linked in my sig.

Link to comment
1 hour ago, Frank1940 said:

The Fix Common Problems plugin has a troubleshooting mode that can be turned on that will write the diagnostic file plus a 'tail' log to the logs folder/directory on the flash drive on a periodic basis.  You could try that.

 

I have turned that on.  It says in the confirmation screen:

 

Quote

When running in this mode the syslog is continually captured to the flash drive, and a diagnostics dump is performed every 30 minutes

 

Is there a way to make it dump quicker than 30 minutes?  i.e. if the problem only shows up in the last 29 min before the dump...  ;)

Link to comment
1 hour ago, Frank1940 said:

The Fix Common Problems plugin has a troubleshooting mode that can be turned on that will write the diagnostic file plus a 'tail' log to the logs folder/directory on the flash drive on a periodic basis.  You could try that.

 

Also, just noticed in the tail these messages after I started FCP...

Quote

May 7 18:55:07 HunterNAS root: Fix Common Problems: Troubleshooting mode activated
May 7 18:55:07 HunterNAS root: Fix Common Problems: Capturing diagnostics. When uploading diagnostics to the forum, also upload /logs/FCPsyslog_tail.txt on the flash drive
May 7 18:56:05 HunterNAS root: Fix Common Problems Version 2018.04.25
May 7 18:56:06 HunterNAS root: Fix Common Problems: /var/log currently 1 % full
May 7 18:56:06 HunterNAS root: Fix Common Problems: rootfs (/) currently 7 % full

 

Anything to worry about here - i.e. rootfs filling too full?
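
The two "currently N % full" figures in that log can be tracked by hand between the plugin's dumps. A minimal Python sketch using the standard library's shutil.disk_usage follows; how Fix Common Problems computes its own percentages is an assumption, so the numbers may not match exactly, and the helper names are made up for illustration.

```python
# Sketch: report how full the RAM-backed root filesystem and /var/log are.
import os
import shutil


def percent_used(used, total):
    """Percentage of total that is used; 0.0 for an empty pseudo-filesystem."""
    return 100.0 * used / total if total else 0.0


def percent_full(path):
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


if __name__ == "__main__":
    for path in ("/", "/var/log"):
        if os.path.exists(path):
            print(f"{path}: {percent_full(path):.0f}% full")
```

The thing to watch is the trend: a value that keeps climbing between checks would point to something writing into RAM.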

Edited by jeffreywhunter
Link to comment
