troubleshooting hung docker



Hi there,

I'm having trouble with my Docker containers not stopping when I try to stop the array. Everything hangs, and I can't even kill them off with docker kill or docker rm -f. I first noticed this yesterday when I tried to stop a container from the web UI and couldn't. The GUI hung and I couldn't even restart from the command line. A scary hard reset later and everything was fine, but now it's happened again while trying to upgrade a container.

 

I have no idea where to go from here. Where do I look to determine what's causing this? How do I kill Docker off? I'm pre-clearing a disk right now, so I reeeeeealy don't want to restart.

 

Thanks.

Link to comment


Since you can telnet in, run:

diagnostics

 

and if that hangs

 

cp /var/log/syslog /boot/syslog.txt 

and post the file when it happens again.

Link to comment

Diagnostics runs fine. Output attached. I looked through the Docker log... the only thing I see there is the "name conflict for CouchPotato" error - that's from when I tried to upgrade the container that started the problem today.

 

I have tried removing some CA plugins (the auto-update one, for starters) to see if that changes things... no improvement.

 

Thanks for the help.

 

Update: I tried "docker exec -it CouchPotato /bin/bash" to see if I could get into the container... it does nothing for about 3 seconds and then just dumps me back to the host prompt. The same command on a container I haven't tried to stop gives me a root@<container-id> prompt immediately. With the terminal not responding, it really looks like these containers are stuck at the tail end of a shutdown.
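One quick way to test that theory from the telnet session is to look for processes stuck in uninterruptible sleep ("D" state), which usually means they're blocked on disk I/O and can't even be SIGKILLed. This is a generic sketch with made-up process names; on the live box you'd pipe real ps output instead of the sample text:

```shell
# Sample `ps -eo pid,stat,comm` output (names are illustrative):
ps_sample='  PID STAT COMMAND
  612 Ss   dockerd
 1234 D    CouchPotato
 1300 Sl   plex'
# Keep only processes whose state starts with D (uninterruptible sleep).
# On a real system:  ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
echo "$ps_sample" | awk 'NR > 1 && $2 ~ /^D/ {print $1, $3}'
```

If a container's processes show up in D state, no signal will move them until the I/O they're waiting on completes.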

 

One other thing to note: I am running 2 other processes - the pre-clear already mentioned, and I am zeroing an old disk (still in the array) so I can remove it without having to rebuild parity (according to this process). The zeroing always kinda screws with the system because one of the data drives is losing its file system. I have nothing mapped to that disk, only to user shares, and that drive is excluded from all shares, but I thought it was worth mentioning.

 

 

 

Cheers

knox-diagnostics-20170108-2021.zip

Link to comment

Haven't seen that before. Best I can suggest would be to edit docker.cfg in the config folder on the flash drive and disable Docker through it.

 

Then from the command prompt do:

 

powerdown -r 

 

Which should be able to restart the server. After it comes back alive, delete the docker.img from the Docker tab and recreate it, then add the apps back in via CA and the Previous Apps section.

 

Sent from my SM-T560NU using Tapatalk

 

 

Link to comment

The powerdown command won't work either... the system is TOTALLY locked up by Docker. Docker won't quit, the array won't stop, so nothing can be done. I set the Docker config to "no" as you suggested, and I guess I have to go through another hard reboot (the pre-clear just finished).

 

Before rebooting, I ran the diagnostics (attached), but here's the only indication of a problem I can find, from docker.log:

 

time="2017-01-09T13:44:12.367055840-03:30" level=info msg="Container 2d9e18b61a4dd529ab1924b36aa5312d43324fa5a0665b3984a95668a6f23d63 failed to exit within 10 seconds of SIGTERM - using the force"

time="2017-01-09T13:44:22.367326715-03:30" level=info msg="Container 2d9e18b61a4d failed to exit within 10 seconds of kill - trying direct SIGKILL"
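For context, those two log lines are Docker's normal stop sequence: SIGTERM, a grace period (10 seconds by default, tunable with docker stop -t), then a fallback to SIGKILL. Here's a minimal shell sketch of that same pattern against a throwaway process - the sleep stands in for a container's main process, and a 1-second grace period stands in for Docker's 10:

```shell
sleep 300 &             # stand-in for a container's main process
pid=$!
kill -TERM "$pid"       # polite request first (SIGTERM)
sleep 1                 # grace period (Docker's default is 10s)
if kill -0 "$pid" 2>/dev/null; then
    kill -KILL "$pid"   # the "using the force" line in the log
fi
wait "$pid" 2>/dev/null || true
echo "process gone"
```

The catch in this thread: a process stuck in uninterruptible (D-state) I/O won't die even to SIGKILL, which is why the daemon logs the "direct SIGKILL" line and the container still keeps running.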

 

Anyone else have any tips here?

knox-diagnostics-20170109-1411.zip

Link to comment

you can try

/etc/rc.d/rc.docker stop
umount /var/lib/docker
and then see if the powerdown will work.... 

But I doubt either command will work properly

 

After I tried the powerdown -r, I looked at "ps aux" and the /etc/rc.d/rc.docker stop had been running for 40 minutes before I got bored and hard reset the box. It's back up now without Docker, so I'll try removing the img as you suggested.

Link to comment


Like I said, I have no clue what went wrong on the update - never seen it before, and my apps update every week...
Link to comment

Ok.. still no improvements.

 

Renamed the docker.img to docker.old and started fresh. Added a new container for Plex and it started fine - I tested Plex and then tried to shut it down (clicked Stop in the UI). The UI hung for about 20 seconds and then I got an "Execution error Error code" pop-up on the screen. The docker.log shows the same thing...

 

time="2017-01-09T14:53:45.536132191-03:30" level=info msg="API listen on /var/run/docker.sock"
time="2017-01-09T15:08:08.694244741-03:30" level=info msg="Container c30d84dab0d26e609175a008cf49db034ea02b20979747192513d4801cfb5477 failed to exit within 10 seconds of SIGTERM - using the force"
time="2017-01-09T15:08:18.694564666-03:30" level=info msg="Container c30d84dab0d2 failed to exit within 10 seconds of kill - trying direct SIGKILL"

 

And the container is still running, even though Plex has shut down.

 

Any new ideas?

 

EDIT: eventually the container did shut down (nothing listed in docker ps), but the logs don't show anything...

 

Link to comment

OK.. moving a bunch of data now.. but when it's done, I'll run the diagnostics and reboot. Probably a hard reboot again.. yikes!

 

Something is screwy, because this has only happened since I went to dual parity. docker.img is not on the array, so I don't really see how it's related, but there's gotta be something holding up the Docker processes.

Link to comment

Ok.. An update for you all because this is not a problem any more.

 

It appears that when you are working the array hard and have a disk tied up, Docker won't shut down cleanly - or at least my containers won't. Today I was tail -f'ing the docker.log from the earlier troubleshooting, and in the same second that my dd process (I'm zeroing an old disk for removal) ended, the Docker log updated with the locked-up containers exiting and the updates I had requested processing.

 

I'm not sure if this is a known behaviour, a limitation, or a bug, but it's pretty easy to get around now that I know. Someone else will probably run into it at some point, though.
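For anyone hitting this later: assuming the root cause really is the zeroing dd starving the containers of disk I/O, one possible workaround is to run the dd at idle I/O priority with ionice (part of util-linux; the idle class is honoured by the CFQ/BFQ I/O schedulers, so this is scheduler-dependent). The device in the comment is illustrative only; the line actually executed below writes to /dev/null so it's harmless to try anywhere:

```shell
# On the real server you would wrap the zeroing command, e.g. (device path
# is illustrative -- use the correct md device for the disk being removed):
#   ionice -c 3 dd if=/dev/zero of=/dev/md3 bs=1M status=progress
# Harmless stand-in so the pattern can be exercised anywhere:
ionice -c 3 dd if=/dev/zero of=/dev/null bs=1M count=8 status=none \
  && echo "low-priority zeroing pass done"
```

With -c 3 (idle class), the dd only gets disk time when nothing else wants it, so container shutdowns shouldn't be starved.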

 

whiteatom

 

Link to comment
  • 8 months later...

Hi,

 

I have a similar problem. For a couple of weeks I haven't been able to stop the Docker containers, which leads to unRAID hanging on shutdown. I noticed the issue when I tried to update one of the containers:

After the download it said:

stopping docker: error

removing image: error

starting docker: unable to start because a docker with the same name is running

 

/etc/rc.d/rc.docker stop now runs forever and shutdown does not work. I have had to force a shutdown several times, resulting in parity checks with up to 20 bad sectors.

 

I recently removed the Docker image file and reset everything, but now I have those errors again.

 

Diagnostics are attached

server-diagnostics-20171002-1503.zip

Link to comment
  • 6 months later...

Hi there,

 

I'm experiencing the same issue. The NZBGet container is freezing and I can't stop or remove it. This happened after adding a volume mapping -v /mnt/usr/Exchange/intermediate:/intermediate as suggested by the install documentation for linuxserver/nzbget (https://hub.docker.com/r/linuxserver/nzbget/).

I've tried using the command line, but no success: docker stop <container id>

 

Diagnostics are attached :

tower-diagnostics-20180426-1936.zip

 

Thanks for your help.

Link to comment
2 minutes ago, Slamer said:

This was happened after adding a volume mapping -v /mnt/usr/Exchange/intermediate:/intermediate as suggested by the install documentation from linuxserver/nzbget

 

I hope this is a typo, since /mnt/usr does not correspond to any actual storage and so would be a new folder created in RAM. All of the unRAID user shares are at /mnt/user.

 

Also looks like corruption in the cache pool

Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Apr 26 10:37:14 Tower kernel: blk_partition_remap: fail for partition 1
Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
Apr 26 10:37:14 Tower kernel: blk_partition_remap: fail for partition 1
Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
Apr 26 10:37:14 Tower kernel: blk_partition_remap: fail for partition 1
Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 2, rd 2, flush 0, corrupt 0, gen 0

 

Link to comment

Those are read/write errors on cache2:

 

Apr 26 10:37:14 Tower kernel: sd 1:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Apr 26 10:37:14 Tower kernel: sd 1:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 00 60 08 a0 00 00 20 00
Apr 26 10:37:14 Tower kernel: print_req_error: I/O error, dev sdb, sector 6293664

 

Check cables, then run a correcting scrub
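For reference, a sketch of that check (the /mnt/cache mountpoint is an assumption; adjust to your pool). btrfs scrub start and btrfs dev stats are the actual btrfs-progs commands; the awk line below just filters sample dev-stats output down to the non-zero counters you'd care about afterwards:

```shell
# Commands you would run on the server (mountpoint assumed to be /mnt/cache):
#   btrfs scrub start /mnt/cache     # scrub the pool, repairing from the good copy
#   btrfs dev stats /mnt/cache       # per-device error counters
# Filtering sample `btrfs dev stats` output for non-zero counters:
stats='[/dev/sdb1].write_io_errs   2
[/dev/sdb1].read_io_errs    2
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0'
echo "$stats" | awk '$2 > 0 {print $1, "is non-zero:", $2}'
```

If the counters keep climbing after reseating the cables, the drive or the eSATA/USB bridge itself is suspect.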

Link to comment

Hi,

I finally succeeded in restarting the server. After that I started a parity check, and so far it has reported 535,048,090 corrected sync errors after 18 hours. The estimated end time is in 23 hours! 

 

Is it normal behaviour to get that many corrections from a parity check? Is the issue coming from the HDDs? What's your recommendation for investigating? 

 

For information, I'm using:

2 SSDs: 250GB (eSATA) + 500GB (USB3.0) for cache

2 HDs: 2*3TB (eSATA) for parity

6 HDs: 2*3TB (eSATA) + 4*2TB (USB3.0) for data

 

Thanks a lot.

Link to comment
  • 2 years later...

I just had this problem with Plex. Couldn't terminate it, the web page just showed a 503, and console access was instantly disconnected.

How I fixed it was to go to the Docker tab, switch to advanced mode, and force update it. Voila, up and running again.

Link to comment
