whiteatom Posted January 8, 2017

Hi there, I'm having trouble with my dockers not stopping when I try to stop the array. Everything just hangs and I can't even seem to kill them off with docker kill or docker rm -r. I noticed this yesterday when I tried to stop a docker from the web UI and couldn't. The GUI hung and I couldn't even restart from the command line. A scary hard reset later and everything was fine, but now it's happened again while trying to upgrade a docker. I have no idea where to go from here. Where do I look to determine what's causing this? How do I kill docker off? I'm pre-clearing a disk right now, so I reeeeeealy don't want to restart. Thanks.
Squid Posted January 8, 2017

Since you can telnet in, run diagnostics, and if that hangs:

cp /var/log/syslog.txt /boot/syslog.txt

and post the file when it happens again.
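Squid's fallback can be sketched as a tiny helper: if the diagnostics tool hangs, copy the raw syslog somewhere that survives a hard reset. The default paths below are the usual Unraid locations but are assumptions here; adjust them for your system.

```shell
#!/bin/sh
# Hedged sketch of the fallback described above: copy the live syslog to the
# flash drive so it survives a hard reset. Default paths are assumptions.
save_syslog() {
  src="${1:-/var/log/syslog}"
  dest="${2:-/boot/syslog.txt}"
  cp "$src" "$dest" || return 1
  echo "saved $src -> $dest"
}
# Usage on the server (via telnet/SSH):
#   diagnostics || save_syslog
```

The `diagnostics || save_syslog` pattern only falls back when the diagnostics command itself fails; if it hangs rather than fails, you would run save_syslog by hand from a second session.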
whiteatom Posted January 8, 2017

Diagnostics runs fine. Output attached. I looked through the Docker log; the only thing I see there is the "name conflict for CouchPotato" - that's from when I tried an upgrade of the container that started the problem today. I have tried removing some CA plugins (the auto-upgrade one for starters) to see if that changes things... no improvement.

Thanks for the help.

Update: I tried "docker exec -it CouchPotato /bin/bash" to see if I could get into the container. It does nothing for about 3 seconds and then just dumps me back to the host prompt. I tried the same command on a docker I haven't tried to stop and I get a root@<container-id> prompt immediately. If the terminal is not responding, it really looks like these containers are stuck at the tail end of a shutdown.

One other thing to note: I am running 2 other processes - a pre-clear, as already stated, and I am zeroing an old disk (still in the array) so I can remove it without having to rebuild parity (according to this process). This process always kinda screws with the system because one of the data drives is losing its file system. I have nothing mapped to that disk, only to user shares, and that drive is excluded from all shares, but I thought it was worth mentioning.

Cheers

knox-diagnostics-20170108-2021.zip
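The responsiveness check described above can be wrapped with a timeout so a hung container doesn't stall your shell the way the bare docker exec did. This is a hedged sketch, not an official tool; the container name in the comment is just an example.

```shell
#!/bin/sh
# Sketch of the quick check above: try a trivial command inside the container
# and give up after a few seconds. A healthy container returns immediately;
# one stuck mid-shutdown times out or errors out.
container_responsive() {
  name="$1"
  if timeout 5 docker exec "$name" true 2>/dev/null; then
    echo "$name: responsive"
  else
    echo "$name: stuck or stopped"
  fi
}
# e.g. container_responsive CouchPotato
```

Using `true` instead of an interactive `/bin/bash` avoids needing a TTY, so the check also works from scripts.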
Squid Posted January 9, 2017

Haven't seen that before. Best I can suggest would be to edit docker.cfg in the config folder on the flash drive and disable docker through it. Then from the command prompt do:

powerdown -r

which should be able to restart the server. After it comes back alive, delete the docker.img from the Docker tab and recreate it, then add the apps back in via CA and the Previous Apps section.
whiteatom Posted January 9, 2017

The powerdown command won't work either... the system is TOTALLY locked up by docker. Docker won't quit, the array won't stop, so nothing can be done. I set the docker config to no as you suggested, and I guess I have to go through another hard reboot (the pre-clear just finished). Before rebooting, I ran the diagnostics (attached), but here's the only indication of a problem I can find, from docker.log:

time="2017-01-09T13:44:12.367055840-03:30" level=info msg="Container 2d9e18b61a4dd529ab1924b36aa5312d43324fa5a0665b3984a95668a6f23d63 failed to exit within 10 seconds of SIGTERM - using the force"
time="2017-01-09T13:44:22.367326715-03:30" level=info msg="Container 2d9e18b61a4d failed to exit within 10 seconds of kill - trying direct SIGKILL"

Anyone else have any tips here?

knox-diagnostics-20170109-1411.zip
Squid Posted January 9, 2017 Share Posted January 9, 2017 you can try /etc/rc.d/rc.docker stop umount /var/lib/docker [code] and then see if the powerdown will work.... But I doubt either command will work properly Quote Link to comment
whiteatom Posted January 9, 2017

After I tried the powerdown -r, I looked in "ps aux" and /etc/rc.d/rc.docker stop had been running for 40 minutes before I got bored and hard reset the box. It's back up now without docker, so I'll try removing the img as you suggested.
Squid Posted January 9, 2017 Share Posted January 9, 2017 you can try /etc/rc.d/rc.docker stop umount /var/lib/docker [code] and then see if the powerdown will work.... But I doubt either command will work properly After I tried the powerdown -r, I looked in "ps aux" and the /etc/rc/d/rc.docker stop was running for 40mins before I got bored and hard reset the box. It's back up now without docker, so I'll try removing the img as you suggested. Like I said, I have no clue what went wrong on the update - never seen it before, and my apps update every week... Quote Link to comment
whiteatom Posted January 9, 2017

Ok, still no improvement. I renamed the docker.img to docker.old and started fresh. Added a new docker for Plex and it started fine - tested Plex and then tried to shut it down (clicked stop in the UI). The UI hung for about 20 seconds and then I got a pop-up on the screen reading "Execution error Error code". The docker.log shows the same thing:

time="2017-01-09T14:53:45.536132191-03:30" level=info msg="API listen on /var/run/docker.sock"
time="2017-01-09T15:08:08.694244741-03:30" level=info msg="Container c30d84dab0d26e609175a008cf49db034ea02b20979747192513d4801cfb5477 failed to exit within 10 seconds of SIGTERM - using the force"
time="2017-01-09T15:08:18.694564666-03:30" level=info msg="Container c30d84dab0d2 failed to exit within 10 seconds of kill - trying direct SIGKILL"

And the docker is still running, even though Plex has shut down. Any new ideas?

EDIT: eventually the docker did shut down (nothing listed in docker ps), but the logs don't show anything...
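The two log lines quoted above are dockerd's normal stop escalation: SIGTERM, a 10-second grace period, then SIGKILL. A quick grep pulls exactly these entries out of the daemon log; the log path default is the usual Unraid location and is an assumption here.

```shell
#!/bin/sh
# Pull the daemon's stop-escalation messages out of docker.log. The pattern
# matches the two entries quoted above; the default log path is an assumption.
find_stuck_stops() {
  log="${1:-/var/log/docker.log}"
  grep -E 'failed to exit within 10 seconds of (SIGTERM|kill)' "$log"
}
```

Seeing both the SIGTERM and the SIGKILL line for the same container ID means even SIGKILL had no immediate effect, which points at the kernel (blocked I/O) rather than the container's own shutdown handling.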
Squid Posted January 9, 2017 Share Posted January 9, 2017 Maybe post the whole diagnostics before you reboot. Might be something else going on Sent from my LG-D852 using Tapatalk Quote Link to comment
whiteatom Posted January 10, 2017

OK, moving a bunch of data now, but when it's done I'll run the diagnostics and reboot. Probably a hard reboot again... yikes! Something is screwy, because this has only happened since I went to dual parity. docker.img is not on the array, so I don't really see how it's related, but there's gotta be something holding up the docker processes.
whiteatom Posted January 10, 2017

Ok, an update for you all, because this is not a problem any more. It appears that when you are working the array hard and have a disk locked up, docker won't shut down cleanly - or at least the ones I have won't. Today I was tail -f'ing the docker.log from the earlier troubleshooting, and in the same second that my dd process (I'm zeroing an old disk for removal) ended, the docker log updated with the locked-up containers exiting and the updates I had requested processing. I'm not sure if this is a known behaviour, a limitation, or a bug, but it's pretty easy to work around now that I know. Someone else will probably run into it at some point, though.

whiteatom
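One way to confirm this diagnosis on a live system is to look for processes in uninterruptible sleep (ps state "D"): such tasks are waiting on disk I/O and cannot be killed, even by SIGKILL, which matches containers that refuse to die until a dd run finishes. This is a general Linux technique, not an Unraid-specific tool; the helper name is hypothetical.

```shell
#!/bin/sh
# Filter ps output down to processes in uninterruptible sleep (state D or
# D+ etc.), which are blocked on I/O and immune to signals until it completes.
list_dstate() {
  # Expects "STATE PID COMMAND" lines on stdin, e.g. from:
  #   ps -eo state,pid,comm
  awk '$1 ~ /^D/ { print $2, $3 }'
}
# e.g. ps -eo state,pid,comm | list_dstate
```

If a container's processes show up here while the array is being hammered, the stop timeout in docker.log is a symptom, not the cause.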
Mettbrot Posted October 2, 2017

Hi, I have a similar problem. For the past couple of weeks I haven't been able to stop the docker containers, leading to unRAID hanging on shutdown. I tried to update one of the dockers when I noticed the issue. After the download it said:

stopping docker: error
removing image: error
starting docker: unable to start because a docker with the same name is running

/etc/rc.d/rc.docker stop now runs forever and shutdown does not work. I had to force a shutdown several times, resulting in parity checks with up to 20 bad sectors. I recently removed the docker image file and reset everything, but now I have those errors again. Diagnostics are attached.

server-diagnostics-20171002-1503.zip
Slamer Posted April 26, 2018

Hi there, I'm experiencing the same issue. The docker image Nzbget is freezing and I can't stop it or remove it. This happened after adding a volume mapping:

-v /mnt/usr/Exchange/intermediate:/intermediate

as suggested by the install documentation from linuxserver/nzbget (https://hub.docker.com/r/linuxserver/nzbget/). I've tried using the command line but with no success:

docker stop <container id>

Diagnostics are attached: tower-diagnostics-20180426-1936.zip

Thanks for your help.
trurl Posted April 26, 2018

2 minutes ago, Slamer said: "This happened after adding a volume mapping -v /mnt/usr/Exchange/intermediate:/intermediate as suggested by the install documentation from linuxserver/nzbget"

I hope this is a typo, since /mnt/usr does not correspond to any actual storage and so would be a new folder created in RAM. All of the unRAID user shares are at /mnt/user.

Also looks like corruption in the cache pool:

Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Apr 26 10:37:14 Tower kernel: blk_partition_remap: fail for partition 1
Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
Apr 26 10:37:14 Tower kernel: blk_partition_remap: fail for partition 1
Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
Apr 26 10:37:14 Tower kernel: blk_partition_remap: fail for partition 1
Apr 26 10:37:14 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 2, rd 2, flush 0, corrupt 0, gen 0
JorgeB Posted April 26, 2018

Those are read/write errors on cache2:

Apr 26 10:37:14 Tower kernel: sd 1:0:0:0: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
Apr 26 10:37:14 Tower kernel: sd 1:0:0:0: [sdb] tag#0 CDB: opcode=0x28 28 00 00 60 08 a0 00 00 20 00
Apr 26 10:37:14 Tower kernel: print_req_error: I/O error, dev sdb, sector 6293664

Check the cables, then run a correcting scrub.
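JorgeB's suggestion as commands: a scrub reads every block and, on a redundant btrfs pool, rewrites bad copies from the good mirror. The `/mnt/cache` mount point is the usual Unraid pool location but is an assumption here, and the helper name is hypothetical.

```shell
#!/bin/sh
# Hedged sketch of a correcting scrub on the cache pool. -B keeps the scrub
# in the foreground so the summary prints when it finishes, and the device
# stats afterwards show whether the error counters are still climbing.
run_scrub() {
  pool="${1:-/mnt/cache}"
  btrfs scrub start -B "$pool" && btrfs dev stats "$pool"
}
```

If the stats counters keep increasing after a scrub and a cable swap, the drive itself is the likely suspect.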
Slamer Posted April 27, 2018

Hi, I finally succeeded in restarting the server. After that I started a parity check and so far have 535048090 corrected sync errors after 18 hours. The estimated end time is in 23 hours! Is it normal to get that many errors from a parity check? Could the issue be coming from the HDDs? What's your recommendation for investigating? For information, I'm using:

2 SSDs: 250GB (eSATA) + 500GB (USB 3.0) for cache
2 HDs: 2x3TB (eSATA) for parity
6 HDs: 2x3TB (eSATA) + 4x2TB (USB 3.0) for data

Thanks a lot.
trurl Posted April 27, 2018

2 minutes ago, Slamer said: "Is it normal to get that many errors from a parity check?"

The only acceptable number of sync errors is exactly zero. After you correct those sync errors, do another parity check to make sure you have zero sync errors.
OFark Posted May 17, 2020

I just had this problem with Plex. I couldn't terminate it, the web page just showed a 503, and console access was instantly disconnected. How I fixed it was to go to the Docker tab, enable advanced mode, and force update it. Voila, up and running again.