(Solved) ata1.00 exception, failed command in syslog



Hi all,

I've recently started having issues with my server. It seemed to lock up completely every day or so. When this happened I could not reach it via SSH or the web GUI, and Plex stopped responding too.

 

At first I had to hard reset the server, but then I started paying more attention to the syslog, and later I also made sure the syslog was written to a file (using: tail -f /var/log/syslog > /mnt/user/data/syslog.txt). Looking through the logs, I found these errors, which seemed to occur whenever a lot of data was being written to the HDDs:

Jun 17 21:35:03 Tesla kernel: ata1.00: exception Emask 0x50 SAct 0x1000 SErr 0x280900 action 0x6 frozen
Jun 17 21:35:03 Tesla kernel: ata1.00: irq_stat 0x08000000, interface fatal error
Jun 17 21:35:03 Tesla kernel: ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
Jun 17 21:35:03 Tesla kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 17 21:35:03 Tesla kernel: ata1.00: cmd 60/48:60:d0:c2:8e/00:00:0c:00:00/40 tag 12 ncq dma 36864 in
Jun 17 21:35:03 Tesla kernel:         res 40/00:64:d0:c2:8e/00:00:0c:00:00/40 Emask 0x50 (ATA bus error)
Jun 17 21:35:03 Tesla kernel: ata1.00: status: { DRDY }
Jun 17 21:35:03 Tesla kernel: ata1: hard resetting link
Jun 17 21:35:03 Tesla kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 17 21:35:03 Tesla kernel: ata1.00: configured for UDMA/133
Jun 17 21:35:03 Tesla kernel: ata1: EH complete

Sometimes there are mere seconds between these errors, and sometimes it's fine for 5 minutes. What really sets it off, I've found, is backing up all docker data and then invoking the mover.
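To put numbers on how often the errors occur, a small script can pull the exception timestamps out of the saved syslog and report the gaps between them. This is only a rough Python sketch: the function name exception_gaps, the regular expression, and the year parameter (syslog timestamps omit the year) are my own assumptions, and the second and third sample lines reuse the real line with made-up timestamps purely for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Jun 17 21:35:03 Tesla kernel: ata1.00: exception Emask 0x50 ...
EXCEPTION_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+(ata\d+\.\d+): exception"
)


def exception_gaps(lines, device="ata1.00", year=2017):
    """Return the seconds between consecutive exceptions for one ATA device.

    `year` is an assumption supplied by the caller, because the syslog
    timestamps do not include one.
    """
    times = []
    for line in lines:
        match = EXCEPTION_RE.match(line)
        if match and match.group(2) == device:
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# First line is copied from the syslog above; the other two only change the
# timestamp and are illustrative, not real log entries.
REAL_LINE = ("Jun 17 21:35:03 Tesla kernel: ata1.00: exception Emask 0x50 "
             "SAct 0x1000 SErr 0x280900 action 0x6 frozen")
SAMPLE = [
    REAL_LINE,
    REAL_LINE.replace("21:35:03", "21:35:07"),
    REAL_LINE.replace("21:35:03", "21:40:07"),
]

print(exception_gaps(SAMPLE))
```

To run it against the real log, pass the lines of the saved file instead of SAMPLE, for example exception_gaps(open("/mnt/user/data/syslog.txt")).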

 

What I mostly want to know is which drive ata1.00 corresponds to. While writing this, I looked through the diagnostics and found this at the beginning of the syslog:

Jun 17 17:30:17 Tesla kernel: ata1.00: ATA-8: SanDisk SD6SF1M128G, 133287403247, X231200, max UDMA/133

So it seems to be one of my SSDs, not an HDD as I had assumed. I've already tried replacing my data drive and was going to try replacing the cache, but it looks like I'll have to replace this SSD instead. I have already tried different SATA cables, and my PSU is very new and probably a bit overkill for the system, so I doubt either of those is the problem.

 

For now I'll have to rip out the SSD (and just run on one for the time being) and see if that fixes the problem. I'll post any updates here when I have tested more.

 

My diagnostics: tesla-diagnostics-20170617-2200.zip

Edited by Luca_Scorpion
Add (Solved) tag

Update: I pulled out the SSD, started the server, and started a backup to check whether the errors would be gone. I noticed, though, that the remaining SSD seemed to be rebalancing blocks, which made the backup really slow. At that point I saw that the SSD was still assigned to slot 2, which (as I read) could cause that. So I tried to stop the array so I could reassign it. However, it did not like that. The system completely locked up and I ended up having to power down via the CLI.

 

After restarting the system I found that my SSD had become unmountable, which sucks, but since I have backups of everything I'll simply reformat it and restore a backup.

 

Lessons learned:

- To prevent these kinds of problems, don't set your array to autostart.

- Back up your docker configs, domains, etc.

- Don't be impatient. This whole problem wouldn't have existed if I hadn't tried to rush the backup and had let it finish rebalancing the blocks. Though I don't know how long something like that generally takes.


Update 2: I've restored all data, created a backup and invoked the mover again. It's been running for about 15 minutes now without any problems, so it looks like it was in fact the SSD that screwed things up.

 

Update 2.5: The mover just finished, with no more errors showing up in the syslog. I'll probably leave the syslog writing to a file for a few days, but I don't expect any further problems.

 

Steps to diagnose which disk is giving off the errors:

- Find the disk's ATA number in the error messages in the syslog (viewable in the web GUI by clicking the log button at the top right). It should look something like ata1.00.

- Go to Tools > System Log

- Use Ctrl+F to search for the ATA number ("ata1.00" for example) and look at the first hits. One of these will state the disk name, as in my case:

Jun 17 17:30:17 Tesla kernel: ata1.00: ATA-8: SanDisk SD6SF1M128G, 133287403247, X231200, max UDMA/133

That's it! You've found the source of your problems.
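The same lookup can be scripted rather than done with Ctrl+F. Below is a minimal Python sketch, assuming the identification line always has the shape shown above ("ataN.MM: ATA-x: model, ..."). The function name find_disk and the regular expression are my own, and I only treat the first comma-separated field as the model because I'm not certain what the remaining fields mean.

```python
import re

# Matches the identification line the kernel logs when it detects a disk, e.g.
#   ... kernel: ata1.00: ATA-8: SanDisk SD6SF1M128G, 133287403247, X231200, max UDMA/133
IDENT_RE = re.compile(r"kernel:\s+(ata\d+\.\d+): ATA-\d+: (.+)$")


def find_disk(lines, device):
    """Return details of the first identification line for `device`, or None."""
    for line in lines:
        match = IDENT_RE.search(line)
        if match and match.group(1) == device:
            description = match.group(2).strip()
            return {
                "device": device,
                # Assumption: the first comma-separated field is the model name.
                "model": description.split(",")[0].strip(),
                "description": description,
            }
    return None


# Identification line copied from the syslog quoted in this thread.
SAMPLE_LINE = ("Jun 17 17:30:17 Tesla kernel: ata1.00: ATA-8: SanDisk SD6SF1M128G, "
               "133287403247, X231200, max UDMA/133")

print(find_disk([SAMPLE_LINE], "ata1.00"))
```

For a real log, pass the lines of the syslog file (or the saved copy) in place of the one-line sample list.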

Edited by Luca_Scorpion
Add guide to find disk from ata number