Marvel Issues: Starting Point For Investigation

Squid · July 30, 2017

Let me preface this by stating that I am one of the users who has absolutely zero problems with Marvel controllers. I run VMs with passthrough, I have never had a drive randomly drop offline, and I've never suffered from any corruption on the drives connected to the controllers.

I have always maintained that for anyone who suffers from the above problems, either a specific hardware combination is causing it or they have sniffed too much glue, while other users maintain that it is a driver issue and that Marvel controllers should be avoided like the plague.

Yesterday, after thinking about the problems a bit more and recalling other problems from a couple of years ago, I successfully managed to recreate at will one of the symptoms (and possibly two) that users with these controllers see, and have narrowed it down to a specific piece of hardware.

Back History

At the start of the v6 series, a number of users complained quite vocally about significant parity check slowdowns when using a Supermicro SAS2LP.

Once again, I did not suffer from this problem. (Because any drives I had that would have caused this issue were not installed on the SAS2LP)

TLDR: Users suffering from slowdowns when using the SAS2LP had one or more drives connected to the HBA that report their ATA version as ATA8-ACS (you can see the ATA version by looking at the Identity Tab when you click on the drive from Main). @limetech @eschultz @jonp introduced a tunable into the system (nr_requests) to fix this. However, Tom didn't particularly like that solution, so he tweaked some driver code (https://forums.lime-technology.com/topic/40944-partially-solved-is-there-an-effort-to-solve-the-sas2lp-issue-tom-question/?page=16#comment-414289) to solve the problem without resorting to having the user change the tunable.

This worked, and solved the parity check slowdowns for affected users.

Today's Problems

Coinciding with the fix for the slowdowns being introduced into the system, new problems started to appear with Marvel controllers that no one related back to the original slowdown issues:

Recurring 5 parity errors being corrected with every correcting parity check
Drives randomly dropping offline
Corruption randomly occurring on drives
With IOMMU / AMD-Vi enabled, the above problems could get worse.

I have managed to replicate the recurring 5 parity errors on my secondary server at will by rearranging some hardware.

My secondary server under normal circumstances has its 3TB hard drives connected to the motherboard (a holdover from when that server was utilizing a Br10i controller). But by placing its 3TB drives onto the SAS2LP now installed in it, from a fresh power on (clean shutdown), I get this:

Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565768
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565776
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565784
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565792
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565800

A subsequent parity check turns up zero errors. Perform a clean powerdown, restart the computer, and a new correcting parity check shows this:

Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565768
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565776
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565784
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565792
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565800

Note that the 5 errors are on the exact same sectors. The recurring 5 parity check errors only happen after restarts. If the 5 errors are corrected, then subsequent parity checks are clean so long as the system has not been reset.

Once I rearrange the drives back to their original controllers, the 5 parity check errors on clean starts are gone forever.
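For anyone who wants to check their own syslogs for the same pattern, here is a minimal Python sketch of the comparison. The regular expression is an assumption based only on the log lines quoted above, and the function names are hypothetical. It simply intersects the sectors corrected in two separate parity checks and reports the spacing between them.

```python
import re

# Pattern is an assumption derived from the "md: recovery thread" lines
# quoted in this post; other unRAID versions may log differently.
CORRECTED = re.compile(r"md: recovery thread: P corrected, sector=(\d+)")

JUL28 = """\
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565768
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565776
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565784
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565792
Jul 28 12:03:45 Server_B kernel: md: recovery thread: P corrected, sector=1565565800
"""

JUL29 = """\
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565768
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565776
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565784
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565792
Jul 29 18:05:02 Server_B kernel: md: recovery thread: P corrected, sector=1565565800
"""


def corrected_sectors(log_text):
    """Return the set of sector numbers reported as corrected in a syslog excerpt."""
    return {int(m.group(1)) for m in CORRECTED.finditer(log_text)}


def recurring_sectors(first_log, second_log):
    """Return the sorted sectors that were corrected in both parity checks."""
    return sorted(corrected_sectors(first_log) & corrected_sectors(second_log))


def sector_strides(sectors):
    """Return the set of gaps between consecutive sectors in a sorted list."""
    return {b - a for a, b in zip(sectors, sectors[1:])}


if __name__ == "__main__":
    common = recurring_sectors(JUL28, JUL29)
    print(len(common), "recurring sectors:", common)
    print("gaps between them (in sectors):", sector_strides(common))
```

On the two excerpts above, all 5 sectors recur and sit 8 sectors apart, which at 512 bytes per logical sector is one 4096-byte block each. What that spacing means for the root cause is not something this sketch can tell you.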

The drives that I've managed to replicate this on are these:

ST3000DM001-1CH166 and ST3000DM001-1CH166 (both installed simultaneously to the SAS2LP)

From the Identity Tab:

Model family:           Seagate Barracuda 7200.14 (AF)
Device model:           ST3000DM001-1CH166
Serial number:          Z1F1Q0L2
LU WWN device id:       5 000c50 04f033b3a
Firmware version:       CC24
User capacity:          3,000,592,982,016 bytes [3.00 TB]
Sector sizes:           512 bytes logical, 4096 bytes physical
Rotation rate:          7200 rpm
Form factor:            3.5 inches
Device:                 In smartctl database [for details use: -P show]
ATA version:            ATA8-ACS T13/1699-D revision 4
SATA version:           SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local time:             Sun Jul 30 10:20:50 2017 EDT
SMART support:          Available - device has SMART capability.
SMART support:          Enabled
SMART overall-health:   Passed

Note that the ATA Version is ATA8-ACS

However, the ATA version in and of itself is not the cause, as I have other drives utilizing that version, but they are all less than 3TB.

It should also be noted that Seagate themselves have updated the ST3000DM001's to not utilize ATA8-ACS. I have other ST3000DM001's in my primary server that do not use that interface:

Model family:           Seagate Barracuda 7200.14 (AF)
Device model:           ST3000DM001-1CH166
Serial number:          Z1F33KPN
LU WWN device id:       5 000c50 050a62f10
Firmware version:       CC27
User capacity:          3,000,592,982,016 bytes [3.00 TB]
Sector sizes:           512 bytes logical, 4096 bytes physical
Rotation rate:          7200 rpm
Form factor:            3.5 inches
Device:                 In smartctl database [for details use: -P show]
ATA version:            ACS-2, ACS-3 T13/2161-D revision 3b
SATA version:           SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local time:             Sun Jul 30 10:27:19 2017 EDT
SMART support:          Available - device has SMART capability.
SMART support:          Enabled
SMART overall-health:   Passed

Assumptions (And this is an assumption - almost a leap of faith)

While these results aren't exactly scientific, it is a decent starting point for trying to figure out a fix / why certain people are affected.

Based upon my results above, in conjunction with the fact that my primary server was originally running the SAS2LP with zero problems (no ATA8-ACS drives connected to it), the initial assumptions would be:

Drives 3TB+ that utilize ATA8-ACS, when connected to a Marvel Controller, will give you the instability problems that some users have.
Seagate ST3000DM001's that utilize ATA8-ACS may be able to have their firmware upgraded to remove that interface from the drive (the firmware versions above do differ).
If you suffer from problems with Marvel Controllers, removing any drives (especially 3TB+) that utilize ATA8-ACS from the controller and instead placing them on the motherboard may solve your problems.
The problems some users have with Marvel Controllers may or may not have been introduced by the code changes made by Limetech to solve the parity check slowdown issues.
If you do not have any ATA8-ACS drives connected to a Marvel Controller, you will not have any issues at all.
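To make the first assumption easy to check against your own drives, here is a rough Python sketch that reads the identity text of a drive and flags it when it reports ATA8-ACS and is 3TB or larger. The function names, the 3TB threshold and the abbreviated sample blocks are illustrative only, and the rule it encodes is the unproven assumption above, not a confirmed diagnosis.

```python
import re

# Abbreviated from the two identity blocks above: only the lines the check needs.
CC24_DRIVE = """\
Device model:           ST3000DM001-1CH166
Firmware version:       CC24
User capacity:          3,000,592,982,016 bytes [3.00 TB]
ATA version:            ATA8-ACS T13/1699-D revision 4
"""

CC27_DRIVE = """\
Device model:           ST3000DM001-1CH166
Firmware version:       CC27
User capacity:          3,000,592,982,016 bytes [3.00 TB]
ATA version:            ACS-2, ACS-3 T13/2161-D revision 3b
"""


def parse_identity(text):
    """Parse 'Key: value' lines into a dict with lower-cased keys."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue
        key = key.strip().lower()
        # Raw smartctl output uses "ATA Version is:"; normalise to "ata version".
        if key.endswith(" is"):
            key = key[:-3]
        info[key] = value.strip()
    return info


def is_suspect(text, min_bytes=3 * 10**12):
    """True if the drive reports ATA8-ACS and is at least min_bytes in size.

    This encodes only the hypothesis from this post (ATA8-ACS and 3TB+).
    """
    info = parse_identity(text)
    match = re.match(r"([\d,]+)\s+bytes", info.get("user capacity", ""))
    capacity = int(match.group(1).replace(",", "")) if match else 0
    return info.get("ata version", "").startswith("ATA8-ACS") and capacity >= min_bytes


if __name__ == "__main__":
    for name, block in (("CC24", CC24_DRIVE), ("CC27", CC27_DRIVE)):
        print(name, "suspect" if is_suspect(block) else "not flagged")
```

With the two samples, the CC24 drive is flagged and the CC27 drive is not. A drive that is not flagged is not proven safe; it only means it falls outside the combination I was able to reproduce.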

While this isn't the end-all-be-all diagnosis of the issues (I do have better things to do than run parity check after parity check after parity check), it does at least somewhat prove that it is certain hardware combinations (in this case the drives themselves) that are causing the issues, and how to possibly work around them without having to invest any money in an expensive LSI controller card.

And if anyone is going to use this as a chance to bash Seagate (the best hard drives in the world), that is very premature. I also have Hitachi drives (albeit < 3TB) that utilize ATA8-ACS, and it appears that only early ST3000DM001's used ATA8-ACS and that a firmware update to the drives may also fix the problem (not tested).

Edited July 30, 2017 by Squid

JorgeB · July 30, 2017

Interesting findings, though I'm not sure that the repeatable parity errors and disks dropping offline are necessarily related. I just checked a couple of old threads from users with dropped disks, one user with the SASLP and another with the SAS2LP, and in both cases the dropped disks weren't ATA8-ACS. Still worth investigating, but I maintain my recommendation: replace any SASLP/SAS2LP with an LSI controller, because IMO they are a ticking time bomb.

Squid · July 30, 2017

5 minutes ago, johnnie.black said:

Interesting findings, though I'm not sure that the repeatable parity errors and disks dropping offline are necessarily related. I just checked a couple of old threads from users with dropped disks, one user with the SASLP and another with the SAS2LP, and in both cases the dropped disks weren't ATA8-ACS. Still worth investigating, but I maintain my recommendation: replace any SASLP/SAS2LP with an LSI controller, because IMO they are a ticking time bomb.

Understand completely. Hence why I labelled the topic starting point for investigation. To my knowledge, I'm the first who was able to replicate some of the issues and point the finger at a certain piece of hardware. Not going to set up one of my production servers such that I think it might fail so that I can continue an investigation.

But with a possible culprit found, then people with more time on their hands can truly begin to find what the root cause is.

JorgeB · July 30, 2017

Also worth noting that those repeatable parity errors result in data corruption, according to the tests done by S80_UK, so if users get those they should really get rid of them.

Edited July 30, 2017 by johnnie.black

Vr2Io · July 30, 2017

I have a different view of this issue. My Marvell (9215) add-on card got SATA interface errors. The problem was easy to reproduce during disk writes (it should not happen on reads), and it happened only with some WD / Seagate drives (they are all 3TB). The same system has no problem when using an LSI / ASMedia controller.

But I have a QNAP (also 9215 x2, 8 bays) running unRAID with those same problem disks, and it never had such a problem. QNAP's products use a lot of Marvell SATA controllers.

I tried to get the Marvell firmware update from ASRock, but the link seems broken. By the way, I am not using Marvell now.

Edited July 30, 2017 by Benson

Squid · July 30, 2017

4 minutes ago, Benson said:

and it happened only with some WD / Seagate drives (they are all 3TB).

And that's exactly my point. It would be very interesting for you to post your diagnostics (even though you aren't using a Marvel anymore) to see if those drive(s) utilize ATA8-ACS. That's also why I'm suggesting that it's entirely possible that @limetech's fix for one issue inadvertently created another one.

Vr2Io · July 30, 2017

FYR, the Toshiba actually had the same problem too, but it never happened under the QNAP with unRAID.

Model family:	Western Digital Green
Device model:	WDC WD30EZRX-00DC0B0
Serial number:	WD-WMC1T3342970
LU WWN device id:	5 0014ee 058df2227
Firmware version:	80.00A80
User capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector sizes:	512 bytes logical, 4096 bytes physical
Device:	In smartctl database [for details use: -P show]
ATA version:	ACS-2 (minor revision not indicated)
SATA version:	SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local time:	Mon Jul 31 00:37:27 2017 CST

194	Temperature celsius	0x0022	122	102	Old age	Always	Never	28
196	Reallocated event count	0x0032	200	200	Old age	Always	Never	0
197	Current pending sector	0x0032	200	200	Old age	Always	Never	0
198	Offline uncorrectable	0x0030	200	200	Old age	Offline	Never	0
199	UDMA CRC error count	0x0032	200	200	Old age	Always	Never	14
200	Multi zone error rate	0x0008	200	200	Old age	Offline	Never	0

Model family:	Seagate Barracuda 7200.14 (AF)
Device model:	ST3000DM001-1CH166
Serial number:	W1F2VKGA
LU WWN device id:	5 000c50 060948b00
Firmware version:	CC24
User capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector sizes:	512 bytes logical, 4096 bytes physical
Rotation rate:	7200 rpm
Form factor:	3.5 inches
Device:	In smartctl database [for details use: -P show]
ATA version:	ATA8-ACS T13/1699-D revision 4
SATA version:	SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local time:	Mon Jul 31 00:35:15 2017 CST

194	Temperature celsius	0x0022	029	084	Old age	Always	Never	29 (0 11 0 0 0)
197	Current pending sector	0x0012	100	100	Old age	Always	Never	0
198	Offline uncorrectable	0x0010	100	100	Old age	Offline	Never	0
199	UDMA CRC error count	0x003e	200	200	Old age	Always	Never	14
240	Head flying hours	0x0000	100	253	Old age	Offline	Never	3081h+49m+35.636s
241	Total lbas written	0x0000	100	253	Old age	Offline	Never	93801331818
242	Total lbas read	0x0000	100	253	Old age	Offline	Never	260885704609

Model family:	Western Digital Green
Device model:	WDC WD30EZRX-00DC0B0
Serial number:	WD-WMC1T2755139
LU WWN device id:	5 0014ee 603307d59
Firmware version:	80.00A80
User capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector sizes:	512 bytes logical, 4096 bytes physical
Device:	In smartctl database [for details use: -P show]
ATA version:	ACS-2 (minor revision not indicated)
SATA version:	SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local time:	Mon Jul 31 00:41:24 2017 CST

194	Temperature celsius	0x0022	118	105	Old age	Always	Never	32
196	Reallocated event count	0x0032	200	200	Old age	Always	Never	0
197	Current pending sector	0x0032	200	200	Old age	Always	Never	0
198	Offline uncorrectable	0x0030	200	200	Old age	Offline	Never	0
199	UDMA CRC error count	0x0032	200	200	Old age	Always	Never	4
200	Multi zone error rate	0x0008	200	200	Old age	Offline	Never	0

Model family:	Toshiba 3.5" DT01ACA... Desktop HDD
Device model:	TOSHIBA DT01ACA300
Serial number:	25EEEPXGS
LU WWN device id:	5 000039 ff4f062ca
Firmware version:	MX6OABB0
User capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector sizes:	512 bytes logical, 4096 bytes physical
Rotation rate:	7200 rpm
Form factor:	3.5 inches
Device:	In smartctl database [for details use: -P show]
ATA version:	ATA8-ACS T13/1699-D revision 4
SATA version:	SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local time:	Mon Jul 31 00:43:56 2017 CST

194	Temperature celsius	0x0002	187	187	Old age	Always	Never	32 (min/max 13/47)
196	Reallocated event count	0x0032	100	100	Old age	Always	Never	0
197	Current pending sector	0x0022	100	100	Old age	Always	Never	0
198	Offline uncorrectable	0x0008	100	100	Old age	Offline	Never	0
199	UDMA CRC error count	0x000a	200	200	Old age	Always	Never	26

---------------------------------------------------------------

Another Toshiba disk, with no errors on counter 199

Model family:	Toshiba 3.5" DT01ACA... Desktop HDD
Device model:	TOSHIBA DT01ACA300
Serial number:	66T1SL8AS
LU WWN device id:	5 000039 fe6c0ccec
Firmware version:	MX6OABB0
User capacity:	3,000,592,982,016 bytes [3.00 TB]
Sector sizes:	512 bytes logical, 4096 bytes physical
Rotation rate:	7200 rpm
Form factor:	3.5 inches
Device:	In smartctl database [for details use: -P show]
ATA version:	ATA8-ACS T13/1699-D revision 4
SATA version:	SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local time:	Mon Jul 31 00:46:37 2017 CST

194	Temperature celsius	0x0002	187	187	Old age	Always	Never	32 (min/max 20/48)
196	Reallocated event count	0x0032	100	100	Old age	Always	Never	0
197	Current pending sector	0x0022	100	100	Old age	Always	Never	0
198	Offline uncorrectable	0x0008	100	100	Old age	Offline	Never	0
199	UDMA CRC error count	0x000a	200	200	Old age	Always	Never	0

Vr2Io · July 30, 2017

All PCIe devices should be under a PLX PCIe bridge, as the J1800 only has 4 PCIe lanes in total.

PCI Devices and IOMMU Groups

[8086:0f00] 00:00.0 Host bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series SoC Transaction Register (rev 0e)
[8086:0f31] 00:02.0 VGA compatible controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Graphics & Display (rev 0e)
[8086:0f15] 00:11.0 SD Host controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series SDIO Controller (rev 0e)
[8086:0f16] 00:12.0 SD Host controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series SDIO Controller (rev 0e)
[8086:0f23] 00:13.0 SATA controller: Intel Corporation Atom Processor E3800 Series SATA AHCI Controller (rev 0e)
[8086:0f35] 00:14.0 USB controller: Intel Corporation Atom Processor Z36xxx/Z37xxx, Celeron N2000 Series USB xHCI (rev 0e)
[8086:0f50] 00:17.0 SD Host controller: Intel Corporation Atom Processor E3800 Series eMMC 4.5 Controller (rev 0e)
[8086:0f40] 00:18.0 DMA controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 DMA Controller (rev 0e)
[8086:0f41] 00:18.1 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 I2C Controller #1 (rev 0e)
[8086:0f42] 00:18.2 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 I2C Controller #2 (rev 0e)
[8086:0f43] 00:18.3 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 I2C Controller #3 (rev 0e)
[8086:0f44] 00:18.4 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 I2C Controller #4 (rev 0e)
[8086:0f45] 00:18.5 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 I2C Controller #5 (rev 0e)
[8086:0f46] 00:18.6 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 I2C Controller #6 (rev 0e)
[8086:0f47] 00:18.7 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO2 I2C Controller #7 (rev 0e)
[8086:0f18] 00:1a.0 Encryption controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Trusted Execution Engine (rev 0e)
[8086:0f04] 00:1b.0 Audio device: Intel Corporation Atom Processor Z36xxx/Z37xxx Series High Definition Audio Controller (rev 0e)
[8086:0f48] 00:1c.0 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 1 (rev 0e)
[8086:0f4a] 00:1c.1 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 2 (rev 0e)
[8086:0f4c] 00:1c.2 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 3 (rev 0e)
[8086:0f4e] 00:1c.3 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 4 (rev 0e)
[8086:0f06] 00:1e.0 DMA controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO1 DMA Controller (rev 0e)
[8086:0f08] 00:1e.1 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO1 PWM Controller (rev 0e)
[8086:0f09] 00:1e.2 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO1 PWM Controller (rev 0e)
[8086:0f0a] 00:1e.3 Communication controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO1 HSUART Controller #1 (rev 0e)
[8086:0f0c] 00:1e.4 Communication controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO1 HSUART Controller #2 (rev 0e)
[8086:0f0e] 00:1e.5 Serial bus controller [0c80]: Intel Corporation Atom Processor Z36xxx/Z37xxx Series LPIO1 SPI Controller (rev 0e)
[8086:0f1c] 00:1f.0 ISA bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Power Control Unit (rev 0e)
[8086:0f12] 00:1f.3 SMBus: Intel Corporation Atom Processor E3800 Series SMBus Controller (rev 0e)
[1b4b:9215] 01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11)
[1b4b:9215] 02:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11)
[10b5:8603] 03:00.0 PCI bridge: PLX Technology, Inc. PEX 8603 3-lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ab)
[10b5:8603] 04:01.0 PCI bridge: PLX Technology, Inc. PEX 8603 3-lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ab)
[10b5:8603] 04:02.0 PCI bridge: PLX Technology, Inc. PEX 8603 3-lane, 3-Port PCI Express Gen 2 (5.0 GT/s) Switch (rev ab)
[8086:1533] 05:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
[8086:1533] 06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)

Edited July 30, 2017 by Benson

BobPhoenix · July 31, 2017

Here is the SMART report for a drive I had problems with on my MB Marvel 9230 controller. I had the Marvel controller passed through to a WHSv1 VM, and it would drop this or one of the other 3 identical drives all the time. I had to reboot the server to get the controller back. That is why I first got an LSI 9201-16i controller: that way I could pass one of the other MB controllers through to my WHS v1 VM. I see the drive isn't exactly in the best shape, but I am only recording local news on it now, so it's not terribly important to me, and the last ATA error was at ~1/3 its current age.

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.9.30-unRAID] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC300xxxxxxx
LU WWN Device Id: 5 0014ee 6adbc128d
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jul 30 20:33:13 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(25440) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 257) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x70bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       398
  3 Spin_Up_Time            0x0027   165   163   021    Pre-fail  Always       -       4741
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1441
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   052   052   000    Old_age   Always       -       35132
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       199
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       135
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1305
194 Temperature_Celsius     0x0022   121   106   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 38788 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 38788 occurred at disk power-on lifetime: 10288 hours (428 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 0c 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 0c 00 00 00 00 00      21:01:08.072  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:08.072  CHECK POWER MODE
  ec 00 00 00 00 00 00 00      21:01:08.072  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      21:01:07.822  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.822  CHECK POWER MODE

Error 38787 occurred at disk power-on lifetime: 10288 hours (428 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 0c 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 0c 00 00 00 00 00      21:01:07.822  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.822  CHECK POWER MODE
  ec 00 00 00 00 00 00 00      21:01:07.822  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      21:01:07.573  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.573  CHECK POWER MODE

Error 38786 occurred at disk power-on lifetime: 10288 hours (428 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 0c 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 0c 00 00 00 00 00      21:01:07.573  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.573  CHECK POWER MODE
  ec 00 00 00 00 00 00 00      21:01:07.573  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      21:01:07.323  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.323  CHECK POWER MODE

Error 38785 occurred at disk power-on lifetime: 10288 hours (428 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 0c 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 0c 00 00 00 00 00      21:01:07.323  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.323  CHECK POWER MODE
  ec 00 00 00 00 00 00 00      21:01:07.322  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      21:01:07.074  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.074  CHECK POWER MODE

Error 38784 occurred at disk power-on lifetime: 10288 hours (428 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 0c 00 00 00 00  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 0c 00 00 00 00 00      21:01:07.074  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.074  CHECK POWER MODE
  ec 00 00 00 00 00 00 00      21:01:07.073  IDENTIFY DEVICE
  ef 03 0c 00 00 00 00 00      21:01:07.073  SET FEATURES [Set transfer mode]
  e5 00 00 00 00 00 00 00      21:01:07.073  CHECK POWER MODE

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Edited July 31, 2017 by BobPhoenix

Squid · July 31, 2017

It's possible that simply having an ATA8-ACS drive installed would cause another drive to drop. We don't know, just like we don't know in my tests what the origin of the 5 parity check errors was.

HellDiverUK · July 31, 2017

I've never had any problems with the SAS2LP cards on unRAID. I tend to use WD drives, though. The Seagates I have are 6TB Ironwolfs, 8TB Archives and the venerable 4TB Desktop ST4000DM000.

srfnmnk · October 8, 2017

Thanks for the analysis @Squid. I am suffering from this issue too. I wanted to let you know that during a parity check I had the same issue with an ACS-2, ACS-3 T13/2161-D revision 3b drive. The drive dropped off just like the ATA8-ACS did. Also, I have a parity disk that is a Toshiba using ATA8-ACS that also dropped off.

HellDiverUK · October 10, 2017

Marvel issues? I blame Iron Man.

Marvell issues on the other hand are probably the binary blobs or bad firmware versions. For example, I have issues with an elcheapo Marvell card, yet the identical chipset soldered to a Supermicro board has no issues at all.
