5.0.5 System Shut Itself Down (Apparently Clean); Strange Behavior Upon Reboot (Possible 2Disk Failure)


Recommended Posts

System: Supermicro - X8SIL
CPU: Intel® Core™ i3 CPU 540 @ 3.07GHz
Cache: 128 kB, 512 kB, 4096 kB
Memory: 32768 MB (max. 16 GB)
Network: eth0: 1000Mb/s - Full Duplex
 
I was planning on upgrading my 19-array 1-cache 1-parity-drive tower from 5.0.5 (last upgrade I'd made) to a current version when I replaced the cache drive (which had started running hot late last year), but life got in the way on both fronts.  I hope that delay doesn't prove fatal today!
 
I've been away from home for about a day, and moved about 10gb of files from one network location to my now-strange Tower's cache drive, where they moved into their destination array locations seemingly without a hitch.  A few minutes after confirming the files were no longer in my cache directory, I got an SMB disconnect error from windows.  Went to check the UnMenu GUI, and that wouldn't load; tried the UnRaid GUI next, and it loaded but said the array was offline.  I went straight for the system log button, but the GUI locked and nothing loaded after the log window popped up; within a few minutes, the Tower beeped and what appeared to be a clean shutdown had completed.
 
I started the system back up (log attached) and UnMenu started displaying odd information right away.  It showed 8, then 6, "protected array drives" up top, and the remainder of my drives down below in the "Not In Protected Array" section.  These remaining drives were each listed with their /dev/sd* name, but each one also had a /dev/sd*1 dupe beneath it.  Over the course of a few minutes, those "Not in Protected Array" drives shifted up to the top section, leaving only the cache drive not protected (as expected), but keeping the /dev/sdf & /dev/sdf1 duplicate listing for that cache drive.
 
During this entire drive-shifting process over on UnMenu, UnRaid GUI seemed totally normal, like all was well (a few temperature shift notifications notwithstanding).  I can understand just enough of the log to be dangerous, and it sure reads like something not-normal is going on here, but I'll be damned if I can figure out what.  As of right now, I don't want to touch a thing until I have a better idea of what is happening, so I figured I'd post immediately and see if anyone has any ideas of how best (and most safely) to proceed.
 
Thanks a ton to anyone for help or guidance if possible!
 
ETA: I have *not* yet tried restarting the array since the reboot, despite "Parity Is Valid" and "Configuration Valid" assurances.
 

syslog-2017-05-19.txt

Edited by wheel
Link to comment

Thanks for checking and responding so quickly!  I'll give booting the array a shot, mess around for a little while to make sure nothing strange pops up in GUIs or SMB, and if all seems well, go for a power down and replacement of the cache drive.  Anything funky comes up, I'll be back here with logs.

 

Thank you again!

Link to comment

So yesterday got pretty busy and the cache drive swap didn't happen, but nothing strange popped up on UnMenu or UnRaid GUI either; stayed out for the night, and just got back home.

 

Everything looks fine on UnMenu, but I notice the syslog summary seems sparse.  Went to check the entire thing, and this is all it says:

 

Display syslog /var/log/syslog        Click Here to Download Complete /var/log/syslog

May 20 04:40:01 Tower syslogd 1.4.1: restart.
May 20 09:58:53 Tower dhcpcd[1285]: eth0: renewing lease of 192.168.1.10 (Network)
May 20 09:58:53 Tower dhcpcd[1285]: eth0: acknowledged 192.168.1.10 from 192.168.1.1 (Network)
May 20 09:58:53 Tower dhcpcd[1285]: eth0: leased 192.168.1.10 for 86400 seconds (Network)

Total Lines: 4

 

 

Should I be concerned that my system seems to be shutting itself down and restarting its syslogs on its own volition?

Edited by wheel
Link to comment
1 hour ago, wheel said:

restarting its syslogs on its own volition?

The log has rotated.  There's another "archived" syslog that'll show up in diagnostics

 

EDIT: Just noticed that you're on 5.0.5 so no diagnostics available.

 

The other log files are within /var/log, but you'll have to get them from a command prompt.

Edited by Squid
Link to comment

This is probably a stupid question, but was the 5.0.5 system putting itself into controlled shutdown without any prompting (or even unrelated action, that I could tell) from my end normal?  If not, are there any signs of what could have caused that in the log?

 

Right now, my plan's to run a parity check to be safe, followed by a cache drive replacement.  At that point, I'll continue moving files to the array, but hate to add things right now if there's a chance anything could go wrong.

 

I'll start the parity check before I fall asleep tonight unless I'm advised to do otherwise on here first!

Link to comment

Glad I ran a non-correcting parity check; looks like Disk 5 died during the check (currently disabled), and I have a warm spare ready to go.

 

Only thing that throws me off is towards the bottom of the log, after a series of read error messages from Disk 5, there's a set of notifications that seem like they're having problems reading from Disk 10 as well (right before it shuts down the parity check).  I'm hoping against hope that these Disk 10 read errors are because UnRaid was trying to check that disk against the parity drive for Disk 10 purposes, or it was coming up next or something, and that two drives did not die simultaneously / are currently disabled (sure doesn't look that way on UnMenu or UnRaid GUI, but I'm a little freaked out right now).

 

I can't attach the parity check / error log for some reason, so I'm pasting the lines that concern me below - my plan's to install the replacement drive and let everything rebuild as normal, but does anyone see anything I should do before I follow those steps?

 

I'm guessing this is where Disk 5 died:

May 24 04:12:16 Tower kernel: md: disk5 read error, sector=4070714128 (Errors) May 24 04:12:19 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors) May 24 04:12:19 Tower kernel: ata2.00: BMDMA stat 0x65 (Drive related) May 24 04:12:19 Tower kernel: ata2.00: failed command: READ DMA EXT (Minor Issues) May 24 04:12:19 Tower kernel: ata2.00: cmd 25/00:00:58:2b:a2/00:04:f2:00:00/e0 tag 0 dma 524288 in (Drive related) May 24 04:12:19 Tower kernel: res 51/40:00:58:2b:a2/40:00:f2:00:00/00 Emask 0x9 (media error) (Errors) May 24 04:12:19 Tower kernel: ata2.00: status: { DRDY ERR } (Drive related) May 24 04:12:19 Tower kernel: ata2.00: error: { UNC } (Errors) May 24 04:12:19 Tower kernel: ata2.00: configured for UDMA/133 (Drive related) May 24 04:12:19 Tower kernel: ata2.01: configured for UDMA/133 (Drive related) May 24 04:12:19 Tower kernel: sd 3:0:0:0: [sdd] Unhandled sense code (Drive related) May 24 04:12:19 Tower kernel: sd 3:0:0:0: [sdd] (Drive related) May 24 04:12:19 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08 (System) May 24 04:12:19 Tower kernel: sd 3:0:0:0: [sdd] (Drive related) May 24 04:12:19 Tower kernel: Sense Key : 0x3 [current] [descriptor] May 24 04:12:19 Tower kernel: Descriptor sense data with sense descriptors (in hex): May 24 04:12:19 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 May 24 04:12:19 Tower kernel: f2 a2 2b 58 May 24 04:12:19 Tower kernel: sd 3:0:0:0: [sdd] (Drive related) May 24 04:12:19 Tower kernel: ASC=0x11 ASCQ=0x4 May 24 04:12:19 Tower kernel: sd 3:0:0:0: [sdd] CDB: (Drive related) May 24 04:12:19 Tower kernel: cdb[0]=0x88: 88 00 00 00 00 00 f2 a2 2b 58 00 00 04 00 00 00 May 24 04:12:19 Tower kernel: end_request: I/O error, dev sdd, sector 4070714200 (Errors) May 24 04:12:19 Tower kernel: ata2: EH complete (Drive related) May 24 04:12:19 Tower kernel: md: disk5 read error, sector=4070714136 (Errors) May 24 04:13:20 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen (Errors) May 24 04:13:20 Tower kernel: ata2.00: failed command: FLUSH CACHE EXT (Minor Issues) May 24 04:13:20 Tower kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 (Drive related) May 24 04:13:20 Tower kernel: res 40/00:00:58:2b:a2/40:00:f2:00:00/00 Emask 0x4 (timeout) (Errors) May 24 04:13:20 Tower kernel: ata2.00: status: { DRDY } (Drive related) May 24 04:13:20 Tower kernel: ata2.00: hard resetting link (Minor Issues) May 24 04:13:21 Tower kernel: ata2.01: hard resetting link (Minor Issues) May 24 04:13:26 Tower kernel: ata2.00: link is slow to respond, please be patient (ready=0) (Drive related) May 24 04:13:30 Tower kernel: ata2.00: SRST failed (errno=-16) (Minor Issues) May 24 04:13:30 Tower kernel: ata2.00: hard resetting link (Minor Issues) May 24 04:13:31 Tower kernel: ata2.01: hard resetting link (Minor Issues) May 24 04:13:36 Tower kernel: ata2.00: link is slow to respond, please be patient (ready=0) (Drive related) May 24 04:13:40 Tower kernel: ata2.00: SRST failed (errno=-16) (Minor Issues) May 24 04:13:40 Tower kernel: ata2.00: hard resetting link (Minor Issues) May 24 04:13:41 Tower kernel: ata2.01: hard resetting link (Minor Issues) May 24 04:13:46 Tower kernel: ata2.00: link is slow to respond, please be patient (ready=0) (Drive related) May 24 04:14:15 Tower kernel: ata2.00: SRST failed (errno=-16) (Minor Issues) May 24 04:14:15 Tower kernel: ata2.01: limiting SATA link speed to 1.5 Gbps (Drive related) May 24 04:14:15 Tower kernel: ata2.00: hard resetting link (Minor Issues) May 24 04:14:16 Tower kernel: ata2.01: hard resetting link (Minor Issues) May 24 04:14:20 Tower kernel: ata2.00: SRST failed (errno=-16) (Minor Issues) May 24 04:14:20 Tower kernel: ata2.00: reset failed, giving up (Minor Issues) May 24 04:14:20 Tower kernel: ata2.00: disabled (Errors) May 24 04:14:20 Tower kernel: ata2.01: disabled (Errors) May 24 04:14:20 Tower kernel: ata2: EH complete (Drive related) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714144 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714152 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714160 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714168 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714176 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714184 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714192 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714200 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714208 (Errors) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] READ CAPACITY(16) failed (Drive related) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] (Drive related) May 24 04:14:20 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] Sense not available. (Drive related) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714216 (Errors) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] READ CAPACITY failed (Drive related) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714224 (Errors) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] (Drive related) May 24 04:14:20 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] Sense not available. (Drive related) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714232 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714240 (Errors) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] Asking for cache data failed (Drive related) May 24 04:14:20 Tower kernel: sd 3:0:0:0: [sdd] Assuming drive cache: write through (Drive related) May 24 04:14:20 Tower kernel: sdd: detected capacity change from 3000592982016 to 0 (Drive related) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070714248 (Errors)

 

Then, after a series of these:

May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070715088 (Errors)

 

I got these:

May 24 04:14:20 Tower kernel: sd 3:0:1:0: [sde] Unhandled error code (Errors) May 24 04:14:20 Tower kernel: sd 3:0:1:0: [sde] (Drive related) May 24 04:14:20 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System) May 24 04:14:20 Tower kernel: sd 3:0:1:0: [sde] CDB: (Drive related) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070715096 (Errors) May 24 04:14:20 Tower kernel: cdb[0]=0x88: 88 00 00 00 00 00 f2 a2 2f 58 00 00 04 00 00 00 May 24 04:14:20 Tower kernel: end_request: I/O error, dev sde, sector 4070715224 (Errors)

 

And then back to the "disk 5 read error" series, until interrupted by this:

May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714136 (Errors) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714144 (Errors) May 24 04:14:20 Tower kernel: md: md_do_sync: got signal, exit... (unRAID engine) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714152 (Errors) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714160 (Errors)

 

Which continues on with the 'disk 5 write error" series, but gets interrupted like this:

May 24 04:14:20 Tower kernel: sd 3:0:1:0: [sde] Unhandled error code (Errors) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714528 (Errors) May 24 04:14:20 Tower kernel: sd 3:0:1:0: [sde] (Drive related) May 24 04:14:20 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714536 (Errors) May 24 04:14:20 Tower kernel: sd 3:0:1:0: [sde] CDB: (Drive related) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714544 (Errors) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714552 (Errors) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714560 (Errors) May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714568 (Errors) May 24 04:14:20 Tower kernel: cdb[0]=0x88: 88 00 00 00 00 00 f2 a2 33 58 00 00 01 40 00 00 May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070714576 (Errors) May 24 04:14:20 Tower kernel: end_request: I/O error, dev sde, sector 4070716248 (Errors)

 

Then after getting back to "disk 5 write error" repetition, I get this, which is what concerns me:

May 24 04:14:20 Tower kernel: md: disk5 write error, sector=4070715152 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070715160 (Errors) May 24 04:14:20 Tower kernel: md: disk10 read error, sector=4070715160 (Errors) May 24 04:14:20 Tower kernel: md: multiple disk errors, sector=4070715160 (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070715168 (Errors) May 24 04:14:20 Tower kernel: md: disk10 read error, sector=4070715168 (Errors) May 24 04:14:20 Tower kernel: md: multiple disk errors, sector=4070715168 (Errors)

 

The logger appears to stop the "multiple disk errors" logging at this point, and these last two lines (disk 5, disk 10) repeat until the end:

May 24 04:14:20 Tower kernel: md: multiple disk errors, stopped logging (Errors) May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070715472 (Errors) May 24 04:14:20 Tower kernel: md: disk10 read error, sector=4070715472 (Errors)

 

And here's the end:

May 24 04:14:20 Tower kernel: md: disk5 read error, sector=4070716496 (Errors) May 24 04:14:20 Tower kernel: md: disk10 read error, sector=4070716496 (Errors) May 24 04:14:21 Tower kernel: md: recovery thread sync completion status: -4 (unRAID engine) May 24 04:14:21 Tower kernel: md: recovery thread woken up ... (unRAID engine) May 24 04:14:21 Tower kernel: md: recovery thread has nothing to resync (unRAID engine)

 

 

Edited by wheel
Hopefully fixed log formatting!
Link to comment

Hate to tempt fate, but just checked directory structure of a couple multi-drive user shares through Windows and nothing *seems* amiss... really hoping it's just one drive and the Disk 10 read errors were just parity checking one drive against the other!

Link to comment

OK guys, I really want to shut this thing down now, but hate to do that if I'm losing something I may need to fix a multiple drive failure.

 

Log is throwing these out at me now:

 

) May 24 05:09:48 Tower kernel: cdb[0]=0x8a: 8a 00 00 00 00 01 bf d7 af 18 00 00 00 08 00 00 May 24 05:09:48 Tower kernel: md: disk10 write error, sector=7513550552 (Errors) May 24 05:09:48 Tower kernel: sd 3:0:1:0: [sde] Unhandled error code (Errors) May 24 05:09:48 Tower kernel: sd 3:0:1:0: [sde] (Drive related) May 24 05:09:48 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System) May 24 05:09:48 Tower kernel: sd 3:0:1:0: [sde] CDB: (Drive related)

 

Like it keeps wanting to (and failing to) write to Drive 10.  Still no DISK DISABLED on UnMenu or UnRaid GUI for Disk 10.

 

DEFINITELY getting scared now though.

Link to comment

Then I'm screwed.  Every time I try to load a syslog I get an error message from this site halfway through the upload, " There was a problem processing the uploaded file. -200 "

 

UnRaid GUI shows a thumbs up icon for both Disk 5 and Disk 10 under the SMART Status windows for each; going to try checking via UnMenu now.

 

These most recent (and now apparently looping) lines make me feel like something is irrevocably screwed:

 

May 24 05:13:09 Tower kernel: REISERFS error (device md10): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6920 8521 0x0 SD]
May 24 05:13:09 Tower kernel: REISERFS error (device md10): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6920 9047 0x0 SD]
May 24 05:13:09 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
May 24 05:13:40 Tower last message repeated 218 times
May 24 05:14:41 Tower last message repeated 364 times
May 24 05:15:42 Tower last message repeated 366 times

Edited by wheel
Link to comment

OK, so it's looking like two disks are dead (unless and until I can upload a 5-24 full log to the site, which doesn't look like it's happening now) from the pair of SMART reports below.  Other random drives I selected deliver normal SMART reports.  Considering I made it through nearly a decade of UnRAID usage without a dual-drive failure, I guess things could've been worse.

 

Does anyone recommend I leave the system on until I can successfully upload a syslog here, or should I just go ahead and start the giving up process of figuring out what (if anything) I can salvage from these two drives and how best to get back to a parity protected situation on the remaining drives?

 

Going to be a fun day / week.

 

smartctl -a -d ata /dev/sdd
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

 

and

 

smartctl -a -d ata /dev/sde
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Link to comment

Disk5 looks real bad, SMART report would probably confirm.

 

Disk10 should be OK but dropped offline together with disk5 because your controller is in IDE mode, it should be in AHCI.

 

You should powerdown, replace disk5, change controller to AHCI  and attempt a rebuild.

Link to comment

Sounds like a plan.  Just checked SMART reports on all other drives and didn't see anything weird except for the non-response from 5 and 10.

 

Possibly tricky question: either my controller has always been in IDE mode or UnRaid somehow changed it for me (I definitely haven't messed with that setting in the BIOS at any time); if I need to change that setting, is there a way to do so from UnRaid, or do I need to hook a monitor up and change it myself through BIOS before UnRaid boots?

Link to comment

Amazingly enough, I think things are back to normal.  Disk 5 appeared to rebuild correctly (spot-check on a few files seems to confirm) and parity check's only a few minutes away from completion with no issues.

 

Thanks again, johnnie.black!

 

Now I'm off to research which stable build of unraid that supports long-term syslog recording is a good fit for my current needs and upgrade before I do anything else - maybe get to the bottom of that self-shutoff issue that way.  Damn glad I didn't lose a couple disks' worth of data getting there!

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.