[Systems] Ailing drive on housetree
Bernie Innocenti
bernie at sugarlabs.org
Sat Nov 12 01:18:19 EST 2011
On Fri, 2011-11-11 at 23:25 -0500, Chris Ball wrote:
> Hi Bernie,
>
> On Fri, Nov 11 2011, Bernie Innocenti wrote:
> > Today I finally figured out why housetree was reporting high load
> > occasionally without any apparent activity on the VMs.
> >
> > It turns out that sdb is dying, and we didn't even have smartd running.
> > Luckily, sdb is part of RAID1 arrays with triple-redundancy. We could
> > continue operating with 2 drives for some time, but I'd feel safer if we
> > replaced the drive as soon as possible. After all, the remaining drives
> > are the same model and have been operating for the same amount of time
> > (806 days).
>
> Dumb question for my own curiosity: what in the smartctl output tells you
> that the disk is failing, compared to when run against the other disks?
I should have published the diffs:
housetree:~# diff -up sda sdb
--- sda 2011-11-11 21:12:15.083206106 -0500
+++ sdb 2011-11-11 21:12:18.183198752 -0500
@@ -4,13 +4,13 @@ Copyright (C) 2002-10 by Bruce Allen, ht
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11 family
Device Model: ST31000333AS
-Serial Number: 9TE1F22A
+Serial Number: 9TE1DRMJ
Firmware Version: CC3H
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
-Local Time is: Fri Nov 11 21:12:14 2011 EST
+Local Time is: Fri Nov 11 21:12:17 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
@@ -43,7 +43,7 @@ Error logging capability: (0x01)
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
-recommended polling time: ( 202) minutes.
+recommended polling time: ( 206) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
@@ -54,30 +54,128 @@ SCT capabilities: (0x103f) SCT S
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
- 1 Raw_Read_Error_Rate 0x000f 109 099 006 Pre-fail Always - 24527047
+ 1 Raw_Read_Error_Rate 0x000f 105 099 006 Pre-fail Always - 9353217
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
- 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 45
- 5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 45
- 7 Seek_Error_Rate 0x000f 084 060 030 Pre-fail Always - 243865949
- 9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 19514
+ 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 46
+ 5 Reallocated_Sector_Ct 0x0033 094 094 036 Pre-fail Always - 286
+ 7 Seek_Error_Rate 0x000f 077 060 030 Pre-fail Always - 21724259579
+ 9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 19503
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
- 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 45
+ 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 46
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
-187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
-188 Command_Timeout 0x0032 100 094 000 Old_age Always - 54
-189 High_Fly_Writes 0x003a 042 042 000 Old_age Always - 58
-190 Airflow_Temperature_Cel 0x0022 058 053 045 Old_age Always - 42 (Lifetime Min/Max 32/47)
-194 Temperature_Celsius 0x0022 042 047 000 Old_age Always - 42 (0 21 0 0)
-195 Hardware_ECC_Recovered 0x001a 049 019 000 Old_age Always - 24527047
+187 Reported_Uncorrect 0x0032 079 079 000 Old_age Always - 21
+188 Command_Timeout 0x0032 100 099 000 Old_age Always - 167506280487
+189 High_Fly_Writes 0x003a 040 040 000 Old_age Always - 60
+190 Airflow_Temperature_Cel 0x0022 065 053 045 Old_age Always - 35 (Lifetime Min/Max 27/39)
+194 Temperature_Celsius 0x0022 035 047 000 Old_age Always - 35 (0 21 0 0)
+195 Hardware_ECC_Recovered 0x001a 034 021 000 Old_age Always - 9353217
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
-240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 155602370186298
-241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3962927302
-242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 1088105697
+240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 68637872376787
+241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 345453600
+242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 1348943976
SMART Error Log Version: 1
-No Errors Logged
+ATA Error Count: 21 (device log contains only the most recent five errors)
+ CR = Command Register [HEX]
+ FR = Features Register [HEX]
+ SC = Sector Count Register [HEX]
+ SN = Sector Number Register [HEX]
+ CL = Cylinder Low Register [HEX]
+ CH = Cylinder High Register [HEX]
+ DH = Device/Head Register [HEX]
+ DC = Device Command Register [HEX]
+ ER = Error register [HEX]
+ ST = Status register [HEX]
+Powered_Up_Time is measured from power on, and printed as
+DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
+SS=sec, and sss=millisec. It "wraps" after 49.710 days.
+
+Error 21 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+ When the command that caused the error occurred, the device was active or idle.
+
+ After command completion occurred, registers were:
+ ER ST SC SN CL CH DH
+ -- -- -- -- -- -- --
+ 40 51 00 6f b9 8f 02
+
+ Commands leading to the command that caused the error were:
+ CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
+ -- -- -- -- -- -- -- -- ---------------- --------------------
+ 60 00 80 3f b9 8f 42 00 8d+05:28:25.873 READ FPDMA QUEUED
+ 27 00 00 00 00 00 e0 00 8d+05:28:25.845 READ NATIVE MAX ADDRESS EXT
+ ec 00 00 00 00 00 a0 02 8d+05:28:25.825 IDENTIFY DEVICE
+ ef 03 46 00 00 00 a0 02 8d+05:28:25.793 SET FEATURES [Set transfer mode]
+ 27 00 00 00 00 00 e0 00 8d+05:28:25.765 READ NATIVE MAX ADDRESS EXT
+
+Error 20 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+ When the command that caused the error occurred, the device was active or idle.
+
+ After command completion occurred, registers were:
+ ER ST SC SN CL CH DH
+ -- -- -- -- -- -- --
+ 40 51 00 6f b9 8f 02
+
+ Commands leading to the command that caused the error were:
+ CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
+ -- -- -- -- -- -- -- -- ---------------- --------------------
+ 60 00 80 3f b9 8f 42 00 8d+05:28:22.543 READ FPDMA QUEUED
+ 27 00 00 00 00 00 e0 00 8d+05:28:22.515 READ NATIVE MAX ADDRESS EXT
+ ec 00 00 00 00 00 a0 02 8d+05:28:22.495 IDENTIFY DEVICE
+ ef 03 46 00 00 00 a0 02 8d+05:28:22.463 SET FEATURES [Set transfer mode]
+ 27 00 00 00 00 00 e0 00 8d+05:28:22.435 READ NATIVE MAX ADDRESS EXT
+
+Error 19 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+ When the command that caused the error occurred, the device was active or idle.
+
+ After command completion occurred, registers were:
+ ER ST SC SN CL CH DH
+ -- -- -- -- -- -- --
+ 40 51 00 6f b9 8f 02
+
+ Commands leading to the command that caused the error were:
+ CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
+ -- -- -- -- -- -- -- -- ---------------- --------------------
+ 60 00 80 3f b9 8f 42 00 8d+05:28:19.273 READ FPDMA QUEUED
+ 27 00 00 00 00 00 e0 00 8d+05:28:19.245 READ NATIVE MAX ADDRESS EXT
+ ec 00 00 00 00 00 a0 02 8d+05:28:19.225 IDENTIFY DEVICE
+ ef 03 46 00 00 00 a0 02 8d+05:28:19.192 SET FEATURES [Set transfer mode]
+ 27 00 00 00 00 00 e0 00 8d+05:28:19.165 READ NATIVE MAX ADDRESS EXT
+
+Error 18 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+ When the command that caused the error occurred, the device was active or idle.
+
+ After command completion occurred, registers were:
+ ER ST SC SN CL CH DH
+ -- -- -- -- -- -- --
+ 40 51 00 6f b9 8f 02
+
+ Commands leading to the command that caused the error were:
+ CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
+ -- -- -- -- -- -- -- -- ---------------- --------------------
+ 60 00 80 3f b9 8f 42 00 8d+05:28:15.972 READ FPDMA QUEUED
+ 27 00 00 00 00 00 e0 00 8d+05:28:15.945 READ NATIVE MAX ADDRESS EXT
+ ec 00 00 00 00 00 a0 02 8d+05:28:15.925 IDENTIFY DEVICE
+ ef 03 46 00 00 00 a0 02 8d+05:28:15.892 SET FEATURES [Set transfer mode]
+ 27 00 00 00 00 00 e0 00 8d+05:28:15.865 READ NATIVE MAX ADDRESS EXT
+
+Error 17 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+ When the command that caused the error occurred, the device was active or idle.
+
+ After command completion occurred, registers were:
+ ER ST SC SN CL CH DH
+ -- -- -- -- -- -- --
+ 40 51 00 6f b9 8f 02
+
+ Commands leading to the command that caused the error were:
+ CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
+ -- -- -- -- -- -- -- -- ---------------- --------------------
+ 60 00 80 3f b9 8f 42 00 8d+05:28:12.712 READ FPDMA QUEUED
+ 27 00 00 00 00 00 e0 00 8d+05:28:12.685 READ NATIVE MAX ADDRESS EXT
+ ec 00 00 00 00 00 a0 02 8d+05:28:12.664 IDENTIFY DEVICE
+ ef 03 46 00 00 00 a0 02 8d+05:28:12.632 SET FEATURES [Set transfer mode]
+ 27 00 00 00 00 00 e0 00 8d+05:28:12.605 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
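
For anyone who would rather not eyeball the diff: the rows that differ most tellingly between the two drives are Reallocated_Sector_Ct (45 on sda, 286 on sdb) and Reported_Uncorrect (0 on sda, 21 on sdb), which lines up with the 21 entries in sdb's ATA error log. Below is a minimal Python sketch of that comparison. The function names and the watchlist are my own assumptions for illustration, not anything smartmontools defines, and the sample rows are copied from the output above.

```python
# Attributes whose raw value counts bad sectors or failed reads.
# This watchlist is an assumption for illustration only.
WATCHLIST = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} from smartctl attribute rows.

    A row looks like:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    Only rows with a numeric ID, a 0x flag and a numeric raw value are kept.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue
        if not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        if fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def compare(reference_text, suspect_text, watchlist=WATCHLIST):
    """Return {name: (reference_raw, suspect_raw)} where the two differ."""
    reference = parse_attributes(reference_text)
    suspect = parse_attributes(suspect_text)
    return {
        name: (reference[name], suspect[name])
        for name in watchlist
        if name in reference
        and name in suspect
        and reference[name] != suspect[name]
    }


# Sample rows taken from the sda and sdb output in this message.
SDA = """\
  5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail Always - 45
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
"""
SDB = """\
  5 Reallocated_Sector_Ct 0x0033 094 094 036 Pre-fail Always - 286
187 Reported_Uncorrect 0x0032 079 079 000 Old_age Always - 21
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
"""

result = compare(SDA, SDB)
for name, (ref_raw, suspect_raw) in result.items():
    print(f"{name}: {ref_raw} -> {suspect_raw}")
```

In real use the two strings would be read from saved smartctl output, such as the sda and sdb files diffed above. A difference in these counters is only a hint, not a verdict: sda has 45 reallocated sectors of its own, so it deserves watching too.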
--
Bernie Innocenti
Sugar Labs Infrastructure Team
http://wiki.sugarlabs.org/go/Infrastructure_Team