[Systems] Ailing drive on housetree

Bernie Innocenti bernie at sugarlabs.org
Sat Nov 12 01:18:19 EST 2011

On Fri, 2011-11-11 at 23:25 -0500, Chris Ball wrote:
> Hi Bernie,
> On Fri, Nov 11 2011, Bernie Innocenti wrote:
> > Today I finally figured out why housetree was reporting high load
> > occasionally without any apparent activity on the VMs.
> >
> > It turns out that sdb is dying, and we didn't even have smartd running.
> > Luckily, sdb is part of RAID1 arrays with triple-redundancy. We could
> > continue operating with 2 drives for some time, but I'd feel safer if we
> > replaced the drive as soon as possible. After all, the remaining drives
> > are the same model and have been operating for the same amount of time
> > (806 days).
> Dumb question for my own curiosity: what in the smartctl output tells you
> that the disk is failing, compared to when run against the other disks?

I should have posted the diff between the smartctl output for sda and sdb:

housetree:~# diff -up sda sdb
--- sda	2011-11-11 21:12:15.083206106 -0500
+++ sdb	2011-11-11 21:12:18.183198752 -0500
@@ -4,13 +4,13 @@ Copyright (C) 2002-10 by Bruce Allen, ht
 Model Family:     Seagate Barracuda 7200.11 family
 Device Model:     ST31000333AS
-Serial Number:    9TE1F22A
+Serial Number:    9TE1DRMJ
 Firmware Version: CC3H
 User Capacity:    1,000,204,886,016 bytes
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   8
 ATA Standard is:  ATA-8-ACS revision 4
-Local Time is:    Fri Nov 11 21:12:14 2011 EST
+Local Time is:    Fri Nov 11 21:12:17 2011 EST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
@@ -43,7 +43,7 @@ Error logging capability:        (0x01)
 Short self-test routine 
 recommended polling time: 	 (   1) minutes.
 Extended self-test routine
-recommended polling time: 	 ( 202) minutes.
+recommended polling time: 	 ( 206) minutes.
 Conveyance self-test routine
 recommended polling time: 	 (   2) minutes.
 SCT capabilities: 	       (0x103f)	SCT Status supported.
@@ -54,30 +54,128 @@ SCT capabilities: 	       (0x103f)	SCT S
 SMART Attributes Data Structure revision number: 10
 Vendor Specific SMART Attributes with Thresholds:
-  1 Raw_Read_Error_Rate     0x000f   109   099   006    Pre-fail  Always       -       24527047
+  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       9353217
   3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
-  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       45
-  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       45
-  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       243865949
-  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19514
+  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       46
+  5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       286
+  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       21724259579
+  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19503
  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
- 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       45
+ 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       46
 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
-187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
-188 Command_Timeout         0x0032   100   094   000    Old_age   Always       -       54
-189 High_Fly_Writes         0x003a   042   042   000    Old_age   Always       -       58
-190 Airflow_Temperature_Cel 0x0022   058   053   045    Old_age   Always       -       42 (Lifetime Min/Max 32/47)
-194 Temperature_Celsius     0x0022   042   047   000    Old_age   Always       -       42 (0 21 0 0)
-195 Hardware_ECC_Recovered  0x001a   049   019   000    Old_age   Always       -       24527047
+187 Reported_Uncorrect      0x0032   079   079   000    Old_age   Always       -       21
+188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       167506280487
+189 High_Fly_Writes         0x003a   040   040   000    Old_age   Always       -       60
+190 Airflow_Temperature_Cel 0x0022   065   053   045    Old_age   Always       -       35 (Lifetime Min/Max 27/39)
+194 Temperature_Celsius     0x0022   035   047   000    Old_age   Always       -       35 (0 21 0 0)
+195 Hardware_ECC_Recovered  0x001a   034   021   000    Old_age   Always       -       9353217
 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
-240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       155602370186298
-241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3962927302
-242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1088105697
+240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       68637872376787
+241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       345453600
+242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1348943976
 SMART Error Log Version: 1
-No Errors Logged
+ATA Error Count: 21 (device log contains only the most recent five errors)
+	CR = Command Register [HEX]
+	FR = Features Register [HEX]
+	SC = Sector Count Register [HEX]
+	SN = Sector Number Register [HEX]
+	CL = Cylinder Low Register [HEX]
+	CH = Cylinder High Register [HEX]
+	DH = Device/Head Register [HEX]
+	DC = Device Command Register [HEX]
+	ER = Error register [HEX]
+	ST = Status register [HEX]
+Powered_Up_Time is measured from power on, and printed as
+DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
+SS=sec, and sss=millisec. It "wraps" after 49.710 days.
+Error 21 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+  When the command that caused the error occurred, the device was active or idle.
+  After command completion occurred, registers were:
+  -- -- -- -- -- -- --
+  40 51 00 6f b9 8f 02
+  Commands leading to the command that caused the error were:
+  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
+  -- -- -- -- -- -- -- --  ----------------  --------------------
+  60 00 80 3f b9 8f 42 00   8d+05:28:25.873  READ FPDMA QUEUED
+  27 00 00 00 00 00 e0 00   8d+05:28:25.845  READ NATIVE MAX ADDRESS EXT
+  ec 00 00 00 00 00 a0 02   8d+05:28:25.825  IDENTIFY DEVICE
+  ef 03 46 00 00 00 a0 02   8d+05:28:25.793  SET FEATURES [Set transfer mode]
+  27 00 00 00 00 00 e0 00   8d+05:28:25.765  READ NATIVE MAX ADDRESS EXT
+Error 20 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+  When the command that caused the error occurred, the device was active or idle.
+  After command completion occurred, registers were:
+  -- -- -- -- -- -- --
+  40 51 00 6f b9 8f 02
+  Commands leading to the command that caused the error were:
+  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
+  -- -- -- -- -- -- -- --  ----------------  --------------------
+  60 00 80 3f b9 8f 42 00   8d+05:28:22.543  READ FPDMA QUEUED
+  27 00 00 00 00 00 e0 00   8d+05:28:22.515  READ NATIVE MAX ADDRESS EXT
+  ec 00 00 00 00 00 a0 02   8d+05:28:22.495  IDENTIFY DEVICE
+  ef 03 46 00 00 00 a0 02   8d+05:28:22.463  SET FEATURES [Set transfer mode]
+  27 00 00 00 00 00 e0 00   8d+05:28:22.435  READ NATIVE MAX ADDRESS EXT
+Error 19 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+  When the command that caused the error occurred, the device was active or idle.
+  After command completion occurred, registers were:
+  -- -- -- -- -- -- --
+  40 51 00 6f b9 8f 02
+  Commands leading to the command that caused the error were:
+  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
+  -- -- -- -- -- -- -- --  ----------------  --------------------
+  60 00 80 3f b9 8f 42 00   8d+05:28:19.273  READ FPDMA QUEUED
+  27 00 00 00 00 00 e0 00   8d+05:28:19.245  READ NATIVE MAX ADDRESS EXT
+  ec 00 00 00 00 00 a0 02   8d+05:28:19.225  IDENTIFY DEVICE
+  ef 03 46 00 00 00 a0 02   8d+05:28:19.192  SET FEATURES [Set transfer mode]
+  27 00 00 00 00 00 e0 00   8d+05:28:19.165  READ NATIVE MAX ADDRESS EXT
+Error 18 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+  When the command that caused the error occurred, the device was active or idle.
+  After command completion occurred, registers were:
+  -- -- -- -- -- -- --
+  40 51 00 6f b9 8f 02
+  Commands leading to the command that caused the error were:
+  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
+  -- -- -- -- -- -- -- --  ----------------  --------------------
+  60 00 80 3f b9 8f 42 00   8d+05:28:15.972  READ FPDMA QUEUED
+  27 00 00 00 00 00 e0 00   8d+05:28:15.945  READ NATIVE MAX ADDRESS EXT
+  ec 00 00 00 00 00 a0 02   8d+05:28:15.925  IDENTIFY DEVICE
+  ef 03 46 00 00 00 a0 02   8d+05:28:15.892  SET FEATURES [Set transfer mode]
+  27 00 00 00 00 00 e0 00   8d+05:28:15.865  READ NATIVE MAX ADDRESS EXT
+Error 17 occurred at disk power-on lifetime: 19362 hours (806 days + 18 hours)
+  When the command that caused the error occurred, the device was active or idle.
+  After command completion occurred, registers were:
+  -- -- -- -- -- -- --
+  40 51 00 6f b9 8f 02
+  Commands leading to the command that caused the error were:
+  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
+  -- -- -- -- -- -- -- --  ----------------  --------------------
+  60 00 80 3f b9 8f 42 00   8d+05:28:12.712  READ FPDMA QUEUED
+  27 00 00 00 00 00 e0 00   8d+05:28:12.685  READ NATIVE MAX ADDRESS EXT
+  ec 00 00 00 00 00 a0 02   8d+05:28:12.664  IDENTIFY DEVICE
+  ef 03 46 00 00 00 a0 02   8d+05:28:12.632  SET FEATURES [Set transfer mode]
+  27 00 00 00 00 00 e0 00   8d+05:28:12.605  READ NATIVE MAX ADDRESS EXT
 SMART Self-test log structure revision number 1
 No self-tests have been logged.  [To run self-tests, use: smartctl -t]
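
For anyone who wants to repeat this comparison without reading the whole diff, here is a rough Python sketch. It assumes the attribute rows have the same layout as in the diff above. The WATCHED list is my own illustrative pick of attributes whose raw value counts bad sectors or failed reads; it is not an official failure criterion. The helper names parse_attributes and compare are made up for this example.

```python
import re

# Attributes whose raw value counts defects or failed operations.
# This short list is an assumption for illustration, not a vendor rule.
WATCHED = ("Reallocated_Sector_Ct", "Reported_Uncorrect",
           "Current_Pending_Sector", "Offline_Uncorrectable")

# One row of the attribute table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Return {name: (normalized_value, threshold, raw_value)}."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            _id, name, value, _worst, thresh, raw = m.groups()
            attrs[name] = (int(value), int(thresh), int(raw))
    return attrs


def compare(a_text, b_text):
    """Return {name: (raw_a, raw_b)} for watched attributes that differ."""
    a, b = parse_attributes(a_text), parse_attributes(b_text)
    return {
        name: (a[name][2], b[name][2])
        for name in WATCHED
        if name in a and name in b and a[name][2] != b[name][2]
    }


# Rows copied from the diff above: sda first, then sdb.
SDA = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       45
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
SDB = """\
  5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       286
187 Reported_Uncorrect      0x0032   079   079   000    Old_age   Always       -       21
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(compare(SDA, SDB))
# prints {'Reallocated_Sector_Ct': (45, 286), 'Reported_Uncorrect': (0, 21)}
```

With the two saved files from the diff command, the same check would be compare(open("sda").read(), open("sdb").read()). The 21 in Reported_Uncorrect lines up with the "ATA Error Count: 21" in sdb's error log.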

Bernie Innocenti
Sugar Labs Infrastructure Team
