[Sugar-devel] [PATCH sugar-0.84] journal scan of external media

James Cameron quozl at laptop.org
Mon Sep 13 05:19:07 EDT 2010


On Fri, Sep 10, 2010 at 05:22:58PM +0200, Sascha Silbe wrote:
> Excerpts from James Cameron's message of Thu Sep 09 03:24:36 +0200 2010:
> > > > +        if None not in [self._date_start, self._date_end] and \
> > > > +            (stat.st_mtime < self._date_start or
> > > > +             stat.st_mtime > self._date_end):
> > > > +            add_to_list = False
> > > 
> > > How about either checking just self._date_start or splitting it up into
> > > two separate rules?
> > 
> > Don't understand, sorry.
> 
> I mean either:
> 
>         if self._date_start is not None and self.st_mtime < self._date_start:
>             return True
>         if self._date_end is not None and self.st_mtime > self._date_end:
>             return True

Thanks, taken.

> [ENOTDIR during os.listdir()]
> > The reproducible instance is /media/disk/dev/fd/24 ... since fd is a
> > symlink to /proc/self/fd ... (awesome isn't it?  One can insert a USB key
> > that will list the contents of any directory on the laptop, provided
> > access is allowed.)
> 
> Ah, it's good that came up during testing, didn't think about it. We
> should make sure symlinks don't point outside the filesystem by resolving
> them (os.path.abspath()) and checking against the root path.

Agreed.  ENOTDIR removed and os.path.abspath added.

> > os.lstat of an entry returned by os.listdir
> [...]
> >     exception EIO a total of 7 times
> 
> Interesting. Was the storage medium broken in some way (read error)?
> I don't even see EIO as a documented error code for lstat(2).

Yes, storage medium is broken.  It is a USB HDD with reallocated sectors
and no spare area sectors available, or with a pending unstable sector
waiting to be remapped.  dmesg shows:

[ 5070.582713] sd 7:0:0:0: [sda] Unhandled sense code
[ 5070.582726] sd 7:0:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
[ 5070.582740] sd 7:0:0:0: [sda] Sense Key : Hardware Error [current] 
[ 5070.582757] sd 7:0:0:0: [sda] Add. Sense: No additional sense information
[ 5070.582774] end_request: I/O error, dev sda, sector 66846816
[ 5070.584114] EXT2-fs error (device sda1): ext2_get_inode: unable to read inode block - inode=2084987, block=8355848

The drive itself is reasonably old and should have completely failed
long ago.

> > > Why? The order in which os.listdir() returns the entries is undefined
> > > anyway.
> > 
> > When one uses os.listdir and then reverses the order before going
> > through them, the resulting performance is low, since the files are
> > being accessed in reverse.  Because I didn't know about pop(0), I had to
> > use .reverse() to maintain ordering.
> 
> Ah, don't you love how undefined behaviour still affects performance?

I think that's quite expected, especially since performance is equally
undefined.  Although the order may not be defined, there is likely to be
an ordering imposed by the underlying implementation.

> [_scan()]
> > +    if len(self._pending_files) > 0:
> > +        return self._scan_a_file()
> > +
> > +    if len(self._pending_directories) > 0:
> > +        return self._scan_a_directory()
> 
> PEP-8 says we should use "if self._pending_*" instead (without len()).
> 
> If we just call the functions and return True in _scan(), we don't
> need to ensure that we always return True in self._scan_a_*().

Thanks, taken both points.

On Fri, Sep 10, 2010 at 06:27:50PM +0200, Sascha Silbe wrote:
> Excerpts from James Cameron's message of Thu Sep 09 06:39:53 +0200 2010:
> 
> > +            if link == '.':
> > +                return True
> 
> > +            if link.startswith('/') and full_path.startswith(link):
> > +                return True
> 
> Both shouldn't be necessary (will be catched by the loop detection).
> 
> As mentioned in the other mail, we should also make sure we stay on the
> current device:
> 
>         if not os.path.abspath(full_path).startswith(self._mount_point):
>             return True

Taken, with modification ... only need to check the absolute path of a
link after it is read, not before.

Repeated test on Sugar 0.84 with my set of working and broken devices.
The following diff is proposed for final review.

diff --git a/src/jarabe/journal/model.py b/src/jarabe/journal/model.py
index 50e8dc1..6b1011c 100644
--- a/src/jarabe/journal/model.py
+++ b/src/jarabe/journal/model.py
@@ -1,4 +1,4 @@
-# Copyright (C) 2007-2008, One Laptop Per Child
+# Copyright (C) 2007-2010, One Laptop per Child
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -16,10 +16,11 @@
 
 import logging
 import os
+import errno
 from datetime import datetime
 import time
 import shutil
-from stat import S_IFMT, S_IFDIR, S_IFREG
+from stat import S_IFLNK, S_IFMT, S_IFDIR, S_IFREG
 import traceback
 import re
 
@@ -258,7 +259,9 @@ class InplaceResultSet(BaseResultSet):
         BaseResultSet.__init__(self, query, cache_limit)
         self._mount_point = mount_point
         self._file_list = None
-        self._pending_directories = 0
+        self._pending_directories = []
+        self._visited_directories = []
+        self._pending_files = []
         self._stopped = False
 
         query_text = query.get('query', '')
@@ -283,7 +286,10 @@ class InplaceResultSet(BaseResultSet):
 
     def setup(self):
         self._file_list = []
-        self._recurse_dir(self._mount_point)
+        self._pending_directories = [self._mount_point]
+        self._visited_directories = []
+        self._pending_files = []
+        gobject.idle_add(self._scan)
 
     def stop(self):
         self._stopped = True
@@ -317,51 +323,100 @@ class InplaceResultSet(BaseResultSet):
 
         return entries, total_count
 
-    def _recurse_dir(self, dir_path):
+    def _scan(self):
         if self._stopped:
+            return False
+
+        self.progress.send(self)
+
+        if self._pending_files:
+            self._scan_a_file()
+            return True
+
+        if self._pending_directories:
+            self._scan_a_directory()
+            return True
+
+        self.setup_ready()
+        self._visited_directories = []
+        return False
+
+    def _scan_a_file(self):
+        full_path = self._pending_files.pop(0)
+
+        try:
+            stat = os.lstat(full_path)
+        except OSError, e:
+            if e.errno != errno.ENOENT:
+                logging.exception(
+                    'Error reading metadata of file %r', full_path)
+            return
+
+        if S_IFMT(stat.st_mode) == S_IFLNK:
+            try:
+                link = os.readlink(full_path)
+            except OSError, e:
+                logging.exception(
+                    'Error reading target of link %r', full_path)
+                return
+
+            if not os.path.abspath(link).startswith(self._mount_point):
+                return
+
+            try:
+                stat = os.stat(full_path)
+
+            except OSError, e:
+                if e.errno != errno.ENOENT:
+                    logging.exception(
+                        'Error reading metadata of linked file %r', full_path)
+                return
+
+        if S_IFMT(stat.st_mode) == S_IFDIR:
+            id_tuple = stat.st_ino, stat.st_dev
+            if not id_tuple in self._visited_directories:
+                self._visited_directories.append(id_tuple)
+                self._pending_directories.append(full_path)
+            return
+
+        if S_IFMT(stat.st_mode) != S_IFREG:
+            return
+
+        if self._regex is not None and \
+                not self._regex.match(full_path):
+            return
+
+        if self._date_start is not None and self.st_mtime < self._date_start:
+            return
+
+        if self._date_end is not None and self.st_mtime > self._date_end:
+            return
+
+        if self._mime_types:
+            mime_type = gio.content_type_guess(filename=full_path)
+            if mime_type not in self._mime_types:
+                return
+
+        file_info = (full_path, stat, int(stat.st_mtime))
+        self._file_list.append(file_info)
+
+        return
+
+    def _scan_a_directory(self):
+        dir_path = self._pending_directories.pop(0)
+
+        try:
+            entries = os.listdir(dir_path)
+        except OSError, e:
+            if e.errno != errno.EACCES:
+                logging.exception('Error reading directory %r', dir_path)
             return
 
-        for entry in os.listdir(dir_path):
+        for entry in entries:
             if entry.startswith('.'):
                 continue
-            full_path = dir_path + '/' + entry
-            try:
-                stat = os.stat(full_path)
-                if S_IFMT(stat.st_mode) == S_IFDIR:
-                    self._pending_directories += 1
-                    gobject.idle_add(lambda s=full_path: self._recurse_dir(s))
-
-                elif S_IFMT(stat.st_mode) == S_IFREG:
-                    add_to_list = True
-
-                    if self._regex is not None and \
-                            not self._regex.match(full_path):
-                        add_to_list = False
-
-                    if None not in [self._date_start, self._date_end] and \
-                            (stat.st_mtime < self._date_start or
-                             stat.st_mtime > self._date_end):
-                        add_to_list = False
-
-                    if self._mime_types:
-                        mime_type = gio.content_type_guess(filename=full_path)
-                        if mime_type not in self._mime_types:
-                            add_to_list = False
-
-                    if add_to_list:
-                        file_info = (full_path, stat, int(stat.st_mtime))
-                        self._file_list.append(file_info)
-
-                    self.progress.send(self)
-
-            except Exception:
-                logging.error('Error reading file %r: %s' % \
-                              (full_path, traceback.format_exc()))
-
-        if self._pending_directories == 0:
-            self.setup_ready()
-        else:
-            self._pending_directories -= 1
+            self._pending_files.append(dir_path + '/' + entry)
+        return
 
 def _get_file_metadata(path, stat):
     client = gconf.client_get_default()

-- 
James Cameron
http://quozl.linux.org.au/


More information about the Sugar-devel mailing list