[Sugar-devel] [PATCH sugar-0.84] journal scan of external media

James Cameron quozl at laptop.org
Wed Sep 8 21:24:36 EDT 2010


Thanks for the review.  Detailed response to Tomeu and Sascha's review
below, followed by revised patch.

On Wed, Sep 08, 2010 at 09:20:21AM +0200, Tomeu Vizoso wrote:
> FWIW, I think this should go in a single patch because it doesn't make
> sense to cherry-pick pieces of a single refactoring/rewrite.

Thanks.

> About the tabs change, the -w flag to git diff and blame can cope with
> it, but I think when someone finds tabs should just push that change
> inmediately. (Though I have grepped through master and have only found
> tabs in invites.py).

Sorry, I used tabify in emacs, so as to force the diff to show the
entire function as changed.  I had said I "replaced tabs with spaces",
but I should have said "replaced spaces with tabs".  This is only
because I don't know how to drive git diff well enough.

> > I think it needs to be split into smaller functions,
> 
> +1

Done.

On Wed, Sep 08, 2010 at 05:32:39PM +0200, Sascha Silbe wrote:
> Are you sure you used spaces? The patch looked strangely formatted in
> both my MUA and my editor.

Tabs, sorry.

> You can simplify this by using logging.exception:

Done.

> While we're at it, let's improve on the original code:
> 
> > +        if self._regex is not None and \
> > +            not self._regex.match(full_path):
> > +            add_to_list = False
> 
> All following ifs should be elif to avoid further examining entries that
> get skipped anyway. Especially for the MIME type comparison it will make
> a huge difference as it scans the file content.

Done.  Rewritten as guards with premature return.

> > +        if None not in [self._date_start, self._date_end] and \
> > +            (stat.st_mtime < self._date_start or
> > +             stat.st_mtime > self._date_end):
> > +            add_to_list = False
> 
> How about either checking just self._date_start or splitting it up into
> two separate rules?

Don't understand, sorry.

> > +        if self._mime_types:
> > +            mime_type = gio.content_type_guess(filename=full_path)
> 
> We should use a single way to derive MIME types (sugar.mime) in our code
> to avoid using different MIME types in different parts of Sugar.

I don't understand sugar.mime enough to do that.

> Back to your code:
> Why not .pop(0)?

Because I didn't know about it.  Thanks.  Fixed.

> > +        try:
> > +        entries = os.listdir(dir_path)
> > +        except OSError, e:
> > +        if e.errno not in [errno.ENOENT, errno.EACCES,
> > +                   errno.EPERM, errno.ENOTDIR]:
> > +            logging.error('Error reading directory %r: %s' %
> > +                  (dir_path, traceback.format_exc()))
> 
> logging.exception() again. Also should we really ignore ENOTDIR?
> I don't see any way we could trigger ENOTDIR. If a symlink is broken,
> we won't recognise it as a directory, even if it was meant to point to
> a directory.

ENOTDIR does happen and is reproducible.  The reason it happens is that
os.lstat reads a link, os.readlink follows it, and os.stat identifies
the result as a directory, it is added to the list, and then os.listdir
fails to list the directory.

The reproducible instance is /media/disk/dev/fd/24 ... since fd is a
symlink to /proc/self/fd ... (awesome isn't it?  One can insert a USB key
that will list the contents of any directory on the laptop, provided
access is allowed.)

On nine test filesystems of varying sizes, the following exceptions were
counted:

os.listdir
	exception EACCES a total of 95 times
	exception ENOTDIR a total of 5 times

os.lstat of an entry returned by os.listdir
	exception ENOENT a total of 5 times
	exception EIO a total of 7 times

os.stat of an entry that os.lstat showed was a link
	exception ENOENT a total of 7935 times

os.readlink
	never failed

> > +        self._pending_files.append(dir_path + '/' + entry)
> 
> What's the memory footprint of the two lists?

Depends on the population of the filesystem.  I don't consider it
important.  The pending files list never grows larger than the largest
directory.  The pending directory list never grows larger than the total
number of directories.

On the test filesystem of 73k files, the pending directory list in raw
form is about 250k, and worst case whole path list from "find /mnt" is
4MB.

> > +        self._pending_files.reverse()
> 
> Why? The order in which os.listdir() returns the entries is undefined
> anyway.

When one uses os.listdir and then reverses the order before going
through them, the resulting performance is low, since the files are
being accessed in reverse.  Because I didn't know about pop(0), I had to
use .reverse() to maintain ordering.


diff --git a/src/jarabe/journal/model.py b/src/jarabe/journal/model.py
index 50e8dc1..81363fb 100644
--- a/src/jarabe/journal/model.py
+++ b/src/jarabe/journal/model.py
@@ -16,10 +16,11 @@
 
 import logging
 import os
+import errno
 from datetime import datetime
 import time
 import shutil
-from stat import S_IFMT, S_IFDIR, S_IFREG
+from stat import S_IFLNK, S_IFMT, S_IFDIR, S_IFREG
 import traceback
 import re
 
@@ -258,7 +259,8 @@ class InplaceResultSet(BaseResultSet):
         BaseResultSet.__init__(self, query, cache_limit)
         self._mount_point = mount_point
         self._file_list = None
-        self._pending_directories = 0
+        self._pending_directories = []
+        self._pending_files = []
         self._stopped = False
 
         query_text = query.get('query', '')
@@ -283,7 +285,9 @@ class InplaceResultSet(BaseResultSet):
 
     def setup(self):
         self._file_list = []
-        self._recurse_dir(self._mount_point)
+        self._pending_directories = [self._mount_point]
+        self._pending_files = []
+        gobject.idle_add(self._scan)
 
     def stop(self):
         self._stopped = True
@@ -317,51 +321,95 @@ class InplaceResultSet(BaseResultSet):
 
         return entries, total_count
 
-    def _recurse_dir(self, dir_path):
-        if self._stopped:
-            return
-
-        for entry in os.listdir(dir_path):
-            if entry.startswith('.'):
-                continue
-            full_path = dir_path + '/' + entry
-            try:
-                stat = os.stat(full_path)
-                if S_IFMT(stat.st_mode) == S_IFDIR:
-                    self._pending_directories += 1
-                    gobject.idle_add(lambda s=full_path: self._recurse_dir(s))
-
-                elif S_IFMT(stat.st_mode) == S_IFREG:
-                    add_to_list = True
-
-                    if self._regex is not None and \
-                            not self._regex.match(full_path):
-                        add_to_list = False
-
-                    if None not in [self._date_start, self._date_end] and \
-                            (stat.st_mtime < self._date_start or
-                             stat.st_mtime > self._date_end):
-                        add_to_list = False
-
-                    if self._mime_types:
-                        mime_type = gio.content_type_guess(filename=full_path)
-                        if mime_type not in self._mime_types:
-                            add_to_list = False
-
-                    if add_to_list:
-                        file_info = (full_path, stat, int(stat.st_mtime))
-                        self._file_list.append(file_info)
-
-                    self.progress.send(self)
-
-            except Exception:
-                logging.error('Error reading file %r: %s' % \
-                              (full_path, traceback.format_exc()))
-
-        if self._pending_directories == 0:
-            self.setup_ready()
-        else:
-            self._pending_directories -= 1
+    def _scan(self):
+	if self._stopped:
+	    return False
+
+	self.progress.send(self)
+
+	if len(self._pending_files) > 0:
+	    return self._scan_a_file()
+
+	if len(self._pending_directories) > 0:
+	    return self._scan_a_directory()
+
+	self.setup_ready()
+	return False
+
+    def _scan_a_file(self):
+	full_path = self._pending_files.pop(0)
+
+	try:
+	    stat = os.lstat(full_path)
+	except OSError, e:
+	    if e.errno != errno.ENOENT:
+		logging.exception(
+		    'Error reading metadata of file %r', full_path)
+	    return True
+
+	if S_IFMT(stat.st_mode) == S_IFLNK:
+	    try:
+		link = os.readlink(full_path)
+	    except OSError, e:
+		logging.exception(
+		    'Error reading target of link %r', full_path)
+		return True
+
+	    if link == '.':
+		return True
+	    if link.startswith('/') and full_path.startswith(link):
+		return True
+
+	    try:
+		stat = os.stat(full_path)
+
+	    except OSError, e:
+		if e.errno != errno.ENOENT:
+		    logging.exception(
+			'Error reading metadata of linked file %r', full_path)
+		return True
+
+	if S_IFMT(stat.st_mode) == S_IFDIR:
+	    self._pending_directories.append(full_path)
+	    return True
+
+	if S_IFMT(stat.st_mode) != S_IFREG:
+	    return True
+
+	if self._regex is not None and \
+		not self._regex.match(full_path):
+	    return True
+
+	if None not in [self._date_start, self._date_end] and \
+		(stat.st_mtime < self._date_start or
+		 stat.st_mtime > self._date_end):
+	    return True
+
+	if self._mime_types:
+	    mime_type = gio.content_type_guess(filename=full_path)
+	    if mime_type not in self._mime_types:
+		return True
+
+	file_info = (full_path, stat, int(stat.st_mtime))
+	self._file_list.append(file_info)
+
+	return True
+
+    def _scan_a_directory(self):
+	dir_path = self._pending_directories.pop(0)
+
+	try:
+	    entries = os.listdir(dir_path)
+	except OSError, e:
+	    if e.errno not in [errno.EACCES, errno.ENOTDIR]:
+		logging.exception('Error reading directory %r', dir_path)
+	    return True
+
+	for entry in entries:
+	    if entry.startswith('.'):
+		continue
+	    self._pending_files.append(dir_path + '/' + entry)
+	return True
 
 def _get_file_metadata(path, stat):
     client = gconf.client_get_default()
-- 
James Cameron
http://quozl.linux.org.au/


More information about the Sugar-devel mailing list