[Bugs] #1090 UNSP: refactor indexing/query parsing, add support for prefixes

SugarLabs Bugs bugtracker-noreply at sugarlabs.org
Tue Aug 11 12:48:02 EDT 2009


#1090: refactor indexing/query parsing, add support for prefixes
------------------------------------------+---------------------------------
    Reporter:  sascha_silbe               |          Owner:  tomeu                      
        Type:  enhancement                |         Status:  new                        
    Priority:  Unspecified by Maintainer  |      Milestone:  Unspecified by Release Team
   Component:  sugar-datastore            |        Version:  Git as of bugdate          
    Severity:  Unspecified                |     Resolution:                             
    Keywords:  r-                         |   Distribution:  Unspecified                
Status_field:  New                        |  
------------------------------------------+---------------------------------
Changes (by tomeu):

  * keywords:  r? => r-


Comment:

 Replying to [comment:5 sascha_silbe]:
 > Replying to [comment:4 tomeu]:
 > > I'm unsure about date ranges, do you really expect people will use
 them without any hint about the expected syntax?
 > I don't expect them to use any non-basic query without proper
 documentation. The DateRangeProcessor was for mostly for completeness and
 can be considered an independent, new feature so I've removed it for now.
 If you think it might still be useful, we can work on a better version
 (e.g. with the Journal translating the date format) in a separate ticket.

 Well, I cannot really say myself because are far from sugar-using
 children. We should ask in the mailing list for opinions.

 > > There's also the issue with the Journal appending a * to the query,
 how do you plan to solve this?
 > It's been a PITA during testing; would like to switch to FLAG_PARTIAL
 instead. The semantics are slightly different, though: FLAG_PARTIAL only
 does an implicit wildcard match on the last word, not on all.

 That will change the UI, I remember Marco working on this closely with
 Eben. Can you get feedback from people on this issue? Otherwise they are
 going to say we broke the search field ;)

 > > Are we changing anything in what is stored in the index? If so, the
 index needs to be rebuilt?
 > Yes and yes. I've not added code for doing the rebuild as I hope to get
 the rest of my changes into mainline for 0.88 and these already do an
 index rebuild because of the datastore layout changes.

 Sounds good.

 > > Any change in the size of the index?
 > It'll be significantly bigger. We currently store the "known" metadata
 twice (see the comment quoted below for the reason - basically it's
 required by Xapian and probably needs upstream changes to fix) and also
 include position information. The latter is needed for "phrase" searching
 to work, e.g. "mime_type:text/plain" ("text" and "plain" are considered
 separate words). Also we store each value both word-split and as a literal
 term (see below - to be changed before 0.88).

 I guess deployers would be able to help evaluate this. Can you ask in the
 mailing list addressing specifically dsd and martin langhoff? I guess they
 will ask you for some numbers. Would be good to know how it impacts
 performance on the XO.

 > >>         # TODO: change query parser to generate phrase query instead
 > > Can you extend on this?
 > We want the user to be able to search for single words (e.g. for
 "tags"), so we split words during indexing. But dictionary-based queries
 currently use the passed value literally, i.e. without word splitting. So
 we need to store it as a single, unsplit term as well. By changing our
 dictionary query "parser" (rather: builder) to do word-splitting and using
 phrase searching, we can drop that duplication, thereby reducing the
 storage size significantly.

 Any downsides? How hard would it be to implement that?

 > >> +            # compatibility option for timestamp: {'start': 0,
 'end': 1}
 > > Does this mean that ranges in the query dict would get deprecated by
 the ranges in the query string?
 > No, deprecated by ranges expressed using tuples. Unlike alsroot, I want
 to keep the dictionary-based interface and even see it as the primary
 interface for by most activities.

 Fine with me, though I don't see much value in changing that.

 > >> +        except xapian.QueryParserError, exception:
 > >> +            logging.warning("Invalid query string:
 "+exception.get_msg())
 > >> +            return Query()
 > > Hmm, if the exception was propagated, the Journal could say that an
 error has happened, similar to how we say today that the query returned no
 results. Wouldn't that be more useful feedback than returning the results
 of a blank query?
 > I'm not sure, it should probably be discussed with the Design Team. BTW,
 Query() matches nothing at all, so isn't the same as passing an blank
 query into the datastore. So the user is already getting some kind of
 feedback.

 You are right. If you make it raise an exception and enter a ticket about
 it, I will update the journal.

 > >> +        queries += [
 > >> +            self._parse_query_term(name, prefix,
 query_dict.pop(name))
 > >> +            for (name, prefix) in _query_term_map.items()
 > >> +            if name in query_dict]
 > > Please split this in several statements. Not only helps with
 readability but also keeps future patches more clear.
 > The only way I could split this into several statements is by doing a
 regular for loop and adding each item individually using append(). This is
 probably slower, though I'm not sure if it'll make a real difference.

 I think that's fine, readability comes before non-measured performance.

 > >> +        self._query_parser = QueryParser()
 > >> +        self._query_parser.set_database(self._database)
 > >> +        self._term_generator = TermGenerator()
 > > Are these operations expensive enough to cache those objects? Don't
 remember seeing them when I profiled the DS.
 > It seemed wasteful to regenerate them every time and very easy to
 change. Haven't profiled the DS yet.

 Not a big deal, but trading memory usage for reduced cpu usage before
 measuring doesn't seem like a good idea. Also all variables should have
 the smallest scope they have, for higher cohesion.

 > > I'm trying to improve the review process docs, comments welcome:
 > An automatic style checker that catches more of your stylistic
 complaints (string separator, spaces between operators, ...) would be
 great as it reduces the number of review rounds.

 True, I agree it sucks right now, but if you go browse the gnome and
 mozilla bugzillas for code reviews, you will see that people are expected
 to be quite careful about style on their first patches. And I have never
 seen a C/C++ project that had tools to check style, they just have a page
 explaining their code style guidelines:

 http://developer.gnome.org/doc/guides/programming-guidelines/book1.html
 https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide

 I'm personally ok with the patch, if you get favorable feedback from UI
 and OLPC people, we can push it as-is.

-- 
Ticket URL: <http://trac.sugarlabs.org/ticket/1090#comment:6>
Sugar Labs <http://sugarlabs.org/>
Sugar Labs bug tracking system


More information about the Bugs mailing list