[Bugs] #1090 UNSP: refactor indexing/query parsing, add support for prefixes

SugarLabs Bugs bugtracker-noreply at sugarlabs.org
Sun Aug 9 15:13:18 EDT 2009


#1090: refactor indexing/query parsing, add support for prefixes
------------------------------------------+---------------------------------
    Reporter:  sascha_silbe               |          Owner:  tomeu                      
        Type:  enhancement                |         Status:  new                        
    Priority:  Unspecified by Maintainer  |      Milestone:  Unspecified by Release Team
   Component:  sugar-datastore            |        Version:  Git as of bugdate          
    Severity:  Unspecified                |     Resolution:                             
    Keywords:  r?                         |   Distribution:  Unspecified                
Status_field:  New                        |  
------------------------------------------+---------------------------------
Changes (by sascha_silbe):

  * keywords:  r- => r?


Comment:

 Replying to [comment:4 tomeu]:
 > I'm unsure about date ranges, do you really expect people will use them
 without any hint about the expected syntax?
 I don't expect them to use any non-basic query without proper
 documentation. The DateRangeProcessor was for mostly for completeness and
 can be considered an independent, new feature so I've removed it for now.
 If you think it might still be useful, we can work on a better version
 (e.g. with the Journal translating the date format) in a separate ticket.

 > There's also the issue with the Journal appending a * to the query, how
 do you plan to solve this?
 It's been a PITA during testing; would like to switch to FLAG_PARTIAL
 instead. The semantics are slightly different, though: FLAG_PARTIAL only
 does an implicit wildcard match on the last word, not on all.

 > Are we changing anything in what is stored in the index? If so, the
 index needs to be rebuilt?
 Yes and yes. I've not added code for doing the rebuild as I hope to get
 the rest of my changes into mainline for 0.88 and these already do an
 index rebuild because of the datastore layout changes.

 > Any change in the size of the index?
 It'll be significantly bigger. We currently store the "known" metadata
 twice (see the comment quoted below for the reason - basically it's
 required by Xapian and probably needs upstream changes to fix) and also
 include position information. The latter is needed for "phrase" searching
 to work, e.g. "mime_type:text/plain" ("text" and "plain" are considered
 separate words). Also we store each value both word-split and as a literal
 term (see below - to be changed before 0.88).

 >> _query_term_map = {
 > See PEP8 about the naming of constants.
 Forgot to rename after moving from class to global. Fixed.

 >>         # add full value for dictionary-based searches
 > ...
 >>         # index with and without prefix so specifying a prefix is
 optional
 >>         # in query strings
 > If you need comments to explain what your code does, consider
 restructuring it so it's clearer, or use API comments instead (likely the
 best option in this case).
 I have reworded those comments to make it clear that they are explanations
 of _why_ we are doing things this way, not of _what_ we are doing. I don't
 think splitting the function up (which would be required to turn the
 comments into docstrings) makes sense here.

 >>         # TODO: change query parser to generate phrase query instead
 > Can you extend on this?
 We want the user to be able to search for single words (e.g. for "tags"),
 so we split words during indexing. But dictionary-based queries currently
 use the passed value literally, i.e. without word splitting. So we need to
 store it as a single, unsplit term as well. By changing our dictionary
 query "parser" (rather: builder) to do word-splitting and using phrase
 searching, we can drop that duplication, thereby reducing the storage size
 significantly.

 >>         doc.add_term(prefix+value)
 >
 > What's the value of having a term without any prefix if we are already
 calling index_text() for that value?
 index_text does word splitting, add_term does not.

 > See PEP8 about spaces around operators.
 Looks like neither pylint nor pep8.py check for this. I've changed this
 occurence, but may have missed others.

 >>         for (name, prefix) in _query_term_map.items():
 > Unneeded parens.
 Obviously a matter of taste. Changed.

 >>     def _parse_query_term(self, m_name, prefix, m_value):
 > No idea what m_ means here, please don't use abbreviations if at all
 possible.
 Missed one occurence during refactoring, fixed now.

 >> +            return Query(Query.OP_OR, [
 >> +                self._parse_query_term(m_name, prefix, word)
 >> +                for word in m_value])
 > Can you split this in several statements? That will also give you the
 chance to give names to expressions, thus making clearer what is going on
 there.
 Done.

 >> +                "Only tuples of size 2 have a defined meaning. " \
 >> +                "Did you mean to pass a list instead?")
 > Strings don't need \. If we can use ' everywhere instead of ", would be
 great.
 Changed.

 >> +            # compatibility option for timestamp: {'start': 0, 'end':
 1}
 > Does this mean that ranges in the query dict would get deprecated by the
 ranges in the query string?
 No, deprecated by ranges expressed using tuples. Unlike alsroot, I want to
 keep the dictionary-based interface and even see it as the primary
 interface for by most activities.

 >> +        except xapian.QueryParserError, exception:
 >> +            logging.warning("Invalid query string:
 "+exception.get_msg())
 >> +            return Query()
 > Hmm, if the exception was propagated, the Journal could say that an
 error has happened, similar to how we say today that the query returned no
 results. Wouldn't that be more useful feedback than returning the results
 of a blank query?
 I'm not sure, it should probably be discussed with the Design Team. BTW,
 Query() matches nothing at all, so isn't the same as passing an blank
 query into the datastore. So the user is already getting some kind of
 feedback.

 >> +        queries += [
 >> +            self._parse_query_term(name, prefix, query_dict.pop(name))
 >> +            for (name, prefix) in _query_term_map.items()
 >> +            if name in query_dict]
 > Please split this in several statements. Not only helps with readability
 but also keeps future patches more clear.
 The only way I could split this into several statements is by doing a
 regular for loop and adding each item individually using append(). This is
 probably slower, though I'm not sure if it'll make a real difference.

 >> +        self._query_parser = QueryParser()
 >> +        self._query_parser.set_database(self._database)
 >> +        self._term_generator = TermGenerator()
 > Are these operations expensive enough to cache those objects? Don't
 remember seeing them when I profiled the DS.
 It seemed wasteful to regenerate them every time and very easy to change.
 Haven't profiled the DS yet.

 > I'm trying to improve the review process docs, comments welcome:
 An automatic style checker that catches more of your stylistic complaints
 (string separator, spaces between operators, ...) would be great as it
 reduces the number of review rounds.

-- 
Ticket URL: <http://trac.sugarlabs.org/ticket/1090#comment:5>
Sugar Labs <http://sugarlabs.org/>
Sugar Labs bug tracking system


More information about the Bugs mailing list