[Bugs] #1090 UNSP: refactor indexing/query parsing, add support for prefixes
SugarLabs Bugs
bugtracker-noreply at sugarlabs.org
Sun Aug 9 15:13:18 EDT 2009
#1090: refactor indexing/query parsing, add support for prefixes
------------------------------------------+---------------------------------
Reporter: sascha_silbe | Owner: tomeu
Type: enhancement | Status: new
Priority: Unspecified by Maintainer | Milestone: Unspecified by Release Team
Component: sugar-datastore | Version: Git as of bugdate
Severity: Unspecified | Resolution:
Keywords: r? | Distribution: Unspecified
Status_field: New |
------------------------------------------+---------------------------------
Changes (by sascha_silbe):
* keywords: r- => r?
Comment:
Replying to [comment:4 tomeu]:
> I'm unsure about date ranges, do you really expect people will use them
without any hint about the expected syntax?
I don't expect them to use any non-basic query without proper
documentation. The DateRangeProcessor was for mostly for completeness and
can be considered an independent, new feature so I've removed it for now.
If you think it might still be useful, we can work on a better version
(e.g. with the Journal translating the date format) in a separate ticket.
> There's also the issue with the Journal appending a * to the query, how
do you plan to solve this?
It's been a PITA during testing; would like to switch to FLAG_PARTIAL
instead. The semantics are slightly different, though: FLAG_PARTIAL only
does an implicit wildcard match on the last word, not on all.
> Are we changing anything in what is stored in the index? If so, the
index needs to be rebuilt?
Yes and yes. I've not added code for doing the rebuild as I hope to get
the rest of my changes into mainline for 0.88 and these already do an
index rebuild because of the datastore layout changes.
> Any change in the size of the index?
It'll be significantly bigger. We currently store the "known" metadata
twice (see the comment quoted below for the reason - basically it's
required by Xapian and probably needs upstream changes to fix) and also
include position information. The latter is needed for "phrase" searching
to work, e.g. "mime_type:text/plain" ("text" and "plain" are considered
separate words). Also we store each value both word-split and as a literal
term (see below - to be changed before 0.88).
>> _query_term_map = {
> See PEP8 about the naming of constants.
Forgot to rename after moving from class to global. Fixed.
>> # add full value for dictionary-based searches
> ...
>> # index with and without prefix so specifying a prefix is
optional
>> # in query strings
> If you need comments to explain what your code does, consider
restructuring it so it's clearer, or use API comments instead (likely the
best option in this case).
I have reworded those comments to make it clear that they are explanations
of _why_ we are doing things this way, not of _what_ we are doing. I don't
think splitting the function up (which would be required to turn the
comments into docstrings) makes sense here.
>> # TODO: change query parser to generate phrase query instead
> Can you extend on this?
We want the user to be able to search for single words (e.g. for "tags"),
so we split words during indexing. But dictionary-based queries currently
use the passed value literally, i.e. without word splitting. So we need to
store it as a single, unsplit term as well. By changing our dictionary
query "parser" (rather: builder) to do word-splitting and using phrase
searching, we can drop that duplication, thereby reducing the storage size
significantly.
>> doc.add_term(prefix+value)
>
> What's the value of having a term without any prefix if we are already
calling index_text() for that value?
index_text does word splitting, add_term does not.
> See PEP8 about spaces around operators.
Looks like neither pylint nor pep8.py check for this. I've changed this
occurence, but may have missed others.
>> for (name, prefix) in _query_term_map.items():
> Unneeded parens.
Obviously a matter of taste. Changed.
>> def _parse_query_term(self, m_name, prefix, m_value):
> No idea what m_ means here, please don't use abbreviations if at all
possible.
Missed one occurence during refactoring, fixed now.
>> + return Query(Query.OP_OR, [
>> + self._parse_query_term(m_name, prefix, word)
>> + for word in m_value])
> Can you split this in several statements? That will also give you the
chance to give names to expressions, thus making clearer what is going on
there.
Done.
>> + "Only tuples of size 2 have a defined meaning. " \
>> + "Did you mean to pass a list instead?")
> Strings don't need \. If we can use ' everywhere instead of ", would be
great.
Changed.
>> + # compatibility option for timestamp: {'start': 0, 'end':
1}
> Does this mean that ranges in the query dict would get deprecated by the
ranges in the query string?
No, deprecated by ranges expressed using tuples. Unlike alsroot, I want to
keep the dictionary-based interface and even see it as the primary
interface for by most activities.
>> + except xapian.QueryParserError, exception:
>> + logging.warning("Invalid query string:
"+exception.get_msg())
>> + return Query()
> Hmm, if the exception was propagated, the Journal could say that an
error has happened, similar to how we say today that the query returned no
results. Wouldn't that be more useful feedback than returning the results
of a blank query?
I'm not sure, it should probably be discussed with the Design Team. BTW,
Query() matches nothing at all, so isn't the same as passing an blank
query into the datastore. So the user is already getting some kind of
feedback.
>> + queries += [
>> + self._parse_query_term(name, prefix, query_dict.pop(name))
>> + for (name, prefix) in _query_term_map.items()
>> + if name in query_dict]
> Please split this in several statements. Not only helps with readability
but also keeps future patches more clear.
The only way I could split this into several statements is by doing a
regular for loop and adding each item individually using append(). This is
probably slower, though I'm not sure if it'll make a real difference.
>> + self._query_parser = QueryParser()
>> + self._query_parser.set_database(self._database)
>> + self._term_generator = TermGenerator()
> Are these operations expensive enough to cache those objects? Don't
remember seeing them when I profiled the DS.
It seemed wasteful to regenerate them every time and very easy to change.
Haven't profiled the DS yet.
> I'm trying to improve the review process docs, comments welcome:
An automatic style checker that catches more of your stylistic complaints
(string separator, spaces between operators, ...) would be great as it
reduces the number of review rounds.
--
Ticket URL: <http://trac.sugarlabs.org/ticket/1090#comment:5>
Sugar Labs <http://sugarlabs.org/>
Sugar Labs bug tracking system
More information about the Bugs
mailing list