[Bugs] #1090 UNSP: refactor indexing/query parsing, add support for prefixes
SugarLabs Bugs
bugtracker-noreply at sugarlabs.org
Tue Aug 11 12:48:02 EDT 2009
#1090: refactor indexing/query parsing, add support for prefixes
------------------------------------------+---------------------------------
Reporter: sascha_silbe | Owner: tomeu
Type: enhancement | Status: new
Priority: Unspecified by Maintainer | Milestone: Unspecified by Release Team
Component: sugar-datastore | Version: Git as of bugdate
Severity: Unspecified | Resolution:
Keywords: r- | Distribution: Unspecified
Status_field: New |
------------------------------------------+---------------------------------
Changes (by tomeu):
* keywords: r? => r-
Comment:
Replying to [comment:5 sascha_silbe]:
> Replying to [comment:4 tomeu]:
> > I'm unsure about date ranges, do you really expect people will use
them without any hint about the expected syntax?
> I don't expect them to use any non-basic query without proper
documentation. The DateRangeProcessor was for mostly for completeness and
can be considered an independent, new feature so I've removed it for now.
If you think it might still be useful, we can work on a better version
(e.g. with the Journal translating the date format) in a separate ticket.
Well, I cannot really say myself because are far from sugar-using
children. We should ask in the mailing list for opinions.
> > There's also the issue with the Journal appending a * to the query,
how do you plan to solve this?
> It's been a PITA during testing; would like to switch to FLAG_PARTIAL
instead. The semantics are slightly different, though: FLAG_PARTIAL only
does an implicit wildcard match on the last word, not on all.
That will change the UI, I remember Marco working on this closely with
Eben. Can you get feedback from people on this issue? Otherwise they are
going to say we broke the search field ;)
> > Are we changing anything in what is stored in the index? If so, the
index needs to be rebuilt?
> Yes and yes. I've not added code for doing the rebuild as I hope to get
the rest of my changes into mainline for 0.88 and these already do an
index rebuild because of the datastore layout changes.
Sounds good.
> > Any change in the size of the index?
> It'll be significantly bigger. We currently store the "known" metadata
twice (see the comment quoted below for the reason - basically it's
required by Xapian and probably needs upstream changes to fix) and also
include position information. The latter is needed for "phrase" searching
to work, e.g. "mime_type:text/plain" ("text" and "plain" are considered
separate words). Also we store each value both word-split and as a literal
term (see below - to be changed before 0.88).
I guess deployers would be able to help evaluate this. Can you ask in the
mailing list addressing specifically dsd and martin langhoff? I guess they
will ask you for some numbers. Would be good to know how it impacts
performance on the XO.
> >> # TODO: change query parser to generate phrase query instead
> > Can you extend on this?
> We want the user to be able to search for single words (e.g. for
"tags"), so we split words during indexing. But dictionary-based queries
currently use the passed value literally, i.e. without word splitting. So
we need to store it as a single, unsplit term as well. By changing our
dictionary query "parser" (rather: builder) to do word-splitting and using
phrase searching, we can drop that duplication, thereby reducing the
storage size significantly.
Any downsides? How hard would it be to implement that?
> >> + # compatibility option for timestamp: {'start': 0,
'end': 1}
> > Does this mean that ranges in the query dict would get deprecated by
the ranges in the query string?
> No, deprecated by ranges expressed using tuples. Unlike alsroot, I want
to keep the dictionary-based interface and even see it as the primary
interface for by most activities.
Fine with me, though I don't see much value in changing that.
> >> + except xapian.QueryParserError, exception:
> >> + logging.warning("Invalid query string:
"+exception.get_msg())
> >> + return Query()
> > Hmm, if the exception was propagated, the Journal could say that an
error has happened, similar to how we say today that the query returned no
results. Wouldn't that be more useful feedback than returning the results
of a blank query?
> I'm not sure, it should probably be discussed with the Design Team. BTW,
Query() matches nothing at all, so isn't the same as passing an blank
query into the datastore. So the user is already getting some kind of
feedback.
You are right. If you make it raise an exception and enter a ticket about
it, I will update the journal.
> >> + queries += [
> >> + self._parse_query_term(name, prefix,
query_dict.pop(name))
> >> + for (name, prefix) in _query_term_map.items()
> >> + if name in query_dict]
> > Please split this in several statements. Not only helps with
readability but also keeps future patches more clear.
> The only way I could split this into several statements is by doing a
regular for loop and adding each item individually using append(). This is
probably slower, though I'm not sure if it'll make a real difference.
I think that's fine, readability comes before non-measured performance.
> >> + self._query_parser = QueryParser()
> >> + self._query_parser.set_database(self._database)
> >> + self._term_generator = TermGenerator()
> > Are these operations expensive enough to cache those objects? Don't
remember seeing them when I profiled the DS.
> It seemed wasteful to regenerate them every time and very easy to
change. Haven't profiled the DS yet.
Not a big deal, but trading memory usage for reduced cpu usage before
measuring doesn't seem like a good idea. Also all variables should have
the smallest scope they have, for higher cohesion.
> > I'm trying to improve the review process docs, comments welcome:
> An automatic style checker that catches more of your stylistic
complaints (string separator, spaces between operators, ...) would be
great as it reduces the number of review rounds.
True, I agree it sucks right now, but if you go browse the gnome and
mozilla bugzillas for code reviews, you will see that people are expected
to be quite careful about style on their first patches. And I have never
seen a C/C++ project that had tools to check style, they just have a page
explaining their code style guidelines:
http://developer.gnome.org/doc/guides/programming-guidelines/book1.html
https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide
I'm personally ok with the patch, if you get favorable feedback from UI
and OLPC people, we can push it as-is.
--
Ticket URL: <http://trac.sugarlabs.org/ticket/1090#comment:6>
Sugar Labs <http://sugarlabs.org/>
Sugar Labs bug tracking system
More information about the Bugs
mailing list