[sugar] [Wikireader] english wikireaders and 0.7
Samuel Klein
sj
Sun Sep 7 19:57:47 EDT 2008
How is the current list of articles being generated?
On Sun, Sep 7, 2008 at 7:22 PM, Chris Ball <cjb at laptop.org> wrote:
> Hi,
>
> > Where is the code for this? Lede-detection code is a priority for
> > me, and I'd like to work on it. It should be easy to sense the
> > start of the first H2 and drop the rest of the article.
>
> There is no code for lead detection. You'd have to write it from
> scratch; take the enwiki.xml.bz2 from [1], run it through your script,
> and output a new enwiki.xml.bz2 with articles substituted for leads
> if the article isn't present in [2].
>
> [1]: http://download.wikimedia.org/enwiki/20080724/enwiki-20080724-pages-meta-current.xml.bz2
> [2]: http://dev.laptop.org/~cjb/enwiki/en-g1g1-8k.txt
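
The lead cut itself -- sensing the first H2 and dropping the rest --
seems easy enough. Something like this untested sketch, where the
regex and the function name are just a first guess:

import re

# A wikitext section heading: a line that starts with "==" (H2 or deeper).
HEADING = re.compile(r'^==', re.MULTILINE)

def lead_only(wikitext):
    """Return only the lead: everything before the first section heading."""
    m = HEADING.search(wikitext)
    return wikitext if m is None else wikitext[:m.start()].rstrip()

It would be fooled by a line starting with "==" inside a table or a
<nowiki> block, but for a first pass that is probably fine.
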
Now I can't tell whether this was humour ;-)
There's already been a subset made for the 27,000 articles in
Wikipedia 0.7, thanks to Martin and CBM -- which is the set we are
talking about. So no need to download and parse 8GB of xml.
I meant: which code takes a list of articles and extracts that subset
from a larger xml file? If I know the single-file-id/extraction
routine, I can run a size-reduction script on each article before
moving it from the larger xml collection to the smaller one.
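
Roughly what I have in mind, as an untested sketch -- the names are
made up, and the namespace string should match whatever the dump
actually declares:

import bz2
import xml.etree.ElementTree as ET

# Namespace used by the 2008-era dumps; check the first lines of the file.
NS = '{http://www.mediawiki.org/xml/export-0.3/}'

def extract_subset(dump_path, wanted_titles, shrink):
    """Stream the big dump and yield (title, text) for the wanted pages,
    running shrink() -- e.g. the lead-only cut above -- on each text."""
    wanted = set(wanted_titles)
    with bz2.BZ2File(dump_path) as dump:
        for _event, elem in ET.iterparse(dump):
            if elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title')
                if title in wanted:
                    text = elem.findtext(NS + 'revision/' + NS + 'text') or ''
                    yield title, shrink(text)
                elem.clear()  # drop the finished page to keep memory down

wanted_titles would come straight from en-g1g1-8k.txt, and shrink could
be the lead-only cut (or a no-op for articles that should stay whole).
Writing the result back out as a valid dump is the part I'd rather
reuse existing code for.
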
> This sounds complicated, and we don't try to do it for pages. Templates
> are probably on average the same size as each other (or rather, they're
> all small enough that the difference is not very meaningful); find out
> the size of an average-looking one and how much disk space we have left,
Oh, right - they are still being transcluded dynamically. That's what
I was asking about; I had forgotten that mwlib conserves space here. That
seems to me a reason to include many templates -- the good ones
enhance dozens of articles.
SJ