[sugar] [Wikireader] english wikireaders and 0.7

Samuel Klein sj
Sun Sep 7 19:57:47 EDT 2008


How is the current list of articles being generated?

On Sun, Sep 7, 2008 at 7:22 PM, Chris Ball <cjb at laptop.org> wrote:
> Hi,
>
>   > Where is the code for this?  Lede-detection code is a priority for
>   > me, and I'd like to work on it.  It should be easy to sense the
>   > start of the first H2 and drop the rest of the article.
>
> There is no code for lead detection.  You'd have to write it from
> scratch; take the enwiki.xml.bz2 from ?, run it through your script,
> and output a new enwiki.xml.bz2 with leads substituted for the full
> articles if the article isn't present in ?.
>
> ?:  http://download.wikimedia.org/enwiki/20080724/enwiki-20080724-pages-meta-current.xml.bz2
> ?:  http://dev.laptop.org/~cjb/enwiki/en-g1g1-8k.txt

Now I can't tell whether this was humour ;-)

There's already been a subset made for the 27,000 articles in
Wikipedia 0.7, thanks to Martin and CBM -- which is the set we are
talking about.  So no need to download and parse 8 GB of XML.

I meant: which code takes a list of articles and extracts that subset
from a larger xml file?  If I know the single-file-id/extraction
routine, I can run a size-reduction script on each article before
moving it from the larger xml collection to the smaller one.
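
For concreteness, here is roughly the sort of routine I mean, as a
sketch in Python.  The file names, the one-title-per-line list format,
the bare <mediawiki> wrapper, and the "cut at the first level-2
heading" rule are assumptions on my part, not a description of the
script Martin and CBM actually used:

    import bz2
    import re
    import xml.etree.ElementTree as ET
    from xml.sax.saxutils import escape

    # The first level-2 heading ("== Foo ==") marks the end of the lead.
    FIRST_H2 = re.compile(r'^==[^=].*==\s*$', re.MULTILINE)

    def truncate_to_lead(wikitext):
        m = FIRST_H2.search(wikitext)
        return wikitext[:m.start()] if m else wikitext

    def extract_subset(dump_path, titles_path, out_path):
        # titles_path: one article title per line (assumed format).
        wanted = set(line.strip() for line in open(titles_path, encoding='utf-8')
                     if line.strip())
        with bz2.open(dump_path, 'rb') as dump, \
             open(out_path, 'w', encoding='utf-8') as out:
            out.write('<mediawiki>\n')   # minimal wrapper; a real dump carries siteinfo etc.
            for _event, elem in ET.iterparse(dump):
                if elem.tag.rsplit('}', 1)[-1] != 'page':
                    continue
                # Pick out <title> and <text> by local name, ignoring the export namespace.
                fields = {c.tag.rsplit('}', 1)[-1]: c for c in elem.iter()}
                title = fields['title'].text
                if title in wanted:
                    text = fields['text'].text or ''
                    out.write('  <page>\n')
                    out.write('    <title>%s</title>\n' % escape(title))
                    out.write('    <text>%s</text>\n' % escape(truncate_to_lead(text)))
                    out.write('  </page>\n')
                elem.clear()             # drop processed pages so memory stays flat
            out.write('</mediawiki>\n')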


> This sounds complicated, and we don't try to do it for pages.  Templates
> are probably on average the same size as each other (or rather, they're
> all small enough that the difference is not very meaningful); find out
> the size of an average-looking one and how much disk space we have left,

Oh, right -- they are still being transcluded dynamically.  That's what
I was asking about; I'd forgotten that mwlib conserves space here.  That
seems to me a reason to include many templates -- the good ones
enhance dozens of articles.
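
To make "the good ones" concrete, one crude way to pick templates worth
shipping is to count how many articles in the subset transclude each
template and keep the most widely used.  A rough sketch -- the regex is
approximate, and load_subset() is a hypothetical helper yielding
(title, wikitext) pairs from the 0.7 dump, not anything that exists:

    import re
    from collections import defaultdict

    # Crude transclusion matcher: grabs the template name after "{{",
    # stopping at "|" or "}}".  It ignores nesting, parser functions
    # ({{#if: ...}}) and magic words, which is fine for a rough count.
    TEMPLATE = re.compile(r'\{\{\s*([^|{}#]+?)\s*[|}]')

    def template_usage(articles):
        """articles: iterable of (title, wikitext) -> {template: set of titles}."""
        used_by = defaultdict(set)
        for title, text in articles:
            for name in set(TEMPLATE.findall(text)):
                name = name[:1].upper() + name[1:]   # MediaWiki capitalizes the first letter
                used_by[name].add(title)
        return used_by

    # Hypothetical usage: keep the templates transcluded by the most articles.
    # usage = template_usage(load_subset('wp-0.7-subset.xml'))   # load_subset() is assumed
    # popular = sorted(usage, key=lambda t: len(usage[t]), reverse=True)[:200]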

SJ


