[Sugar-devel] Datastore index corruption

Sascha Silbe sascha-ml-ui-sugar-devel at silbe.org
Sun Jun 20 06:52:57 EDT 2010


Excerpts from Bernie Innocenti's message of Sun Jun 20 00:33:50 +0000 2010:

> The journal was showing just one object, but the
> ~/.sugar/default/datastore directory contained 4-5 invisible entries.
When were these "invisible" entries written? Directly before a crash?
With the way the current data store works, it's easy for the index to
get "out of sync" with the metadata stored directly on disk (on crashes,
not during normal operation). Barring another rewrite, I can't think of
any way to prevent that from happening without slowing down to a crawl.
We already do the best we can by flushing the index every 20 changes and
60 seconds after the last change (see IndexStore._flush()).
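
To make the flush policy concrete, here is a minimal sketch of "flush
after 20 changes or 60 seconds after the last change". It is not the
actual IndexStore._flush() code; the class and method names are made up
for illustration, and the caller is assumed to poll idle_flush_due()
from some timer.

```python
import time


class FlushPolicy:
    """Decide when a buffered index should be flushed to disk.

    Hypothetical helper, not part of the Sugar data store.
    """

    def __init__(self, max_changes=20, idle_seconds=60, clock=time.monotonic):
        self.max_changes = max_changes
        self.idle_seconds = idle_seconds
        self._clock = clock          # injectable so the policy can be tested
        self._pending = 0            # changes not yet flushed
        self._last_change = None     # timestamp of the most recent change

    def record_change(self):
        """Note one index change; return True if a flush is due right away."""
        self._pending += 1
        self._last_change = self._clock()
        return self._pending >= self.max_changes

    def idle_flush_due(self):
        """Return True if changes are pending and the idle timeout expired."""
        if self._pending == 0:
            return False
        return self._clock() - self._last_change >= self.idle_seconds

    def flushed(self):
        """Reset the counters after the caller has flushed the index."""
        self._pending = 0
        self._last_change = None
```

Anything recorded since the last flush is what the index loses when the
machine dies: up to 19 changes, or whatever was written during the last
60 seconds.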

FWIW, an index that lags behind the on-disk metadata after a crash is
the most likely reason for your current problem. It would probably pay
off more to analyse the reasons for the crashes (including power
cycles) than to try to make the current data store more robust against
all scenarios.

If the laptops run out of power suddenly, maybe powerd could tell the
data store to go to a slower, more fail-safe mode (like the laptop-mode
script does on the lower layer). We would flush the Xapian index on
every change then, increasing the chance the index contains all entries
saved directly before the crash. IIRC automatic shut-down on low battery
has been decided against because it would reduce the maximum run-time.
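
A rough sketch of what such a mode switch could look like. Everything
here is hypothetical: powerd has no such interface that I know of, and
BufferedIndex is only a toy stand-in for the real index writer.

```python
class BufferedIndex:
    """Toy stand-in for an index writer that batches changes."""

    NORMAL_THRESHOLD = 20    # flush every 20 changes in normal operation
    FAILSAFE_THRESHOLD = 1   # flush on every change in fail-safe mode

    def __init__(self, flush_func):
        self._flush_func = flush_func   # callable that commits to disk
        self._threshold = self.NORMAL_THRESHOLD
        self._pending = 0

    def set_failsafe(self, enabled):
        """Hypothetical hook a power daemon could call on low battery."""
        self._threshold = (self.FAILSAFE_THRESHOLD if enabled
                           else self.NORMAL_THRESHOLD)
        if enabled and self._pending:
            self._flush()   # don't leave older changes sitting in the buffer

    def change(self):
        """Record one change and flush if the current threshold is reached."""
        self._pending += 1
        if self._pending >= self._threshold:
            self._flush()

    def _flush(self):
        self._flush_func()
        self._pending = 0
```

The price is one commit per change while the mode is active, which is
exactly the slowdown we avoid in normal operation.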

If the kids power-cycle the laptops during (more or less) normal
operation, we should check why. One thing that bugged me was the lack of
busy-feedback. On an old-style PC, I would watch the HD LED and listen to
the hard disk to know whether the system is not reacting to my input
because it's busy doing something, or whether it crashed and I need to
reset it. Sugar on XO-1[.5] lacks all useful indicators (even the busy
cursor is almost unused), so how would a child decide whether to wait or
to power-cycle?

> There was no time to analyze the problem in detail,
:(
We can't do much to improve the robustness of the current data store
without knowing exactly what caused it to break in the field. Maybe you
could dd mtdblock0 to a file on a USB stick the next time? If possible,
do it from a system booted from the stick (so the internal flash is not
mounted at the time of the dump) - you could even automate it then.
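
If you want to automate the dump, something along these lines might do
in place of plain dd. It is only a sketch: the device and destination
paths in the usage comment are assumptions, and I haven't tried it on an
XO. The checksum is there so the image can be verified after copying it
off the stick.

```python
import hashlib
import os


def dump_device(source, destination, chunk_size=1024 * 1024):
    """Copy source to destination in chunks; return (bytes, sha256 hex)."""
    digest = hashlib.sha256()
    copied = 0
    with open(source, "rb") as src, open(destination, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
        # Make sure the image really is on the stick before it gets pulled.
        dst.flush()
        os.fsync(dst.fileno())
    return copied, digest.hexdigest()


# Example call (both paths are assumptions, adjust to the actual setup):
# dump_device("/dev/mtdblock0", "/media/usb/mtdblock0.img")
```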

>  * the corruption could be caused by flash problems. I have found
>    laptops in the field that wouldn't boot because /sbin/lvm was
>    corrupted
There's nothing the data store can do against this type of problem (it
wouldn't even be able to start up if its program code is corrupted). This
needs to be fixed at a lower layer (which would essentially boil down to
making everything redundant, thus halving the available space).

>  * we can't exclude jffs2 problems too: when it's almost full, it does
>    slow garbage collection passes on boot which kids interrupt by
>    power cycling. I wonder how robust jffs2 is in this case.
I wonder if UBIFS is better in this regard. With UBIFS, flash doesn't
seem to last as long as with JFFS2 [1], but maybe it handles crashes
better? (I don't know much about either JFFS2 or UBIFS myself, so I
can't tell.)

>  * there might be a bug in xapian. If so, we'll see this issue also
>    on the XO-1.5
Unlikely, as I've never seen it happen in testing. But it's not
impossible, of course, especially since I tend to run the latest
upstream version, which will already have more bugs fixed.

>  * I'm skeptical it's a new issue in 0.84 or F-11: the older builds
>    had so many data loss issues that a subtler problem like this
>    could have easily gone unnoticed.
The data store has been rewritten from scratch for 0.84+. Only bugs on
a lower layer (e.g. JFFS2) would apply to both data stores.

>  * can the datastore detect index corruption in the most obvious cases?
>    If so, what would it do?
If it's corrupted badly enough, Xapian will detect it and we will do a
full index rebuild.
Xapian couldn't effectively guard itself against subtle corruptions
without doing checksumming of each and every data block. That belongs
on a lower layer (md, file system), not in applications.

>  * how long does it take to rebuild the index on a busy journal?
Long enough to cause the kids to power-cycle again. The rebuild would
start over on the next boot, but if the issue we are trying to fix with
the index rebuild is on a lower layer, what will happen on that reboot?

>  * finally, if we can't find a 100% robust solution, would it make
>    sense to add a "Reindex Journal" button somewhere?
The user can just delete the index_updated file and restart Sugar. If
the laptop crashes often enough to warrant a button, we failed badly.
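
For completeness, the manual step could be scripted roughly like this
(Python 3). I'm assuming the flag file sits at the top level of the data
store directory, i.e. ~/.sugar/default/datastore/index_updated; check
your installation before relying on that. Sugar still needs to be
restarted afterwards.

```python
import os


def request_reindex(datastore_root):
    """Remove the index_updated flag file; return True if it existed.

    The flag location inside datastore_root is an assumption.
    """
    flag = os.path.join(datastore_root, "index_updated")
    try:
        os.remove(flag)
    except FileNotFoundError:
        return False
    return True


# Example call (path is an assumption):
# request_reindex(os.path.expanduser("~/.sugar/default/datastore"))
```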

Sascha

[1] http://dev.laptop.org/~wad/nand/
--
http://sascha.silbe.org/
http://www.infra-silbe.de/