[Sugar-devel] Speech Recognition in Sugar: GSOC 2010 Proposal

chirag jain chiragjain1989 at gmail.com
Tue Mar 30 11:24:52 EDT 2010

Previous message: [Sugar-devel] SoaS 3 Activity list
Next message: [Sugar-devel] Speech Recognition in Sugar: GSOC 2010 Proposal
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi everyone,

Sugar is becoming a more useful educational tool as more and more features
are introduced. Just as speech synthesis is the basis of many Sugar
activities, I would like to integrate speech recognition into core Sugar.
Below I present my feature and implementation ideas, and I would like to
hear the feedback of the Sugar community.

*Intended audience*

With a speech recognition system, we will be fulfilling the needs of two
types of audience: end users, who are not technical, and activity
developers.

*End users (Non technical)*

For end users, speech recognition can act as a medium for controlling
Sugar. Imagine a child who is physically challenged and therefore not able
to interact with the system in the usual way: he/she can open an activity
(such as the Write activity) just by saying “Open write activity”. He/she
can then interact with the activity through simple spoken commands while
speech recognition runs in the background. For example, he/she can start
typing by saying “Start typing” and then speaking the words to be written
into the document. Sugar will thus become accessible to physically
challenged users, which will be a boon to them.
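
To make the command idea concrete, here is a minimal sketch of how
recognized phrases could be routed to actions. Everything in it is an
assumption for illustration: the names CommandDispatcher and
build_dispatcher are hypothetical, and the recognizer is assumed to hand
us plain text. It is not part of Sugar or Sphinx.

```python
def normalize(phrase):
    """Lower-case the phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


class CommandDispatcher:
    """Hypothetical router from recognized text to command handlers."""

    def __init__(self):
        self._handlers = {}
        self.dictating = False  # True after "Start typing"
        self.typed = []         # text dictated so far

    def register(self, phrase, handler):
        self._handlers[normalize(phrase)] = handler

    def handle(self, phrase):
        """Run a matching command; otherwise dictate or ignore."""
        handler = self._handlers.get(normalize(phrase))
        if handler is not None:
            return handler()
        if self.dictating:
            self.typed.append(phrase)
            return "typed"
        return "ignored"


def build_dispatcher(launched):
    """Wire up the two example commands from the text above."""
    dispatcher = CommandDispatcher()

    def open_write():
        launched.append("Write")  # stand-in for really launching Write
        return "launched"

    def start_typing():
        dispatcher.dictating = True
        return "dictating"

    dispatcher.register("Open write activity", open_write)
    dispatcher.register("Start typing", start_typing)
    return dispatcher
```

With this sketch, speech that is not a command is ignored until “Start
typing” has been heard, after which it is collected as dictated text.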

Activities can also be developed around speech recognition to help children
improve their pronunciation through oral testing. Oral testing is a method
of giving users feedback on their pronunciation by recognizing their
speech: a child who speaks the word “Apple” correctly should be recognized,
and one who does not should not be. This is only one example, and numerous
activities can be created around speech recognition. This will make it
possible to develop more interactive activities for children and help make
Sugar a more useful educational tool.
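
As a rough illustration of oral testing, the sketch below compares the
text returned by a recognizer with the word the child was asked to say.
The function names are hypothetical, and a real activity would probably
want something more forgiving than an exact match (for example a
confidence score from the engine).

```python
def check_pronunciation(target, hypothesis):
    """Return True when the recognized text equals the target word.

    Comparison ignores case and surrounding whitespace only.
    """
    return hypothesis.strip().lower() == target.strip().lower()


def score_session(attempts):
    """Return the fraction of (target, hypothesis) pairs that matched."""
    if not attempts:
        return 0.0
    correct = sum(1 for target, hyp in attempts
                  if check_pronunciation(target, hyp))
    return correct / len(attempts)
```

An activity could call check_pronunciation after each attempt to give
immediate feedback and score_session at the end of a round.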

*Activity Developers (Technical)*

Activity developers would be primarily interested in the APIs provided for
speech recognition. We will provide simple, easy-to-use interfaces that
give developers full control over speech recognition. The developers of
existing activities can also integrate speech recognition to make them
more useful. Consider the Write activity, for example: we can modify it to
take its typing input from the speech recognition system instead of the
keyboard.
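
One possible shape for such an interface is sketched below. This is only
an assumption about what the API might look like, not a settled design:
SpeechRecognizer, its methods, and FakeEngine are all hypothetical names,
and FakeEngine simply stands in for a real engine such as Pocket Sphinx
so that the sketch can run on its own.

```python
class SpeechRecognizer:
    """Hypothetical wrapper an activity could use.

    `engine` is any object with a decode(audio) method returning text.
    """

    def __init__(self, engine):
        self._engine = engine
        self._listeners = []
        self.running = False

    def connect(self, callback):
        """Register a callback that receives each recognized string."""
        self._listeners.append(callback)

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def feed(self, audio):
        """Decode one chunk of audio and notify listeners."""
        if not self.running:
            return None
        text = self._engine.decode(audio)
        for callback in self._listeners:
            callback(text)
        return text


class FakeEngine:
    """Stand-in engine: treats the 'audio' bytes as UTF-8 text."""

    def decode(self, audio):
        return audio.decode("utf-8")
```

An activity like Write would connect a callback that inserts the
recognized text into the document, then call start() and stop() as the
user turns recognition on and off.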

*Implementation details*

For a speech recognition system, we require a speech recognition engine
that can be integrated into Sugar and on which we can build the entire
framework. The major requirements of such an engine are:

- 1. It should be capable of running on Linux, which is the core of
Sugar.
- 2. It should be open source so that we can modify it as per our needs
and requirements.
- 3. It should not consume a lot of memory at run time.
- 4. It should be an efficient speech recognizer.

One speech recognition engine that fulfills nearly all of these
requirements is Sphinx. Sphinx is an open source speech recognition engine
developed at CMU and is one of the top-class speech recognizers. It has
been developed primarily for Linux and comes in different versions.
http://www.speech.cs.cmu.edu/

The currently available versions are:

- 1. Sphinx 3
- 2. Pocket Sphinx
- 3. Sphinx 4

Sphinx 4 is the latest version and has been developed entirely in Java.
Sphinx 3 and Pocket Sphinx are older versions but are still widely used.
Using Sphinx 4 for integration in Sugar does not seem feasible because it
is written in Java, so we are left with two options: Sphinx 3 or Pocket
Sphinx. The decision between these two can only be made by experimenting
with them in Sugar. It will also depend on the devices Sugar currently
targets, so the main focus will be on OLPC XO laptops. The XOs have 256 MB
of RAM, and the run-time memory requirement of Pocket Sphinx is around
20 MB. At this time I am not sure about the requirements of Sphinx 3, but
they should be more than 30 MB. Pocket Sphinx is lightweight and is
designed primarily for embedded devices such as PDAs. Sphinx 3, on the
other hand, is developed to run on desktops and consumes a considerable
amount of memory. So at least Pocket Sphinx can be implemented in Sugar,
and the feasibility of Sphinx 3 will be tested soon.
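
To put these numbers in perspective, the small calculation below shows
what share of the XO's RAM each footprint would take. The 256 MB and
20 MB figures are the ones quoted above; 30 MB is only my guessed lower
bound for Sphinx 3, not a measurement.

```python
XO_RAM_MB = 256  # RAM of the XO as quoted above


def ram_share(footprint_mb, total_mb=XO_RAM_MB):
    """Return the footprint as a percentage of total RAM."""
    return 100.0 * footprint_mb / total_mb


pocket_sphinx_share = ram_share(20)  # about 7.8 % of RAM
sphinx3_min_share = ram_share(30)    # at least about 11.7 % of RAM
```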

*Language Support*

Sphinx engines require training data sets and language models for
recognizing speech, so they can be set up to recognize many languages. At
present they have been tested successfully on Chinese, Spanish, Dutch,
German, Hindi, Italian, Icelandic and Russian. We can therefore target a
wide range of users in different parts of the world who speak different
languages. I gathered this information in a discussion with a Sphinx
developer on IRC, and I am testing Sphinx 3 and Pocket Sphinx myself too.
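
A minimal sketch of how Sugar might choose a model for the user's locale
is given below. The MODELS table, its directory paths, and pick_model are
hypothetical placeholders; where the real models would live is not
decided yet.

```python
# Hypothetical table: language code -> directory holding its models.
MODELS = {
    "es": "models/es",
    "de": "models/de",
    "hi": "models/hi",
}


def pick_model(locale, models=MODELS):
    """Return the model directory for a locale such as 'hi_IN.UTF-8'.

    Returns None when no model is installed for that language, so the
    caller can disable speech recognition instead of guessing.
    """
    language = locale.split(".")[0].split("_")[0].lower()
    return models.get(language)
```

For example, pick_model("hi_IN.UTF-8") would select the Hindi models in
this table, while a locale with no entry returns None.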

*GUI considerations*

We can provide a speech recognition button in the Sugar frame (for
example, at the top right) which, when clicked, will start recognizing
speech in the background. Clicking the same button again will stop the
recognition process. Hovering over the speech recognition button will
expose a Sugar palette displaying the speech recognition parameters that
the user can modify. Sugar controls such as sliders, palette buttons and
combo boxes will be used within the palette to achieve the desired effect.

A keyboard shortcut like <Alt+S> can also be provided for starting speech
recognition. The corresponding hooks for the key shortcut will have to be
added to the Sugar UI source code.

I am leaving out further details for now; I will put them up on the Sugar
wiki after getting feedback from the community.

*My little introduction :)*

I am a computer science undergraduate student at Delhi University, New
Delhi. I have always been inspired by the development work on Sugar, and I
would definitely like to contribute more to it.

I have been working as a developer at SEETA (Software for Education,
Entertainment and Training Activities), New Delhi, India for the last 10
months.
http://seeta.in/j/team.html

I am also the lead developer of the Listen and Spell activity.
http://activities.sugarlabs.org/en-US/sugar/addon/4234

I look forward to some exciting feedback from the community.

Regards

Chirag
