NB: I will take my time creating this tutorial, as my typing speed is slower due to my wrist injury. I will be able to type much faster when my plaster comes off in about 2 weeks' time.
A couple of months ago, I found this link describing the USB protocol for a robotic arm that I spotted in a shop previously. A few weeks ago when I suffered a minor wrist injury and temporarily lost the use of my right hand, I decided to assemble the robot arm (decidedly a slow process with just one hand) and test out notbrainsurgery's code. It didn't take me long to realise that the most convenient way to control it was through voice commands rather than typing in command line arguments.
So what do you need to build this voice-controlled robot arm?
- Robot arm (from OWI, or Maplin if you're in the UK)
- computer running Linux
- Hidden Markov Model Toolkit (HTK) from Cambridge University
- Julius speech recognition decoder developed by CSRC, Japan
- recording software (Audacity, Sound Recorder, etc.)
- libusb version 1.0
- a quiet environment to record voice samples
NB: While all of the packages are practically open source, HTK's license restricts it to personal and research use. I don't think it's an OSI-approved license, so technically it can't be called open source. However, we only use HTK to train the model, and the restrictions do not extend to the model once it has been created.
Theoretically you can do this on a computer running Windows or OS X; however, it is generally easier on Linux (and please, if you decide to try this on OS X or Windows, don't ask me for help: Google is your friend). You can use any flavour of Linux. I'm currently using Ubuntu 11.04, but any distro with GCC and a good package manager will do.
This tutorial is mostly based on the Voxforge howto. I have added additional details and links, but if you really want to go into the details of the acoustic models, then the Voxforge tutorial is an excellent place to learn.
Creating a vocabulary and grammar
The first step is deciding on a command vocabulary. The robotic arm has 3 joints, a manipulator as well as a rotating base. Additionally it has a white LED for illumination at the manipulator.
I have chosen a vocabulary of thirteen words: wrist, elbow, shoulder, light, on, off, grip, open, close, up, down, left and right.
I then use this to create my robot's vocabulary file (sample.voca), in the format expected by the Julius grammar tools:
% NS_B
<s> sil
% NS_E
</s> sil
% JOINT_N
WRIST r ih s t
ELBOW eh l b ow
SHOULDER sh ow l d er
% DEV
LIGHT l ay t
% DEV_IN
ON aa n
OFF ao f
% MANIP
GRIP g r ih p
% M_ACT
OPEN ow p ax n
CLOSE k l ow s
% DIRECTION
UP ah p
DOWN d aw n
% ROTATION
LEFT l eh f t
RIGHT r ay t
Each word needs to be broken into its phonemes, that is, the individual units of sound in the word. English is usually said to have about 44 phonemes, and you can use the Voxforge lexicon or the CMU Pronouncing Dictionary to look up the phonemes for each word in your vocabulary. Now what are the headings "% JOINT_N" and others for? They specify the word category, which we use when creating the command grammar, as I describe in the section below.
Creating a grammar
Now we need to create a grammar. Languages can have complex grammars, but our robot just needs to understand a small subset. For now, I've set the possible sentences to be:
- Joint + up | down
- Grip + open | close
- Light + on | off
- Left | Right
Here, Joint can refer to the shoulder, elbow or wrist. This means that a sentence could be "Elbow up" or "Elbow down" but not "Shoulder open". Grammars are formally described using Backus-Naur Form, or BNF for short. It's up to you how to structure the robot's grammar; I've kept it simple for a start. Here is the grammar file (sample.grammar) for my robotic arm, in the format expected by the Julius grammar tools:
S : NS_B SENT NS_E
SENT: JOINT_N DIRECTION
SENT: MANIP M_ACT
SENT: DEV DEV_IN
SENT: ROTATION
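To see which commands the grammar actually allows, it can help to expand it by hand or with a few lines of code. The sketch below is only an illustration in Python and not part of the HTK or Julius toolchain. The mapping of words to categories is my reading of the sentence patterns listed above, and the names VOCA, SENT_RULES and expand are made up for this example. The silence categories NS_B and NS_E are left out.

```python
from itertools import product

# Category-to-word mapping, assumed from the sentence patterns described
# in the text (Joint + up|down, Grip + open|close, and so on).
VOCA = {
    "JOINT_N": ["SHOULDER", "ELBOW", "WRIST"],
    "DIRECTION": ["UP", "DOWN"],
    "MANIP": ["GRIP"],
    "M_ACT": ["OPEN", "CLOSE"],
    "DEV": ["LIGHT"],
    "DEV_IN": ["ON", "OFF"],
    "ROTATION": ["LEFT", "RIGHT"],
}

# The four SENT rules of the grammar, one list of categories per rule.
SENT_RULES = [
    ["JOINT_N", "DIRECTION"],
    ["MANIP", "M_ACT"],
    ["DEV", "DEV_IN"],
    ["ROTATION"],
]


def expand(rules, voca):
    """Return every sentence the rules allow, as space-separated strings."""
    sentences = []
    for rule in rules:
        # Take one word from each category in the rule, in order.
        for words in product(*(voca[category] for category in rule)):
            sentences.append(" ".join(words))
    return sentences
```

Calling expand(SENT_RULES, VOCA) gives a list that contains "ELBOW UP" and "LIGHT OFF" but not "SHOULDER OPEN", which matches the intent of the grammar.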
Training the Model
To train the acoustic model, you need to record samples of your voice. Software such as Audacity comes in useful here. Make sure that you record in a reasonably quiet environment, at a normal talking speed and volume. About 0.5 seconds of silence at the beginning and at the end of each recording is recommended. The Voxforge documentation mentioned earlier has more details on how to record voice samples.
Make sure that the sampling rate is set to 48000 Hz, with 16 bits per sample and the channel is set to mono. Export the files as WAV (Microsoft 16 bit PCM).
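If you want to double-check the exported files, something like the following Python sketch may help. It uses only the standard wave module. The function name check_wav and its default arguments are my own choices for illustration, set to the 48000 Hz, 16-bit, mono settings given above.

```python
import wave


def check_wav(path, rate=48000, sample_bytes=2, channels=1):
    """Return a list of problems found in the WAV file at `path`.

    An empty list means the file matches the expected settings.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sampling rate is {wav.getframerate()} Hz, expected {rate}"
            )
        if wav.getsampwidth() != sample_bytes:
            problems.append(
                f"sample width is {wav.getsampwidth() * 8} bits, "
                f"expected {sample_bytes * 8}"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channels, expected {channels}"
            )
    return problems
```

For example, check_wav("sample1.wav") returns an empty list when the file was exported with the right settings, and a list of messages otherwise.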
In summary, you need to record sample1.wav, sample2.wav... and so on, one WAV file per line in the sample.prompts file below. If you add more words to your vocabulary, then it is highly recommended that you put those words in the prompts file as well.
*/sample1 OPEN OPEN CLOSE CLOSE OPEN OPEN CLOSE CLOSE
*/sample2 GRIP WRIST SHOULDER ELBOW LEFT RIGHT OPEN CLOSE UP DOWN
*/sample3 LEFT WRIST RIGHT WRIST LEFT WRIST RIGHT WRIST
*/sample4 UP UP DOWN DOWN OPEN OPEN CLOSE CLOSE
*/sample5 SHOULDER SHOULDER ELBOW ELBOW GRIP GRIP
*/sample6 ELBOW UP ELBOW DOWN SHOULDER UP SHOULDER DOWN
*/sample7 GRIP OPEN GRIP CLOSE GRIP OPEN GRIP CLOSE
*/sample8 LEFT LEFT RIGHT RIGHT LEFT LEFT RIGHT RIGHT
*/sample9 LIGHT RIGHT LIGHT RIGHT LIGHT RIGHT
*/sample10 ON OFF ON OFF ON OFF ON OFF
*/sample11 BOOKENDS KENNEL KENNETH KENYA WEEKEND
*/sample12 BELT BELOW BEND AEROBIC DASHBOARD DATABASE
*/sample13 GATEWAY GATORADE GAZEBO AFGHAN AGAINST AGATHA
*/sample14 ABALON ABDOMINALS BODY ABOLISH
*/sample15 ABOUNDING ABOUT ACCOUNT ALLENTOWN
*/sample16 ACHIEVE ACTUAL ACUPUNCTURE ADVENTURE
*/sample17 ALGORITHM ALTHOUGH ALTOGETHER ANOTHER
*/sample18 BATTLE BEATLE LITTLE METAL
*/sample19 BITTEN BLATANT BRIGHTEN BRITAIN
*/sample20 BROOKHAVEN HOOD BROUHAHA BULLHEADS
*/sample21 BUSBOYS CHOICE COILS COIN
*/sample22 COLLECTION COLORATION COMBINATION COMMERCIAL
*/sample23 MIDDLE NEEDLE POODLE SADDLE
*/sample24 ALRIGHT ARTHRITIS BRIGHT COPYRIGHT CRITERIA RIGHT
*/sample25 COUPLE CRADLE CRUMBLE
*/sample26 CUBA CUBE CUMULATIVE
*/sample27 CURING CURLING CYCLING
*/sample28 CYNTHIA DANFORTH DEPTH
*/sample29 DIGEST DIGITAL DILIGENT
*/sample30 AMNESIA ASIA AVERSION BEIGE BEIJING
*/sample31 HELP HELLO HELMET HELPLESS AHEAD HELP
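Before training, it may be worth checking that every prompt line has a recording and that every command word is actually spoken somewhere in the prompts. The Python sketch below shows one way to do that. The function names are hypothetical, and it assumes the prompts format shown above: an ID such as */sample1 followed by the words.

```python
import os


def parse_prompts(text):
    """Map each prompt ID (e.g. 'sample1') to its list of words."""
    prompts = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        name = parts[0].split("/")[-1]  # '*/sample1' -> 'sample1'
        prompts[name] = parts[1:]
    return prompts


def missing_wavs(prompts, wav_dir):
    """Return the prompt IDs that have no matching <ID>.wav in wav_dir."""
    return [
        name
        for name in prompts
        if not os.path.isfile(os.path.join(wav_dir, name + ".wav"))
    ]


def uncovered_words(prompts, vocabulary):
    """Return the vocabulary words that never appear in any prompt."""
    spoken = {word for words in prompts.values() for word in words}
    return sorted(set(vocabulary) - spoken)
```

You would read the prompts file into a string, pass it to parse_prompts, and then call the other two functions with your WAV folder and your list of command words. Both should return empty lists.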
Putting everything together
As per the Voxforge documentation, create a folder 'voxforge' in your home directory, as well as 'voxforge/auto' and 'voxforge/HTK_scripts'. Also create 'voxforge/auto/train', 'voxforge/auto/train/wav', 'voxforge/auto/train/mfc' and 'voxforge/lexicon'.
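If you'd rather not create the folders by hand, a short script can do it. This is just a convenience sketch in Python. The names SUBDIRS and make_layout are my own, and the folder list is the one given in the paragraph above.

```python
import os

# Folders from the paragraph above, relative to the 'voxforge' folder.
# Creating the deepest ones also creates 'auto' and 'auto/train'.
SUBDIRS = [
    "auto/train/wav",
    "auto/train/mfc",
    "HTK_scripts",
    "lexicon",
]


def make_layout(root):
    """Create the folder tree under `root` and return the created paths."""
    created = []
    for sub in SUBDIRS:
        path = os.path.join(root, *sub.split("/"))
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```

Calling make_layout(os.path.expanduser("~/voxforge")) sets up the tree in your home directory, and it is safe to run more than once.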
- place sample.grammar, sample.voca, prompts in voxforge/auto
- put all the wav files (sample1.wav ....etc) into voxforge/auto/train/wav
- unzip the Hidden Markov Model tool kit (the HTK_samples archive) and put the following HTK scripts into voxforge/HTK_scripts:
- mkclscript.prl (from /samples/RMHTK/perl_scripts/)
- put voxforge_lexicon into voxforge/lexicon
Now in the voxforge/auto folder, execute:
In the voxforge/auto folder, create a file called codetrain.scp and fill it with:
Make sure there is one line for each WAV file you have created. The file lists every WAV file alongside the MFCC file (an HTK feature format) that it will be converted to.
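With 31 recordings, typing this file by hand gets tedious, so here is a rough Python sketch that builds the lines for you. It assumes each line holds a source WAV path followed by a target .mfc path, separated by a space. The function name codetrain_lines is hypothetical, and the correct path prefixes depend on which folder the conversion is run from, so treat the paths in the usage note as placeholders and adjust them to your setup.

```python
import os


def codetrain_lines(wav_dir, mfc_dir):
    """Build one 'source.wav target.mfc' line per WAV file in wav_dir.

    The paths are written exactly as given, so pass them relative to the
    folder the conversion will be run from (an assumption to adjust).
    """
    lines = []
    for name in sorted(os.listdir(wav_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".wav":
            continue  # ignore anything that is not a WAV file
        source = os.path.join(wav_dir, name)
        target = os.path.join(mfc_dir, base + ".mfc")
        lines.append(f"{source} {target}")
    return lines
```

For example, run from the voxforge/auto folder, "\n".join(codetrain_lines("train/wav", "train/mfc")) gives text you can write to codetrain.scp.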
Download the following scripts from Voxforge and extract them into your 'voxforge/auto/scripts' folder:
Then, in the voxforge/auto/scripts folder, execute:
If it finishes without errors, you can now proceed to the Tutorial part 2!