The hardware solution is indeed interesting: it targets the problem directly, but it loses scalability, while a software solution is more flexible.
Sphinx2 really can do voice recognition without being trained first. In other words, it works out of the box, at least for English and Spanish. This means it does not simply match the voice waveform against trained samples, as many competitors do; it actually analyses the content of the sentence.
You can find more information about the Sphinx project at
http://cmusphinx.sourceforge.net/
There is also another approach that can be combined with voice recognition: lip reading. Together they are known as multimodal recognition. I know there were some experiments with Sphinx, but I saw this one or two years ago, so I don't know how much progress has been made since. Probably some.

The advantage of this technique is that it allows far better recognition, because it compares what has been recognized from the sound with what has been recognized from lip reading.
It works with models of the mouth's shape, represented as vectors. A camera watches you and performs its own recognition.
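
To make the idea of comparing the two recognizers more concrete, here is a minimal sketch in Python. It is not Sphinx code and does not use any Sphinx API: the function name, the score dictionaries and the weight are all made up for illustration. It assumes each recognizer returns a score between 0 and 1 for every candidate word, and it simply takes a weighted sum of the two.

```python
from typing import Dict


def fuse_hypotheses(audio_scores: Dict[str, float],
                    visual_scores: Dict[str, float],
                    audio_weight: float = 0.7) -> str:
    """Pick the candidate word with the best combined score.

    Hypothetical helper: both inputs map candidate words to a score
    in [0, 1]; a word missing from one recognizer counts as 0 there.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    candidates = set(audio_scores) | set(visual_scores)
    if not candidates:
        raise ValueError("no candidate words to compare")

    def combined(word: str) -> float:
        # Weighted sum of the audio score and the lip-reading score.
        return (audio_weight * audio_scores.get(word, 0.0)
                + (1.0 - audio_weight) * visual_scores.get(word, 0.0))

    # Sort first so that ties are broken the same way on every run.
    return max(sorted(candidates), key=combined)


# Made-up scores: "night" and "might" sound alike, but the lips
# close for "m" and not for "n", so the camera can tell them apart.
audio = {"night": 0.52, "might": 0.48}
visual = {"night": 0.20, "might": 0.80}

print(fuse_hypotheses(audio, visual))
```

With these example numbers the sound alone slightly prefers "night", but the lip-reading score tips the combined result to "might". A real system would work on much richer data than one score per word, this only shows why two sources can beat one.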
One last thing: I'd suggest that beginners try Sphinx2 and perlbox-voice; together they make a good introduction.