Festival is interesting, but there's also the Sphinx project, which is really worth spending time on. Unfortunately for me it only works for English, so I'm out of the running: training Sphinx on a new language requires a sound studio with heavy hardware and a lot of time. So it's out of reach for French users, but it's really efficient anyway.
Its strength is far beyond the usual voice recognition systems. Have a look at it, compile it (that takes a very long time), try it. It's an amazing tool that some U.S. Army services aim to use for automated real-time translation.
Some experiments have also been made in theaters of operations to communicate with local populations.
About the FFT:
The JACK server may be useful for what you want to do, together with ALSA and FFTW. There's a bunch of tools to help reach this goal; at the very least you will find sources for spectrum analysers and gauges to use with JACK.
At first I thought this software would add one more layer and reduce the overall speed, but it doesn't at all. It's fast and the results are pretty good, with a small footprint.
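To give an idea of what the spectrum analyser part boils down to, here is a rough sketch in plain Python. It is not FFTW or JACK code: the small radix-2 FFT is only a stand-in for what FFTW would do much faster, and the function names, the Hann window and the 1 kHz test tone are all my own assumptions for illustration. In a real setup the samples would come from a JACK or ALSA capture buffer instead of a generated sine.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT (stand-in for FFTW).

    The input length must be a power of two.
    """
    n = len(samples)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the two half-size transforms.
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(samples, sample_rate):
    """Return (frequency_hz, magnitude) pairs for the bins below Nyquist."""
    n = len(samples)
    # A Hann window reduces leakage between neighbouring bins.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
        for i, s in enumerate(samples)
    ]
    bins = fft(windowed)
    return [(k * sample_rate / n, abs(bins[k])) for k in range(n // 2)]


def peak_frequency(samples, sample_rate):
    """Frequency of the strongest bin, the value a simple gauge would show."""
    return max(magnitude_spectrum(samples, sample_rate), key=lambda p: p[1])[0]


# Hypothetical test signal: a 1 kHz sine sampled at 8 kHz, 256 samples.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(256)]

if __name__ == "__main__":
    print("peak at", peak_frequency(signal, SAMPLE_RATE), "Hz")
```

The bin spacing is sample_rate / n, so with these numbers each bin is 31.25 Hz wide and the test tone falls exactly on one bin. A longer buffer gives finer resolution but adds latency, which is the usual trade-off for a real-time display.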