Have you ever felt the need to track your speaking habits, control your home with your voice, or simply record from multiple microphones at once? Then you're in luck! In this Instructable, you will learn how to install and use an array of microphones to simultaneously capture speech from individual rooms as text.
To do this, we'll be using the GStreamer media library and CMU's PocketSphinx speech-to-text utility, running with Python 2.7 on Ubuntu 14.04. This setup is extensible - if you're not looking for speech-to-text and instead want to do some other audio processing, GStreamer has a wide array of plugins that can be hooked up to your multi-microphone array to do recording, audio level monitoring, frequency analysis, and more!
Total Cost (per microphone):
$30 wired, or $90 wireless
You Will Need (per microphone):
One quick note on the optional wireless system - there are much cheaper ways to do wireless microphone audio (we personally tried out this clip-on announcer microphone setup for $13 total), but they're often limited to a single channel. If you buy several of these, they'll broadcast to all receivers and you won't get that awesome per-room granularity.
Let's get to work!
Step 1: Preparing the Microphones
An Unfortunate Circumstance:
When our eBay-sourced USB audio dongles arrived, we were excited to finally have multiple microphone jacks! But this was short-lived when we realized that the audio immediately cut out when we plugged any of our active microphones into the jack.
It turns out the problem lies in how the two sides are wired. The microphones' mono plugs are designed to be plugged into a stereo recording system, and thus ground the right channel so no audio is heard on that channel. The USB audio dongles, however, have mono microphone jacks that short the left and right channels together so that any stereo input is recorded as mono without either channel being dropped. The result? Both channels are grounded, and the audio reading is a flat line.
Luckily, the solution is easy. We popped open the case of one of our dongles, located the audio jack, and snipped the leads off the right audio channel position (the "ring") to break the circuit. Now the USB dongle only records from the left audio channel, which is what the active microphones transmit.
You'll likely have to do the same - if your mic is silent via the USB dongle but works with the audio jack on your PC, crack open the case and clip the middle lead on the microphone jack to prevent the dongle from mixing the left and (grounded) right audio channels.
After making the modifications to the USB dongles, the rest isn't hard to set up. We did little more than open the box, install a single AA battery, and flip the mic switch to "tele" to get each microphone operational. The "tele" mode records sound in a narrower cone, which reduces noise from the sides - there's also a "normal" mode which acts like a regular omnidirectional microphone, but our setup worked better using a narrow field.
After installing the battery and switching it on, do the following for each microphone:
We recommend installing Audacity and recording from each microphone individually to make sure everything works. You'll likely have a list of USB sound device entries in your microphone input list - ours worked when we selected "Input Device - USB PnP Sound Device: Audio (hw:2,0): Mic:0".
Onward to installation!
Step 2: Placement and Wireless
With the microphones sending audio and the USB audio jacks receiving audio, we're ready to install our mics into the rooms themselves. However, installing a microphone that can hear an entire room is a bit of an art form - even with the correct type of microphone, placement itself is often done through common sense and experimentation. Here are a few tips that we think may help you when installing your microphones:
For our own setup (seen in the video), we put a shotgun microphone on a small table in the corner that faces our couch and memory foam blob that people commonly gather around. We also had some extra lapel microphones that we installed in front of each of our computers, since we knew we would be sitting in front of them very often.
We did some additional installation in the starting room of the video - this room was pretty far away from our PC. We didn't want to run a cable between them, so we made our mics wireless!
Rather than plugging the gold audio adapter directly into the USB audio dongle, plug it into the Bluetooth transmitter and stick the receiver into the audio dongle's mic jack. The modules we bought have their jacks inverted, so use those male-to-male and female-to-female adapters to connect it all. To pair the Bluetooth modules, press and hold the single button on each until both are flashing. When both modules light up in a solid color, they're paired and ready to use. You can then test the entire system as usual.
We found that adding Bluetooth to our setup introduced about 100 ms of delay, but for speech-to-text this isn't a huge problem. It didn't diminish transcription accuracy at all.
Next, we'll tackle software installation.
Step 3: The Software
Now you'll need a few software packages installed before we can run our code. Copy and paste the following command into your terminal to do this:
sudo apt-get update && sudo apt-get install git python-gst0.10 gstreamer0.10-pocketsphinx
You'll also need some boilerplate code to get everything running - we've provided it here. After navigating to where you want the files to be, install it with this command:
git clone https://github.com/smartin015/MultiRoomSTT.git
Finally, change directory to MultiRoomSTT and run the main file:
cd MultiRoomSTT && python main.py
The script will list all of the audio input devices it can find. Press enter, and a bunch of setup text will run by. Speaking into any of the microphones you've set up will display a bunch of lines of the following format:
These are partial transcription results - the translation ID remains the same until the transcription completes, at which point you'll see a line beginning with "###" to indicate a complete transcription. The audio ID indicates which audio device is being transcribed, and the transcription indicates what Sphinx thought it heard through the microphone.
If you've made it this far, congratulations! You've got a working multi-microphone speech-to-text setup. But we're sure you'd like better transcription accuracy, and perhaps a look under the hood at the python script.
Read on to find out more about both!
Step 4: Improvements and Modifications
We noticed that the default transcription done by Sphinx is, simply put, terrible. Luckily, it's easy enough to fix this by creating a custom language model. This restricts the range of words Sphinx can identify, resulting in fewer mistranscriptions.
Follow the instructions in the link above to generate your language model, and download the created files to the MultiRoomSTT folder. Next, open up main.py and provide the absolute paths to each file in the LM_PATH and DICT_PATH variables. The script will use these language models the next time it is run. You should see a huge increase in correct transcriptions as long as the speech being transcribed only uses words from this model.
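If the script doesn't seem to pick up your model, a quick sanity check on the two paths can save some head-scratching. Here's a minimal sketch of one - the file names are made up (use whatever your generated files are called), and check_model_paths is our own helper for illustration, not something that ships with MultiRoomSTT:

```python
import os

# Hypothetical locations; substitute the files you actually downloaded.
LM_PATH = "/home/you/MultiRoomSTT/custom.lm"
DICT_PATH = "/home/you/MultiRoomSTT/custom.dic"


def check_model_paths(lm_path, dict_path):
    """Return a list of human-readable problems with the two model paths."""
    problems = []
    for label, path in (("LM_PATH", lm_path), ("DICT_PATH", dict_path)):
        if not os.path.isabs(path):
            # Relative paths depend on where the script is started from.
            problems.append("%s is not an absolute path: %s" % (label, path))
        elif not os.path.isfile(path):
            problems.append("%s does not point to a file: %s" % (label, path))
    return problems
```

An empty list back from check_model_paths(LM_PATH, DICT_PATH) means both paths are absolute and both files exist.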
How the Code Works:
When main.py first runs, it looks for a list of audio sources via the source_discovery.py script. That script runs the terminal command "pacmd list-sources" and parses out the audio sources from the results, keeping the name, ID, and bus path of input devices only. The name is useful for human readability, and the ID is what we use to identify which audio source to record from.
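To give a feel for that kind of parsing, here's a rough sketch in Python. It is not the actual source_discovery.py code - the sample text is a trimmed, made-up imitation of "pacmd list-sources" output, and treating device.class = "monitor" as "not an input device" is our assumption for this example:

```python
import re

# Trimmed, invented sample in the general shape of "pacmd list-sources" output.
SAMPLE_OUTPUT = """2 source(s) available.
  * index: 0
    name: <alsa_output.pci-0000_00_1b.0.analog-stereo.monitor>
    properties:
        device.class = "monitor"
        device.bus_path = "pci-0000:00:1b.0"
    index: 2
    name: <alsa_input.usb-0d8c_C-Media_USB_Audio_Device-00.analog-mono>
    properties:
        device.class = "sound"
        device.bus_path = "pci-0000:00:1d.0-usb-0:4.6:1.0"
"""

INDEX_RE = re.compile(r"^\*?\s*index:\s*(\d+)$")


def parse_sources(text):
    """Collect id, name, and bus path for each source, skipping monitors."""
    sources = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = INDEX_RE.match(line)
        if match:
            # An "index:" line starts a new source entry.
            current = {"id": int(match.group(1)), "name": None,
                       "bus_path": None, "device_class": None}
            sources.append(current)
        elif current is None:
            continue  # header text before the first source
        elif line.startswith("name:"):
            current["name"] = line[len("name:"):].strip().strip("<>")
        elif line.startswith("device.bus_path ="):
            current["bus_path"] = line.split("=", 1)[1].strip().strip('"')
        elif line.startswith("device.class ="):
            current["device_class"] = line.split("=", 1)[1].strip().strip('"')
    # Assumption: monitor sources mirror outputs and are not microphones.
    return [s for s in sources if s["device_class"] != "monitor"]
```

On a real machine you would feed it the text returned by running pacmd list-sources (for example via subprocess.check_output) instead of SAMPLE_OUTPUT.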
The bus path field is the most interesting - it contains information about which port the input device is plugged into and can be referenced to determine which room a given USB device is recording from. For instance, if you want to grab audio from your living room and you know the dongle is plugged into port 6 of the USB hub which is plugged into port 4 on your computer, look for "usb-0:4.6:1.0" in the bus path and you'll find the ID of your dongle. Bus path is persistent across plugs/unplugs and reboots, so you don't have to keep trying random device IDs to find the microphone you're looking for.
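The bus path lookup described above can be sketched as a small table from port string to room. Everything here besides the "usb-0:4.6:1.0" example is invented for illustration (the kitchen entry, the function names, and the shape of the source dictionaries), so adapt it to however you store your sources:

```python
# Port strings map to rooms. The kitchen entry is hypothetical.
ROOM_BY_PORT = {
    "usb-0:4.6:1.0": "living room",
    "usb-0:4.5:1.0": "kitchen",
}


def room_for_bus_path(bus_path, room_by_port=ROOM_BY_PORT):
    """Return the room whose port string appears in bus_path, or None."""
    for port, room in room_by_port.items():
        if port in bus_path:
            return room
    return None


def source_id_for_room(sources, room, room_by_port=ROOM_BY_PORT):
    """Return the ID of the first source plugged into the given room's port.

    `sources` is assumed to be a list of dicts with "id" and "bus_path" keys.
    """
    for source in sources:
        if room_for_bus_path(source["bus_path"], room_by_port) == room:
            return source["id"]
    return None
```

Because the table is keyed on the physical port rather than the device ID, it keeps working as long as each dongle stays in the same port.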
After gathering source information, the script then creates a SpeechParser object for each audio source and runs them in a main loop. The SpeechParser class abstracts away all the messy GStreamer code: setting up the pipeline, setting callback properties, and linking the audio source to PocketSphinx.
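If you're curious what "linking the audio source to PocketSphinx" looks like, here's a hedged sketch of a pipeline description in the style of the classic GStreamer 0.10 PocketSphinx examples. We aren't claiming this is exactly what SpeechParser builds - the function name is ours, and the element chain is an assumption based on those examples:

```python
def build_pipeline_description(device):
    """Build a gst-launch style description for one PulseAudio source.

    `device` is handed to pulsesrc's "device" property. The vader element
    splits the audio into utterances before pocketsphinx decodes them, and
    fakesink discards the audio once it has been processed.
    """
    return (
        'pulsesrc device="%s" ! audioconvert ! audioresample '
        "! vader name=vad auto-threshold=true "
        "! pocketsphinx name=asr ! fakesink" % device
    )
```

With the python-gst0.10 bindings from Step 3, a string like this is typically passed to gst.parse_launch, after which the element named "asr" can be looked up with get_by_name and connected to callbacks. Building one description per audio source is what gives each room its own independent pipeline.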
Once the main loop is running, the pipelines will listen on their respective audio sources and pass them through Sphinx (more info on that here). When Sphinx is in the middle of transcribing a string of phonemes, it sends callbacks through SpeechParser to the passed partial_cb function with what it thinks it heard. When the microphone detects silence and Sphinx finishes parsing the speech, the result is passed through SpeechParser to final_cb.
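Here's one way you might consume those two callbacks to keep a per-room transcript. Note that the (audio_id, utterance_id, text) argument order is our assumption for this sketch, so check the SpeechParser source for the real callback signature before wiring it up:

```python
class TranscriptCollector(object):
    """Keeps the latest partial guess and all finished lines per audio source."""

    def __init__(self):
        self.in_progress = {}  # audio_id -> latest partial transcription
        self.finished = {}     # audio_id -> list of completed transcriptions

    def partial_cb(self, audio_id, utterance_id, text):
        # Each partial result replaces the previous guess for that source.
        self.in_progress[audio_id] = text

    def final_cb(self, audio_id, utterance_id, text):
        # The utterance is complete: drop the partial and store the result.
        self.in_progress.pop(audio_id, None)
        self.finished.setdefault(audio_id, []).append(text)
```

You would then hand collector.partial_cb and collector.final_cb to whatever creates your parsers, and read collector.finished whenever you want everything that was said near a given microphone.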
Step 5: Going From Here
We hope you've enjoyed our instruction on building a multi-room speech-to-text transcription device. We're using this setup for a voice-controlled home automation system, but there are plenty of other situations where this may be useful, such as:
We've got a lot of exciting new content on the way this summer, so if you enjoyed this Instructable, don't forget to favorite it and follow us on Facebook!