Starting Chotu - Part 1 Creating Keyword / WakeWord detection

Chotu is my pet project since 2016. I wanted to create a simple voice assistant which can take commands hindi and it’s name is chotu. As it’s a common name for child workers in India.

So a voice assistant which works in hindi. Some simple commands which should work

Activation Keyword - Chotu - It should start recording.
Chotu ——- laga de -> some music
Chotu ——- baja de -> some music
Chotu aaj mausam kaisa hai -> weather
Chotu main kya karun -> tell something random

Step 1 : Make an activation keyword identifier

Since in amazon alexa, the keyword identification is done on device so it has following constraints

It should be small enough for limited memory (How much)?
It should have limited parameters so that latency is low.

It should have

low False Rejection Rate - If we call chotu, it should be activated
low False Acceptance Rate - If we don’t call chotu, it should not be activated because we will send audio after chotu to internet so privacy reasons.

How to identify if chotu is called ?

Let’s assume we will operate on a 1 second sliding window. We want the device to listen constantly but not record any thing more than last 1 second. It can be assumed that “Chotu” word can be said in 1 second. Ideally, we want to have a lot of users say “Chotu” and see what is the time most users (90%) take to say the word. But for our small project, we will take 1 second.

Audio to Keyword detection

We have 1 second of audio. We would like to convert it to spectrogram, give it to a CNN and see if it matches with the spectrogram of Chotu.

So how do we create a spectrogram ? We will use MFCC - Mel frequency Cepstal coefficients. Here are the steps

Extract 25ms subsections with 10ms overlap
use DFT (Discrete Fourier Transform) to create a periodogram
Apply Mel spaced Triangle filters on periodogram to extract numerical features
There will be 26 features from a 25 ms section.
Take log of all features from all sections to create MFCC spectrogram

Each step is itelf a blog post and I will not go in detail here but below images can be used to understand the process. I would suggest to use google to get the whole process.

This can be done with one command in python, but we should understand the process.

Once we have the spectrogram we can use a CNN to send the spectrogram to identify whether it is of CHOTU or not.

Some problems:

Not all CHOTU words will be in the middle of the 1 sec audio clip. Some can be the start, some can be at the end. So we need to have multiple spectrograms with these possibilites.
What if there is noise in the background. Use mean subtraction to remove the noise.

So this is the basic idea of keyword detection and I would add code for it later.

Blog Archive

Archive of all previous blog posts

Blog Archive

Archive of all previous blog posts