Chotu is my pet project since 2016. I wanted to create a simple voice assistant which can take commands hindi and it’s name is chotu. As it’s a common name for child workers in India.

So a voice assistant which works in hindi. Some simple commands which should work

  1. Activation Keyword - Chotu - It should start recording.
  2. Chotu ——- laga de -> some music
  3. Chotu ——- baja de -> some music
  4. Chotu aaj mausam kaisa hai -> weather
  5. Chotu main kya karun -> tell something random

Step 1 : Make an activation keyword identifier

Since in amazon alexa, the keyword identification is done on device so it has following constraints

  1. It should be small enough for limited memory (How much)?
  2. It should have limited parameters so that latency is low.

It should have

  1. low False Rejection Rate - If we call chotu, it should be activated
  2. low False Acceptance Rate - If we don’t call chotu, it should not be activated because we will send audio after chotu to internet so privacy reasons.

How to identify if chotu is called ?

Let’s assume we will operate on a 1 second sliding window. We want the device to listen constantly but not record any thing more than last 1 second. It can be assumed that “Chotu” word can be said in 1 second. Ideally, we want to have a lot of users say “Chotu” and see what is the time most users (90%) take to say the word. But for our small project, we will take 1 second.

Audio to Keyword detection

We have 1 second of audio. We would like to convert it to spectrogram, give it to a CNN and see if it matches with the spectrogram of Chotu.

So how do we create a spectrogram ? We will use MFCC - Mel frequency Cepstal coefficients. Here are the steps

  1. Extract 25ms subsections with 10ms overlap
  2. use DFT (Discrete Fourier Transform) to create a periodogram
  3. Apply Mel spaced Triangle filters on periodogram to extract numerical features
  4. There will be 26 features from a 25 ms section.
  5. Take log of all features from all sections to create MFCC spectrogram

Each step is itelf a blog post and I will not go in detail here but below images can be used to understand the process. I would suggest to use google to get the whole process.

This can be done with one command in python, but we should understand the process.

Once we have the spectrogram we can use a CNN to send the spectrogram to identify whether it is of CHOTU or not.

Some problems:

  1. Not all CHOTU words will be in the middle of the 1 sec audio clip. Some can be the start, some can be at the end. So we need to have multiple spectrograms with these possibilites.
  2. What if there is noise in the background. Use mean subtraction to remove the noise.

So this is the basic idea of keyword detection and I would add code for it later.


<
Blog Archive
Archive of all previous blog posts
>
Blog Archive
Archive of all previous blog posts