Implementing a real-life example of the things I’m trying to comprehend is the most exciting part of the learning process for me: it gives me a chance to run into the dark corners and pitfalls. Accordingly, this is a brief write-up of a recent project on simple audio classification using machine learning. In this project, I implement a machine learning model to classify fire alarm, vacuum cleaner, music, and silence sounds.

Data Collection

Every machine learning pipeline starts with a data collection and preparation stage. I collected audio samples in .WAV format using the Audacity desktop application on Windows, at a 44.1 kHz sample rate. Note that we treat silence as a separate class. Each sample is roughly 30 seconds long. For music, fire alarm, and vacuum cleaner, I recorded samples from online audio. For each class, I collected at least 20 samples across different sessions, with different ambient noise and backgrounds, to build a more diverse dataset that helps the model generalize better and be more robust.

For loading the audio samples, preprocessing, and feature extraction, I used a popular Python library called Librosa. I highly recommend installing it with Conda to avoid inconsistencies and hassles on Windows.
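
Assuming you already have a working Conda setup, Librosa is available on the conda-forge channel, so a single command along these lines should be enough:

conda install -c conda-forge librosa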


import os
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from scipy.fft import rfft, rfftfreq
import cv2
import pickle

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay

ROOT_DIR = 'C:/Users/amiri/Desktop/demo/dataset/'
SAMPLING_RATE = 44100 # consistent across all recordings in the dataset

def get_all_directories(root_path):
    dirs = os.listdir(root_path)
    dirs = [dir for dir in dirs if os.path.isdir(root_path+dir)]
    return dirs

def get_all_files(path):
    files = os.listdir(path)
    files = [file for file in files if os.path.isfile(path+file)]
    return files
   
def load_all_audio_files(root_path, duration=30):
    files = get_all_files(root_path)
    file_samples = []
    for file in files:
        samples, sampling_rate = librosa.load(root_path+file,
            sr=None, mono=True, offset=0.0, duration=duration)
        file_samples.append(samples)
    return file_samples

dataset = {}
for audio_class in get_all_directories(ROOT_DIR):
    dataset[audio_class] = load_all_audio_files(ROOT_DIR + audio_class+'/')
    print(f"number of {audio_class} samples: {len(dataset[audio_class])}") 


Now let’s visualize a few samples and take a look at the audio signals to get an intuition about the different waveform shapes and variations.


fig, axs = plt.subplots(4,figsize=(8, 5), sharex=True,
constrained_layout = True) 
fig.suptitle('Time-Amplitude Visualization')

ax_index = 0
sample_index = 0
for audio_class in dataset:
    axs[ax_index].title.set_text(f'{audio_class} Audio Sample \n')
    librosa.display.waveshow(dataset[audio_class][sample_index],
     sr=SAMPLING_RATE, ax = axs[ax_index])
    ax_index+=1
plt.show()

Domain-Specific Features

Based on the sound waves visualized above, we may be able to use some time-domain features such as the number of zero crossings, mean flatness, maximum amplitude, minimum amplitude, kurtosis, and skewness.
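
The feature extraction below relies on a get_time_domain_features helper that isn’t shown in this post. Here is a minimal sketch of how it could look, assuming it simply computes the features listed above (the exact implementation in the original project may differ):

from scipy.stats import kurtosis, skew

def get_time_domain_features(audio_sample):
    # number of sign changes in the waveform
    zero_crossings = np.sum(librosa.zero_crossings(audio_sample))
    # "mean flatness", approximated here as librosa's spectral flatness averaged over frames
    mean_flatness = np.mean(librosa.feature.spectral_flatness(y=audio_sample))
    # extreme amplitudes of the signal
    max_amplitude = np.max(audio_sample)
    min_amplitude = np.min(audio_sample)
    # higher-order statistics of the amplitude distribution
    kurt = kurtosis(audio_sample)
    skewness = skew(audio_sample)
    return (zero_crossings, mean_flatness, max_amplitude,
            min_amplitude, kurt, skewness)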


data = []
labels = []
for audio_class in dataset:
    for audio_sample in dataset[audio_class]:
        time_domain_features = list(get_time_domain_features(audio_sample))
        feature_set = np.concatenate([time_domain_features])
        labels.append(audio_class)
        data.append(feature_set)

data = np.array(data)
labels = np.array(labels)

Now that we’ve constructed the feature set, we can go ahead and feed it into different classification methods and see if they can correctly classify the audio recordings!


xtrain, xtest, ytrain, ytest = train_test_split(data, labels, test_size=0.3, shuffle=True)
svm_rbf = SVC()
svm_rbf.fit(xtrain, ytrain)
svm_rbf_scores = cross_val_score(svm_rbf, xtrain, ytrain, cv=10)
print('Average Cross Validation Score from Training:',
 svm_rbf_scores.mean(), sep='\n', end='\n\n\n')


svm_rbf_ypred = svm_rbf.predict(xtest)
svm_rbf_cr = classification_report(ytest, svm_rbf_ypred)

print('Test Statistics:', svm_rbf_cr, sep='\n', end='\n\n\n')

svm_rbf_accuracy = accuracy_score(ytest, svm_rbf_ypred)
print('Testing Accuracy:', svm_rbf_accuracy)

fig, ax = plt.subplots(figsize=(5,5))
ConfusionMatrixDisplay.from_estimator(svm_rbf, xtest, ytest,
 ax = ax, cmap='RdYlGn')
plt.show()

and Voila!

Disappointing! Isn’t it?!

Nope! It’s too early to get disappointed. We should explore and try more features. We can also convert the audio samples to the frequency domain and see if we can extract more meaningful information and features. It’s easy to transform audio signals to the frequency domain with the Librosa and SciPy libraries using the Fast Fourier Transform, which is essentially a method to transform discrete time-domain samples into a discrete frequency spectrum. In other words, it reveals the frequency distribution of the signal.


from scipy.fft import rfft, rfftfreq

def Get_RFFT(audio_sample):
    N = len(audio_sample)
    yf = rfft(audio_sample)
    xf = rfftfreq(N, 1 / SAMPLING_RATE)
    return xf, yf

fig, axs = plt.subplots(4,figsize=(8, 5), sharex=True,
constrained_layout = True) 
fig.suptitle('Frequency-Domain Visualization')

ax_index = 0
sample_index = 0
for audio_class in dataset:
    audio_sample = dataset[audio_class][sample_index]
    axs[ax_index].title.set_text(f'{audio_class} \n')
    audio_sample_xf, audio_sample_yf = Get_RFFT(audio_sample)
    axs[ax_index].plot(audio_sample_xf, np.abs(audio_sample_yf))
    ax_index+=1

plt.show()
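
As a sketch of where this could go next (the helper name below is illustrative, not from the original project), the magnitude spectrum returned by Get_RFFT can be summarized into a few frequency-domain features:

def get_frequency_domain_features(audio_sample):
    # magnitude spectrum from the helper defined above
    xf, yf = Get_RFFT(audio_sample)
    magnitudes = np.abs(yf)
    # simple summary statistics of the spectrum
    mean_magnitude = np.mean(magnitudes)
    std_magnitude = np.std(magnitudes)
    # frequency with the strongest component
    peak_frequency = xf[np.argmax(magnitudes)]
    # spectral centroid: magnitude-weighted average frequency
    spectral_centroid = np.sum(xf * magnitudes) / np.sum(magnitudes)
    return (mean_magnitude, std_magnitude, peak_frequency, spectral_centroid)

These values could then be appended to the existing feature vector, e.g. np.concatenate([time_domain_features, frequency_domain_features]), and fed to the same SVM pipeline.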