Every machine learning pipeline starts with a data collection and preparation stage. I collected audio samples in .WAV format using the Audacity desktop application on Windows, at a 44.1 kHz sample rate. Note that we treat silence as a separate class. Each sample is roughly 30 seconds long. For the music, fire alarm, and vacuum cleaner classes, I recorded samples from online audio. For each class, I collected at least 20 samples across different sessions with different ambient noise and backgrounds, which gives a more diverse dataset and helps the model generalize and stay robust.

For loading audio samples, preprocessing, and feature extraction, I used a popular Python library called Librosa. I highly recommend installing it with Conda to avoid inconsistencies and hassles on Windows.
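For reference, librosa is published on the conda-forge channel, so (assuming a working Conda environment) the install is a one-liner:

```shell
conda install -c conda-forge librosa
```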

```
import os
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from scipy.fft import rfft, rfftfreq
import cv2
import pickle
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay

ROOT_DIR = 'C:/Users/amiri/Desktop/demo/dataset/'
SAMPLING_RATE = 44100  # consistent over the entire dataset recordings

def get_all_directories(root_path):
    dirs = os.listdir(root_path)
    dirs = [dir for dir in dirs if os.path.isdir(root_path + dir)]
    return dirs

def get_all_files(path):
    files = os.listdir(path)
    files = [file for file in files if os.path.isfile(path + file)]
    return files

def load_all_audio_files(root_path, duration=30):
    files = get_all_files(root_path)
    file_samples = []
    for file in files:
        samples, sampling_rate = librosa.load(root_path + file,
                                              sr=None, mono=True, offset=0.0,
                                              duration=duration)
        file_samples.append(samples)
    return file_samples

dataset = {}
for audio_class in get_all_directories(ROOT_DIR):
    dataset[audio_class] = load_all_audio_files(ROOT_DIR + audio_class + '/')
    print(f"number of {audio_class} samples: {len(dataset[audio_class])}")
```

Now let’s visualize a few samples and look at the audio signals to build some intuition about the different audio shapes and variations.

```
fig, axs = plt.subplots(4, figsize=(8, 5), sharex=True,
                        constrained_layout=True)
fig.suptitle('Time-Amplitude Visualization')
ax_index = 0
sample_index = 0
for audio_class in dataset:
    axs[ax_index].title.set_text(f'{audio_class} Audio Sample \n')
    librosa.display.waveshow(dataset[audio_class][sample_index],
                             sr=SAMPLING_RATE, ax=axs[ax_index])
    ax_index += 1
plt.show()
```

Based on the sound waves visualized above, we may be able to use some time-domain features such as the number of zero crossings, mean flatness, maximum amplitude, minimum amplitude, kurtosis, and skewness.
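The `get_time_domain_features` helper used below is not shown in the post, so here is a minimal sketch of what it might look like. The exact feature definitions are my assumptions; in particular, I stand in for "mean flatness" with the mean absolute amplitude.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def get_time_domain_features(samples):
    # Zero crossings: count sign changes between consecutive samples
    zero_crossings = int(np.count_nonzero(np.diff(np.sign(samples))))
    return (zero_crossings,
            float(np.mean(np.abs(samples))),  # stand-in for mean flatness
            float(np.max(samples)),
            float(np.min(samples)),
            float(kurtosis(samples)),         # Fisher kurtosis (normal -> 0)
            float(skew(samples)))
```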

```
data = []
labels = []
for audio_class in dataset:
    for audio_sample in dataset[audio_class]:
        time_domain_features = list(get_time_domain_features(audio_sample))
        feature_set = np.concatenate([time_domain_features])
        labels.append(audio_class)
        data.append(feature_set)
data = np.array(data)
labels = np.array(labels)
```

Now that we constructed the feature set, we can go ahead and feed them into different classification methods and see if they can correctly classify audio recordings!

```
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(data, labels, test_size=0.3, shuffle=True)
svm_rbf = SVC()
svm_rbf.fit(xtrain, ytrain)
svm_rbf_scores = cross_val_score(svm_rbf, xtrain, ytrain, cv=10)
print('Average Cross Validation Score from Training:',
      svm_rbf_scores.mean(), sep='\n', end='\n\n\n')
svm_rbf_ypred = svm_rbf.predict(xtest)
svm_rbf_cr = classification_report(ytest, svm_rbf_ypred)
print('Test Statistics:', svm_rbf_cr, sep='\n', end='\n\n\n')
svm_rbf_accuracy = accuracy_score(ytest, svm_rbf_ypred)
print('Testing Accuracy:', svm_rbf_accuracy)
fig, ax = plt.subplots(figsize=(5, 5))
ConfusionMatrixDisplay.from_estimator(svm_rbf, xtest, ytest,
                                      ax=ax, cmap='RdYlGn')
plt.show()
```

and Voila!

Disappointing! Isn’t it?!

Nope! It’s too early to get disappointed. We should explore and try more features. We can also convert the audio samples to the frequency domain and see if we can extract more meaningful information and features. It’s easy to transform audio signals to the frequency domain in the Librosa and SciPy libraries using the Fast Fourier Transform. The FFT is basically a method to transform discrete time-domain samples into a discrete frequency spectrum. In other words, it reveals the frequency distribution of the signal.
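For intuition, here is a tiny self-contained check (the 440 Hz tone is my own example, not from the dataset): the spectrum of a pure sine peaks exactly at its frequency.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

sr = 44100
t = np.arange(sr) / sr                  # one second of samples
tone = np.sin(2 * np.pi * 440 * t)      # pure 440 Hz sine wave
spectrum = np.abs(rfft(tone))
freqs = rfftfreq(len(tone), 1 / sr)     # bin frequencies in Hz
print(freqs[np.argmax(spectrum)])       # peak bin sits at 440.0 Hz
```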

```
from scipy.fft import rfft, rfftfreq

def Get_RFFT(audio_sample):
    N = len(audio_sample)
    yf = rfft(audio_sample)
    xf = rfftfreq(N, 1 / SAMPLING_RATE)
    return xf, yf

fig, axs = plt.subplots(4, figsize=(8, 5), sharex=True,
                        constrained_layout=True)
fig.suptitle('Frequency-Domain Visualization')
ax_index = 0
sample_index = 0
for audio_class in dataset:
    audio_sample = dataset[audio_class][sample_index]
    axs[ax_index].title.set_text(f'{audio_class} \n')
    audio_sample_xf, audio_sample_yf = Get_RFFT(audio_sample)
    axs[ax_index].plot(audio_sample_xf, np.abs(audio_sample_yf))
    ax_index += 1
plt.show()
```

“You do not really understand something unless you can explain it to your grandmother.”

It is quite important to be able to transfer your knowledge to others using plain, easily understandable descriptions, which also helps solidify your own comprehension of the topic. So, I have decided to start a series of blog posts building common machine learning algorithms from scratch, in order to clarify these methods for myself and make sure I correctly understand the mechanics behind each of them.

Let me be clear from the very beginning: **It’s all about fitting a function!**

Consider a dataset containing the number of trucks that crossed the border from Mexico to the U.S. through the Otay Mesa port. Note that it’s just a subset of the entire inbound-crossings dataset available on Kaggle. First of all, it’s always better to plot the data, which may give us some insight. To load the data from a .CSV file, we are going to use Pandas, a well-known data analysis/manipulation Python library. We can then plot the data using Matplotlib (another Python library, for data visualization).

```
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

df = pd.read_csv('./regression/dataset.csv')
x_training_set = pd.to_datetime(df.Date, infer_datetime_format=True).to_numpy()
y_training_set = df.Value.to_numpy()
number_of_training_examples = len(x_training_set)

# plot the raw data
fig, ax = plt.subplots(figsize=(8, 6))
year_locator = mdates.YearLocator(2)
year_month_formatter = mdates.DateFormatter("%Y-%m")
ax.xaxis.set_major_locator(year_locator)
ax.xaxis.set_minor_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(year_month_formatter)
ax.plot(x_training_set, y_training_set, ".")
fig.autofmt_xdate()
plt.show()
```

Note that in the original data, each value corresponds to a month, so I mapped the date intervals onto an integer representation.

What we are observing here is obviously not an exact linear function, but for the sake of simplicity we can model border crossings with a linear function! As we already know, the equation of a line is as below:

\[f(x) = mx + c\]where **m** stands for the slope and **c** is the intercept on the y axis. But there are infinitely many possible values for these parameters. Let’s look at some arbitrary lines with values [m=70, c=40000], [m=100, c=40000], [m=140, c=40000], represented in orange, green, and red respectively.

```
# plot arbitrary lines
def plot_line(ax, m, c, xMax):
    y0 = (1 * m) + c
    ymax = (xMax * m) + c
    ax.plot([x_training_set[0], x_training_set[xMax - 1]], [y0, ymax])

fig, ax = plt.subplots(figsize=(8, 6))
year_locator = mdates.YearLocator(2)
year_month_formatter = mdates.DateFormatter("%Y-%m")
ax.xaxis.set_major_locator(year_locator)
ax.xaxis.set_minor_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(year_month_formatter)
ax.plot(x_training_set, y_training_set, ".")
fig.autofmt_xdate()
plot_line(ax, 70, 40000, number_of_training_examples)
plot_line(ax, 100, 40000, number_of_training_examples)
plot_line(ax, 140, 40000, number_of_training_examples)
plt.show()
```

But what parameter values should we choose so that our linear equation properly fits the data points?

To find a proper fit, we basically have to minimize the average distance of the data points to our line. In other words, we measure the difference between the predicted values and the actual training data.

There are two common ways to calculate the error:

**Mean Squared Error (MSE)**: which considers the squared difference of the values.

**Mean Absolute Error (MAE)**: which considers the absolute difference of the values.

Note that we need to sum these errors as positive numbers; otherwise, negative values would cancel out positive ones and make our optimization problem meaningless.
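Written out explicitly, with \(y_i\) the actual value and \(y_{p_i}\) the predicted value over \(n\) training examples:

\[MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - y_{p_i})^2\] \[MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - y_{p_i}|\]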

Our objective is to find the linear equation parameters **m** and **c** that minimize the average error over the whole training set (known as the **cost function**), defined below:

\[J(m, c) = \frac{1}{n}\sum_{i=1}^{n}\left((mx_i + c) - y_i\right)^2\]

where **n** is the number of training examples. Now let’s explore the parameter space and plot the cost function to see what it looks like. For the MSE cost function we get the parameter-space/cost plot below.

```
# visualize the cost function over the parameter space
def line_equation(m, c, x):
    return (m * x) + c

def cost_function(m, c, training_examples_x, training_examples_y):
    sum_of_errors = 0
    item_index = 0
    for example in training_examples_x:
        # x is the integer index of the month (see the date mapping above)
        predicted_y = line_equation(m, c, item_index)
        sum_of_errors += (predicted_y - training_examples_y[item_index])**2
        # sum_of_errors += abs(predicted_y - training_examples_y[item_index])  # MAE variant
        item_index += 1
    mse = sum_of_errors / len(training_examples_x)
    return mse

fig = plt.figure()
fig.set_size_inches(8, 6)
ax = fig.add_subplot(projection='3d')
cost_func_x_points = []
cost_func_y_points = []
cost_func_z_points = []
for m in np.arange(-200, 500, 10):
    for c in np.arange(-10000, 60000, 200):
        cost = cost_function(m, c, x_training_set, y_training_set)
        cost_func_x_points.append(m)
        cost_func_y_points.append(c)
        cost_func_z_points.append(cost)
ax.scatter(cost_func_x_points, cost_func_y_points,
           cost_func_z_points, c=cost_func_z_points, marker='.')
ax.set_xlabel('M')
ax.set_ylabel('C')
ax.set_zlabel('Cost')
plt.show()
```

and for the MAE we have the plot below:

As you can see, both of these cost functions are convex: each has exactly one minimum point at the bottom of the slope (the global minimum). Based on calculus, this means that if we find the point where the derivative of the function is zero, we have found the optimal parameters for our model. We can simply use the equation below to find them.

\[\theta = (X^T X)^{-1} (X^T Y)\]where X is the matrix of training feature vectors, Y is the output vector, and the result (\(\theta\)) contains the parameters of our regression model.

This equation is called Normal Equation and you can find the math behind it here.

So let’s run it on our dataset and see how it works.

```
# normal equation using numpy
x_training_set_numbers = np.arange(0, len(x_training_set), 1)
ones_vector = np.ones((len(x_training_set), 1))
x_training_set_numbers = np.reshape(x_training_set_numbers, (len(x_training_set_numbers), 1))
x_training_set_numbers = np.append(ones_vector, x_training_set_numbers, axis=1)
theta_list = np.linalg.inv(x_training_set_numbers.T.dot(x_training_set_numbers)) \
    .dot(x_training_set_numbers.T).dot(y_training_set)
print(theta_list)

# visualize raw data with the fitted line
fig, ax = plt.subplots(figsize=(8, 6))
year_locator = mdates.YearLocator(2)
year_month_formatter = mdates.DateFormatter("%Y-%m")
ax.xaxis.set_major_locator(year_locator)
ax.xaxis.set_minor_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(year_month_formatter)
ax.plot(x_training_set, y_training_set, ".")
fig.autofmt_xdate()
plot_line(ax, theta_list[1], theta_list[0], number_of_training_examples)
plt.show()
```

Now that we have fit a line to our data, are we done? Nope!

We need to evaluate the model using different metrics and plots to make sure the model we proposed would generalize well to new data.

To evaluate a model, there are several metrics we can use.

**1- MSE (Mean Squared Error)**

**2- MAE (Mean Absolute Error)**

These two metrics have exactly the definitions we used for the cost functions. The main difference between them is that MSE penalizes large prediction errors more heavily than MAE. Generally, we want these scores to be as close to zero as possible.

```
mse_value = 0
mae_value = 0
m = theta_list[1]
c = theta_list[0]
for sample_x in range(0, number_of_training_examples):
    predicted_y = m * sample_x + c
    sample_y = int(y_training_set[sample_x])
    mse_value += (predicted_y - sample_y)**2
    mae_value += abs(predicted_y - sample_y)
print(f"Mean Squared Error: {mse_value // number_of_training_examples}\n")
print(f"Mean Absolute Error: {mae_value // number_of_training_examples}\n")
```

**3- R-Squared Score**

In terms of regression evaluation, the R-squared (R2) score is one of the most useful metrics. It indicates how much of the variance in the data is explained by the model. In other words, it indicates how close the data points are to the fitted line. The R2 score is defined as below:

\[R2 = {\frac{ExplainedVariance}{TotalVariance}}\] \[= {1 - \frac{Sum Of Squared Residuals (SSR)}{Total Sum Of Squares (SST)}}\] \[= {1 - \frac{\sum_{i=1}^{n} (y_i - y_{p_i})^2}{\sum_{i=1}^{n} (y_i - y_m)^2}}\]where \(y_m\) is the mean of the actual values. For a reasonable model, the R2 score lies between 0 and 1, and in regression modeling we aim to push it as close to 1 as possible. Note that a negative R2 score indicates that something is wrong with the modeling or the implementation.
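The definition above translates directly into a few lines of NumPy; this is a minimal sketch with made-up values (scikit-learn also ships an equivalent `r2_score`):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R2 = 1 - (sum of squared residuals) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# A perfect fit gives R2 = 1; always predicting the mean gives R2 = 0
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score(y, y))                     # 1.0
print(r2_score(y, np.full_like(y, 2.5)))  # 0.0
```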

**4- Residual Plot**

It’s also crucial to plot the residuals to see whether linear regression is a good choice for modeling our data. In regression, a residual is the difference between the actual value and the predicted value. Residual points should be distributed evenly along the horizontal axis; otherwise, the model is not performing well and probably is not reliable enough.

\[residuals = y - y_p\]In the figure above, except at the beginning and the end of the plot, where the residuals are larger, the residuals are somewhat randomly distributed. Generally, for a good linear regression model, the points in this plot should be as close as possible to the horizontal axis, and they should be uniformly distributed along it.
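The post does not show the code behind the residual figure, so here is a sketch of how such a plot can be produced. Since the fitted parameters aren't available here, it uses synthetic stand-ins for the crossings data (the slope, intercept, and noise level are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the dataset and fitted line above (hypothetical values)
rng = np.random.default_rng(0)
x = np.arange(100)
y = 100 * x + 40000 + rng.normal(0, 500, size=100)  # noisy linear data
m, c = np.polyfit(x, y, 1)                          # least-squares fit

residuals = y - (m * x + c)                         # actual minus predicted
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, residuals, marker='.')
ax.axhline(0, color='gray', linestyle='--')  # good fits scatter evenly around zero
ax.set_ylabel('Residual')
plt.show()
```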

Cuda toolkit is another software layer on top of the Nvidia driver. As mentioned on the Nvidia website, Cuda drivers are mostly forward compatible across Cuda toolkit versions. This means that if you already have nvidia-driver-515, which is a fairly new version, it is compatible with cuda-toolkit-11-2.

**Installing Nvidia Driver**

The Nvidia driver provides the underlying libraries necessary for the operating system (in our case **Ubuntu 20.04**) to work with graphics processors. To install the drivers you can simply run the following command in a terminal:

```
sudo ubuntu-drivers autoinstall
```

or the following command in case you need a specific version:

```
sudo apt install nvidia-driver-470
```

It is also possible to install the drivers using the Ubuntu Software Center.

To verify successful installation, run the command below:

```
nvidia-smi
```

If you look at the TensorFlow installation instructions, the compatible cuda-toolkit version is mentioned in the Conda install command.

To make TensorFlow work, we should install cuda-toolkit 11.2. We might assume that Conda will take care of the cudatoolkit installation, but it **DOESN’T**. It is necessary to install the Cuda toolkit separately on the system, either using the runfile provided by Nvidia or using apt.

Runfile installation:

```
wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run
sudo sh cuda_11.2.0_460.27.04_linux.run
```

Apt installation:

```
sudo apt install nvidia-cuda-toolkit-11-2
```

Note: it is necessary to add the cuda toolkit paths to your environment variables so that TensorFlow and PyTorch can find the libraries and tools. You can add the following lines to your **~/.bashrc** to pick up the cudatoolkit automatically in every new shell.

```
export CUDA_HOME=/usr/local/cuda
export PATH=/usr/local/cuda/bin:$PATH
export CPATH=/usr/local/cuda/include:$CPATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
```

To verify the cuda toolkit installation, run:

```
nvcc --version
```

**Setting up PyTorch(GPU):**

As described on the PyTorch website, it’s possible to set up different versions of PyTorch using Conda. It supports cuda toolkit 10.2, 11.3, and 11.6.

Note that other versions of the cuda toolkit will not work with PyTorch, so make sure the installed version is among the versions mentioned in the PyTorch installation instructions!

**Setting up TensorFlow(GPU):**

After setting up the Cuda driver and toolkit, we are ready to install TensorFlow using the commands below:

```
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
python3 -m pip install tensorflow
# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

Note that each time a new terminal is opened, it is **necessary** to export the cuda toolkit path (the LD_LIBRARY_PATH line above) before using TensorFlow. Alternatively, you can add the path permanently with the commands below,

```
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
```

This way Conda will automatically load cuda toolkit path each time activating an environment.

**Setting up TensorFlow(GPU) with Docker:**

One way to avoid problems with the cuda-toolkit, and also to be able to use different versions of TensorFlow, is to run TensorFlow Docker containers, which **DIRECTLY** interact with the Cuda driver. There are different TensorFlow containers, each integrated with Jupyter Lab. After running the container, TensorFlow will be accessible through Jupyter Lab.

After installing **Docker**, install nvidia-docker2 using the following command:

```
sudo apt-get install -y nvidia-docker2
```

Pull the TensorFlow container (bundled with Jupyter Lab):

```
docker pull tensorflow/tensorflow:latest-gpu-jupyter
```

Then run the container and access Jupyter Lab through your browser:

```
docker run --gpus all -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
```

Make sure to add the **--gpus all** flag to the command to give Docker access to the GPUs. You can then access the Jupyter notebook through **http://yourip:8888/** (use localhost if you are running it on a local computer). You can also add the **--restart unless-stopped** flag to make the container start again after each reboot (unless stopped manually).

```
docker run --restart unless-stopped --gpus all -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
```

To verify that GPU devices are available to TensorFlow, run the following code in a Jupyter notebook cell:

```
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
```