Using Scikit-learn to Build a Model from Non-Tabular Data: Audio, Text, and Images
- Aradhana Bopche
- Apr 12
- 5 min read
Updated: Apr 24
In this blog, we’ll explore how to use Scikit-learn to build machine learning models with non-tabular data such as audio, text, and images. These data types require special preprocessing techniques, which we’ll cover step by step. By the end of this guide, you’ll have a deeper understanding of how to work with these non-tabular formats using Scikit-learn and related libraries.
Step 1: Setting Up the Environment
Before getting into each type of non-tabular data, let's ensure our environment is set up with the necessary libraries.
We’ll need:
Scikit-learn: For building and evaluating machine learning models.
NumPy: For numerical operations.
Pandas: For handling any data manipulation (though minimal here).
Matplotlib/Seaborn: For visualization.
Librosa: For audio data handling.
nltk: For text data preprocessing (Natural Language Toolkit).
OpenCV or Pillow: For image processing.
You can install these libraries via pip:

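```bash
# opencv-python and pillow are the PyPI package names for OpenCV and Pillow
pip install scikit-learn numpy pandas matplotlib seaborn librosa nltk opencv-python pillow
```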
Now, let’s import all the libraries into our project:

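```python
# A representative set of imports for the examples in this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
import nltk
import cv2
from PIL import Image
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```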
Step 2: Audio Data — Building a Model with Scikit-learn
Audio Data Overview
Audio data is a type of non-tabular data that is usually represented as waveforms, which show how sound (or air pressure) changes over time. These waveforms are essentially long sequences of numbers, and while they carry rich information, they are not directly usable by machine learning models. That’s why we need to preprocess the audio and extract useful features that represent important aspects of the sound, like pitch, frequency, and tone. This process is called feature extraction.
Two common methods for this are MFCC (Mel Frequency Cepstral Coefficients) and spectrograms. MFCCs are widely used in speech and music applications because they mimic how the human ear hears sound, capturing the tone and texture.
Spectrograms, on the other hand, are like colorful heatmaps that show which frequencies are present at each moment in time and how strong they are. These features help turn raw audio into meaningful patterns that models can learn from. Once extracted, these features can be used to train traditional machine learning models (like Random Forest or SVM) for tasks such as speech recognition, audio classification, and emotion detection.
Loading Audio Data with Librosa

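A minimal loading example (audio_file.wav is a placeholder; any audio file you have will do):

```python
# Load the audio file; librosa resamples to 22,050 Hz and mixes to mono by default
y, sr = librosa.load('audio_file.wav')

print(y.shape)  # the raw waveform as a 1D NumPy array
print(sr)       # the sampling rate in samples per second
```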
In the code above:
y represents the audio waveform (the raw audio signal).
sr is the sampling rate (the number of samples captured per second of audio).
Extracting Audio Features with MFCC
Now that we have the audio loaded, we’ll extract MFCC features, which are a popular feature set for speech recognition and other audio-based machine learning tasks.


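Here’s a sketch of the extraction, followed by a quick visualization:

```python
# Extract 13 MFCC coefficients from the waveform loaded above
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_time_frames)

# Display the MFCCs as a heatmap over time
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()
plt.show()
```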
In this code:
librosa.feature.mfcc() extracts 13 MFCC coefficients from the audio signal.
We use librosa.display.specshow() to display the MFCC features as a heatmap over time.
These MFCCs will be the features that we feed into a machine learning model, which can learn patterns from the audio.
Preparing the Data for ML Models
Before we can train a model, we need to prepare the data. Here, we’ll average the MFCCs over time into a fixed-length 1D feature vector and normalize the values. This step is necessary because Scikit-learn models expect a 2D input (samples x features), with one fixed-length row per audio clip.



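One way to do this for a single clip (illustrative; with a real dataset you would build one such row per clip and stack them):

```python
# Average each MFCC coefficient over time to get one fixed-length vector per clip
mfccs_mean = np.mean(mfccs, axis=1)   # shape: (13,)

# Reshape to 2D (samples x features); our single clip is one row
X = mfccs_mean.reshape(1, -1)

# Standardize features to zero mean and unit variance
# (with only one clip this is purely illustrative; in practice,
# fit the scaler on the full training matrix)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```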
In the above code:
np.mean() computes the mean of each MFCC coefficient over time.
StandardScaler() normalizes the features to have a mean of 0 and a standard deviation of 1, which is often helpful for machine learning models.
Training an Audio Classification Model
Now that we have the audio features (MFCCs), let's train a machine learning model. We will use Random Forest for this example, but you can easily swap it with other models like SVM or KNN.


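A minimal sketch, assuming our single clip carries a made-up label (a real project would have many labeled clips and a proper train/test split):

```python
# A dummy label for our single clip; with real data you would have
# one label per audio file (e.g., a speaker ID or emotion class)
labels = np.array([0])

# Initialize and train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_scaled, labels)

# Predict on the same sample (a simplification; normally you would
# evaluate on a held-out test set)
predictions = clf.predict(X_scaled)
print('Accuracy:', accuracy_score(labels, predictions))
```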
This code:
Initializes and trains a Random Forest classifier.
Predicts the class of the same training sample (a simplification; usually we would split the data into training and test sets).
Prints the model’s accuracy.
Step 3: Image Data — Building a Model with Scikit-learn
Now that we've covered audio, let’s move to image data. We'll convert images into feature vectors that can be fed into machine learning models.
Images are a type of non-tabular data made up of pixels, where each pixel stores color and brightness as numeric values. Machine learning models can’t work with image files (like .jpg or .png) directly, so we first need to convert them into numerical arrays that models can understand.
To do this, we’ll use image processing libraries like Pillow or OpenCV. These tools help us:
Open and display images
Convert them to grayscale (if needed)
Resize them to a consistent shape
Normalize the pixel values (scale them between 0 and 1)
Flatten them into a 1D feature vector
Loading and Preprocessing Image Data
We'll use Pillow or OpenCV to handle images. Let’s load an image and flatten it into a 1D vector:
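A minimal sketch using Pillow (image_file.png is a placeholder, and we resize to 28x28 to match the example below):

```python
# Open the image; 'image_file.png' is a placeholder for your own file
img = Image.open('image_file.png')

img = img.convert('L')      # convert to grayscale
img = img.resize((28, 28))  # resize to a consistent shape

pixels = np.array(img) / 255.0  # normalize pixel values to [0, 1]
features = pixels.flatten()     # flatten to a 1D feature vector
print(features.shape)           # (784,)
```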
This 1D vector becomes the input to our machine learning model, just like a row in a table. For example, a 28x28 grayscale image turns into a 784-element vector (because 28 × 28 = 784). This format works well with traditional models like Random Forest, Logistic Regression, or SVM.


Training an Image Classifier
Now we’ll train a machine learning model (Random Forest, for instance) on the image data:


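Again a minimal sketch with a single image; real training data would be a stack of many such vectors with one label each:

```python
# Build the 2D (samples x features) array; our single image is one row
X_img = features.reshape(1, -1)
y_img = np.array([0])  # a dummy label, purely illustrative

clf_img = RandomForestClassifier(n_estimators=100, random_state=42)
clf_img.fit(X_img, y_img)

print(clf_img.predict(X_img))  # predict the class of our single image
```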
Step 4: Text Data — Building a Model with Scikit-learn
For text, we need to convert the raw words into a numeric format, and TF-IDF (Term Frequency-Inverse Document Frequency) is a great method for this task.
Text Data Overview — Why Vectorization is Needed
Text is one of the most common and powerful sources of data in the real world — think of product reviews, tweets, news articles, or chatbot messages. But machine learning models can’t directly understand raw text like we do. They need numbers, not words.
So, before we can train a model on text data, we must convert it into numerical format, a process known as vectorization. One of the most effective and widely used techniques for this is TF-IDF, which stands for Term Frequency–Inverse Document Frequency.
What is TF-IDF?
Term Frequency (TF) tells us how often a word appears in a document. Words that appear more often are generally more important.
Inverse Document Frequency (IDF) tells us how common or rare a word is across all documents. Words that appear in almost every document (like “the”, “is”, or “and”) are less useful for distinguishing between texts, so they get lower weight.
TF-IDF combines both ideas. It gives higher importance to words that are frequent within a document but rare across other documents, which makes them more useful for understanding the content or topic.
This method transforms each document (a sentence, review, etc.) into a vector of numbers. These vectors can then be used as input features for traditional machine learning models like Logistic Regression, Naive Bayes, or SVM.
Text Preprocessing with TF-IDF


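Here’s a sketch using a tiny made-up corpus (the sentences and their sentiment labels are purely illustrative):

```python
# A toy corpus with binary sentiment labels (1 = positive, 0 = negative)
documents = [
    'I loved this product, it works great',
    'Terrible quality, a complete waste of money',
    'Absolutely fantastic, would buy again',
    'Very disappointing, it broke after one day',
]
text_labels = [1, 0, 1, 0]

# Convert the raw text into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english')
X_text = vectorizer.fit_transform(documents)

print(X_text.shape)                        # (4, vocabulary_size)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
```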
Training a Text Classification Model


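Continuing with the toy corpus above (the split is stratified so both classes appear on each side):

```python
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_text, text_labels, test_size=0.5, random_state=42, stratify=text_labels
)

# Logistic Regression is a simple, strong baseline for TF-IDF features
text_clf = LogisticRegression()
text_clf.fit(X_train, y_train)

predictions = text_clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
```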
Conclusion
In this blog, we covered how to handle audio, text, and image data with Scikit-learn. Here’s a quick recap:
Data Type | Feature Extraction | Scikit-learn Model
Audio | MFCC | Random Forest, SVM
Image | Pixel values | Random Forest, SVM
Text | TF-IDF | Logistic Regression, Naive Bayes
Each data type requires different preprocessing steps, but Scikit-learn can handle them once they are transformed into numerical features.
Feel free to experiment with these steps using your own data, and happy learning!