Using Scikit-learn to Build a Model from Non-Tabular Data: Audio, Text, and Images
- Aradhana Bopche
- Apr 12
- 5 min read
Updated: Apr 24
In this blog, we’ll explore how to use Scikit-learn to build machine learning models with non-tabular data such as audio, text, and images. These data types require special preprocessing techniques, which we’ll cover step by step. By the end of this guide, you’ll have a deeper understanding of how to work with these non-tabular formats using Scikit-learn and related libraries.
Step 1: Setting Up the Environment
Before getting into each type of non-tabular data, let's ensure our environment is set up with the necessary libraries.
We’ll need:
Scikit-learn: For building and evaluating machine learning models.
NumPy: For numerical operations.
Pandas: For handling any data manipulation (though minimal here).
Matplotlib/Seaborn: For visualization.
Librosa: For audio data handling.
nltk: For text data preprocessing (Natural Language Toolkit).
OpenCV or Pillow: For image processing.
You can install these libraries via pip:

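```bash
# opencv-python and pillow are the PyPI package names for OpenCV and Pillow
pip install scikit-learn numpy pandas matplotlib seaborn librosa nltk opencv-python pillow
```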
Now, let’s import all the libraries into our project:

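```python
# A representative set of imports for the examples in this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
import nltk
import cv2
from PIL import Image
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```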
Step 2: Audio Data — Building a Model with Scikit-learn
Audio Data Overview
Audio data is a type of non-tabular data that is usually represented as waveforms, which show how sound (or air pressure) changes over time. These waveforms are essentially long sequences of numbers, and while they carry rich information, they are not directly usable by machine learning models. That’s why we need to preprocess the audio and extract useful features that represent important aspects of the sound, like pitch, frequency, and tone. This process is called feature extraction.
Two common methods for this are MFCC (Mel Frequency Cepstral Coefficients) and spectrograms. MFCCs are widely used in speech and music applications because they mimic how the human ear hears sound, capturing the tone and texture.
Spectrograms, on the other hand, are like colorful heatmaps that show which frequencies are present at each moment in time and how strong they are. These features help turn raw audio into meaningful patterns that models can learn from. Once extracted, these features can be used to train traditional machine learning models (like Random Forest or SVM) for tasks such as speech recognition, audio classification, and emotion detection.
Loading Audio Data with Librosa

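A minimal loading example (audio_file.wav is a placeholder; any audio file you have will do):

```python
# Load the audio file; librosa resamples to 22,050 Hz and mixes to mono by default
y, sr = librosa.load('audio_file.wav')

print(y.shape)  # the raw waveform as a 1D NumPy array
print(sr)       # the sampling rate in samples per second
```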
In the code above:
y represents the audio waveform (the raw audio signal).
sr is the sampling rate (the number of samples captured per second of audio).
Extracting Audio Features with MFCC
Now that we have the audio loaded, we’ll extract MFCC features, which are a popular feature set for speech recognition and other audio-based machine learning tasks.


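Here’s a sketch of the extraction, followed by a quick visualization:

```python
# Extract 13 MFCC coefficients from the waveform loaded above
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_time_frames)

# Display the MFCCs as a heatmap over time
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()
plt.show()
```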
In this code:
librosa.feature.mfcc() extracts 13 MFCC coefficients from the audio signal.
We use librosa.display.specshow() to display the MFCC features as a heatmap over time.
These MFCCs will be the features that we feed into a machine learning model, which can learn patterns from the audio.
Preparing the Data for ML Models
Before we can train a model, we need to prepare the data. Here, we’ll average the MFCCs over time into a fixed-length 1D feature vector and normalize the values. This step is necessary because Scikit-learn models expect a 2D input (samples x features), with one fixed-length row per audio clip.



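One way to do this for a single clip (illustrative; with a real dataset you would build one such row per clip and stack them):

```python
# Average each MFCC coefficient over time to get one fixed-length vector per clip
mfccs_mean = np.mean(mfccs, axis=1)   # shape: (13,)

# Reshape to 2D (samples x features); our single clip is one row
X = mfccs_mean.reshape(1, -1)

# Standardize features to zero mean and unit variance
# (with only one clip this is purely illustrative; in practice,
# fit the scaler on the full training matrix)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```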
In the above code:
np.mean() computes the mean of each MFCC coefficient over time.
StandardScaler() normalizes the features to have a mean of 0 and a standard deviation of 1, which is often helpful for machine learning models.
Training an Audio Classification Model
Now that we have the audio features (MFCCs), let's train a machine learning model. We will use Random Forest for this example, but you can easily swap it with other models like SVM or KNN.


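A minimal sketch, assuming our single clip carries a made-up label (a real project would have many labeled clips and a proper train/test split):

```python
# A dummy label for our single clip; with real data you would have
# one label per audio file (e.g., a speaker ID or emotion class)
labels = np.array([0])

# Initialize and train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_scaled, labels)

# Predict on the same sample (a simplification; normally you would
# evaluate on a held-out test set)
predictions = clf.predict(X_scaled)
print('Accuracy:', accuracy_score(labels, predictions))
```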
This code:
Initializes and trains a Random Forest classifier.
Predicts the class of the same training sample (a simplification; usually we would split the data into training and test sets).
Prints the model’s accuracy.
Step 3: Image Data — Building a Model with Scikit-learn
Now that we've covered audio, let’s move to image data. We'll convert images into feature vectors that can be fed into machine learning models.
Images are a type of non-tabular data made up of pixels, where each pixel stores color and brightness as numeric values. Machine learning models can’t work with image files (like .jpg or .png) directly, so we first need to convert them into numerical arrays that models can understand.
To do this, we’ll use image processing libraries like Pillow or OpenCV. These tools help us:
Open and display images
Convert them to grayscale (if needed)
Resize them to a consistent shape
Normalize the pixel values (scale them between 0 and 1)
Flatten them into a 1D feature vector
Loading and Preprocessing Image Data
We'll use Pillow or OpenCV to handle images. Let’s load an image and flatten it into a 1D vector:
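A minimal sketch using Pillow (image_file.png is a placeholder, and we resize to 28x28 to match the example below):

```python
# Open the image; 'image_file.png' is a placeholder for your own file
img = Image.open('image_file.png')

img = img.convert('L')      # convert to grayscale
img = img.resize((28, 28))  # resize to a consistent shape

pixels = np.array(img) / 255.0  # normalize pixel values to [0, 1]
features = pixels.flatten()     # flatten to a 1D feature vector
print(features.shape)           # (784,)
```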
This 1D vector becomes the input to our machine learning model, just like a row in a table. For example, a 28x28 grayscale image turns into a 784-element vector (because 28 × 28 = 784). This format works well with traditional models like Random Forest, Logistic Regression, or SVM.


Training an Image Classifier
Now we’ll train a machine learning model (Random Forest, for instance) on the image data:


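Again a minimal sketch with a single image; real training data would be a stack of many such vectors with one label each:

```python
# Build the 2D (samples x features) array; our single image is one row
X_img = features.reshape(1, -1)
y_img = np.array([0])  # a dummy label, purely illustrative

clf_img = RandomForestClassifier(n_estimators=100, random_state=42)
clf_img.fit(X_img, y_img)

print(clf_img.predict(X_img))  # predict the class of our single image
```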
Step 4: Text Data — Building a Model with Scikit-learn
For text, we need to convert the raw words into a numeric format, and TF-IDF (Term Frequency-Inverse Document Frequency) is a great method for this task.
Text Data Overview — Why Vectorization is Needed
Text is one of the most common and powerful sources of data in the real world — think of product reviews, tweets, news articles, or chatbot messages. But machine learning models can’t directly understand raw text like we do. They need numbers, not words.
So, before we can train a model on text data, we must convert it into numerical format, a process known as vectorization. One of the most effective and widely used techniques for this is TF-IDF, which stands for Term Frequency–Inverse Document Frequency.
What is TF-IDF?
Term Frequency (TF) tells us how often a word appears in a document. Words that appear more often are generally more important.
Inverse Document Frequency (IDF) tells us how common or rare a word is across all documents. Words that appear in almost every document (like “the”, “is”, or “and”) are less useful for distinguishing between texts, so they get lower weight.
TF-IDF combines both ideas. It gives higher importance to words that are frequent within a document but rare across other documents, which makes them more useful for understanding the content or topic.
This method transforms each document (a sentence, review, etc.) into a vector of numbers. These vectors can then be used as input features for traditional machine learning models like Logistic Regression, Naive Bayes, or SVM.
Text Preprocessing with TF-IDF


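Here’s a sketch using a tiny made-up corpus (the sentences and their sentiment labels are purely illustrative):

```python
# A toy corpus with binary sentiment labels (1 = positive, 0 = negative)
documents = [
    'I loved this product, it works great',
    'Terrible quality, a complete waste of money',
    'Absolutely fantastic, would buy again',
    'Very disappointing, it broke after one day',
]
text_labels = [1, 0, 1, 0]

# Convert the raw text into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english')
X_text = vectorizer.fit_transform(documents)

print(X_text.shape)                        # (4, vocabulary_size)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
```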
Training a Text Classification Model


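Continuing with the toy corpus above (the split is stratified so both classes appear on each side):

```python
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_text, text_labels, test_size=0.5, random_state=42, stratify=text_labels
)

# Logistic Regression is a simple, strong baseline for TF-IDF features
text_clf = LogisticRegression()
text_clf.fit(X_train, y_train)

predictions = text_clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
```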
Conclusion
In this blog, we covered how to handle audio, text, and image data with Scikit-learn. Here’s a quick recap:
Data Type | Feature Extraction | Scikit-learn Model
Audio | MFCC | Random Forest, SVM
Image | Pixel values | Random Forest, SVM
Text | TF-IDF | Logistic Regression, Naive Bayes
Each data type requires different preprocessing steps, but Scikit-learn can handle them once they are transformed into numerical features.
Feel free to experiment with these steps using your own data, and happy learning!