Background Noise Classifier
Overview
The DC Noise Predictor is a machine learning service I developed to classify background noise in audio files, with a particular focus on dialogue data. It improves on the traditional Signal-to-Noise Ratio (SNR) metric by combining it with other acoustic features tuned to the characteristics of dialogue recordings.
Problem Statement
In audio processing, particularly for dialogue recordings, accurately identifying background noise levels is crucial for quality assessment and downstream processing. Traditional metrics like SNR alone are insufficient for dialogue data, which has unique characteristics compared to other audio types.
The challenge was to create a more sophisticated system that could:
- Accurately classify background noise levels in dialogue recordings
- Provide probability scores for confidence assessment
- Process audio files efficiently in a production environment
- Handle multi-channel audio with independent analysis per channel
Solution Architecture
I designed and implemented the service as a Nuclio serverless function deployed to a Kubernetes cluster. This architecture provides scalability, reliability, and efficient resource utilization. The service exposes an HTTP endpoint that accepts audio URLs and returns detailed noise classification results.
The system consists of four main components (a simplified orchestration sketch follows this list):
- Handler: The main entry point that validates requests, downloads audio, orchestrates the processing pipeline, and formats results
- Feature Extractor: Extracts acoustic features using Voice Activity Detection (VAD), SNR calculations, and volume metrics for both voiced and non-voiced segments
- Predictor: Processes features through a trained XGBoost model with appropriate preprocessing transformations
- Model: An XGBoost classifier trained to distinguish between noisy and silent audio channels
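The sketch below shows how these components fit together in the handler. It is a simplified outline rather than the production code: download_audio, extract_features, and predict_channel are hypothetical stand-ins for the internal libraries, while the handler(context, event) signature and context.Response are standard Nuclio conventions.

import json

def handler(context, event):
    # Validate the request payload
    body = json.loads(event.body)
    audio_url = body.get("inputs", {}).get("audioUrl")
    if not audio_url:
        return context.Response(
            body=json.dumps({"error": "inputs.audioUrl is required"}),
            content_type="application/json",
            status_code=400,
        )

    # Download the file and split it into per-channel sample arrays
    channels = download_audio(audio_url)  # hypothetical stand-in for dc-audio-core

    results = {}
    for i, samples in enumerate(channels, start=1):
        features = extract_features(samples)           # VAD, SNR, and volume metrics
        noisy_p, silent_p = predict_channel(features)  # XGBoost class probabilities
        results[f"channel_{i}"] = {
            "predictedBackgroundNoise": "noisy" if noisy_p >= silent_p else "silent",
            "noisyProbability": round(noisy_p, 3),
            "silentProbability": round(silent_p, 3),
        }

    return context.Response(
        body=json.dumps(results),
        content_type="application/json",
        status_code=200,
    )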
Technical Implementation
The service was implemented using Python with several specialized libraries:
- dc-audio-core: For audio processing and reading operations
- speech-audio-metrics: For extracting acoustic features (illustrated in the sketch after this list)
- scikit-learn: For preprocessing and model pipeline management
- XGBoost: For the classification model
- dc-nuclio-utils: For logging and request handling
- dc-request-authorization: For secure request authorization
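Because dc-audio-core and speech-audio-metrics are internal libraries, the following is a library-agnostic sketch of the kinds of features involved: a simple energy-based VAD, an SNR estimate, and RMS volume for voiced and non-voiced segments. The frame size and threshold are illustrative placeholders, not the actual implementation.

import numpy as np

def extract_features(samples: np.ndarray, sample_rate: int = 16000) -> dict:
    # Split the signal into 30 ms frames
    frame_len = int(0.03 * sample_rate)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Per-frame RMS energy; frames above an arbitrary threshold count as voiced
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    voiced = rms > 1.5 * np.median(rms)

    # Estimate SNR from the ratio of voiced to non-voiced power
    signal_power = np.mean(rms[voiced] ** 2) if voiced.any() else 0.0
    noise_power = np.mean(rms[~voiced] ** 2) if (~voiced).any() else 1e-12

    return {
        "snr_db": float(10 * np.log10(signal_power / noise_power + 1e-12)),
        "voiced_rms": float(rms[voiced].mean()) if voiced.any() else 0.0,
        "unvoiced_rms": float(rms[~voiced].mean()) if (~voiced).any() else 0.0,
        "voiced_ratio": float(voiced.mean()),
    }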
API Design
The service exposes a clean, simple API that accepts audio URLs and returns detailed classification results:
Request Format
{
  "inputs": {
    "audioUrl": "https://example.com/audio-file.wav"
  }
}
Response Format
{
  "channel_1": {
    "predictedBackgroundNoise": "noisy",
    "noisyProbability": 0.875,
    "silentProbability": 0.125
  },
  "channel_2": {
    "predictedBackgroundNoise": "silent",
    "noisyProbability": 0.123,
    "silentProbability": 0.877
  }
}
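As a usage illustration, a client call might look like the following. The endpoint URL and bearer-token header are placeholders; the actual authorization flow is handled server-side by dc-request-authorization.

import requests

resp = requests.post(
    "https://nuclio.example.com/dc-noise-predictor",  # placeholder endpoint
    json={"inputs": {"audioUrl": "https://example.com/audio-file.wav"}},
    headers={"Authorization": "Bearer <token>"},  # placeholder; validated server-side
    timeout=60,
)
resp.raise_for_status()
for channel, scores in resp.json().items():
    print(channel, scores["predictedBackgroundNoise"], scores["noisyProbability"])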
Model Training
I trained the model using a dataset of labeled audio files with known background noise characteristics. The training process involved:
- Feature extraction to create a comprehensive feature matrix
- XGBoost classifier training with hyperparameter tuning via GridSearchCV (see the sketch after this list)
- Evaluation using F1 score, accuracy, precision, and recall metrics
- Cross-validation to ensure model robustness
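A condensed sketch of this procedure is shown below. The stand-in data, the hyperparameter grid, and the preprocessing step are illustrative assumptions; in the real pipeline, X comes from feature extraction and y from the labeled dataset.

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from xgboost import XGBClassifier

# Stand-in data so the sketch runs; the real X is the extracted feature
# matrix and the real y the dataset labels (1 = noisy, 0 = silent)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing and classifier share one pipeline so the grid search
# cross-validates the whole chain
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", XGBClassifier(eval_metric="logloss")),
])

param_grid = {  # illustrative grid, not the one actually used
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [3, 5, 7],
    "clf__learning_rate": [0.05, 0.1],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)

# Evaluate on the held-out split with the same four metrics reported below
y_pred = search.predict(X_test)
print("F1:       ", f1_score(y_test, y_pred))
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))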
The final model achieved the following performance metrics:
- F1 score: 0.880
- Accuracy: 0.884
- Precision: 0.884
- Recall: 0.877
Deployment
The service was deployed as a Nuclio function in a Kubernetes environment. The deployment configuration, defined in the function.yaml file, specified (an illustrative example follows this list):
- Runtime environment (Python 3.8)
- Required dependencies
- Resource allocations
- Environment variables
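An illustrative function.yaml along these lines is sketched below; the function name, model path, dependencies, and resource values are placeholders rather than the production configuration.

apiVersion: "nuclio.io/v1"
kind: NuclioFunction
metadata:
  name: dc-noise-predictor  # placeholder name
spec:
  runtime: python:3.8
  handler: main:handler
  env:
    - name: MODEL_PATH
      value: /opt/models/noise-predictor.json  # placeholder path
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
  build:
    commands:
      - pip install xgboost scikit-learn  # internal packages omitted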
This serverless approach allowed for efficient scaling based on demand while minimizing resource usage during idle periods.
Challenges and Solutions
During development, I encountered several challenges:
- Feature Selection: Identifying the most predictive acoustic features required extensive experimentation and domain knowledge
- Model Tuning: Finding the optimal hyperparameters for the XGBoost model required careful cross-validation and grid search
- Performance Optimization: Ensuring the service could process audio files efficiently required optimizing the feature extraction pipeline
- Multi-Channel Support: Handling multi-channel audio required designing a flexible architecture that could process each channel independently (sketched after this list)
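For the last point, here is a minimal sketch of independent per-channel analysis, using the soundfile library as a stand-in for the internal audio-reading code:

import soundfile as sf

samples, sample_rate = sf.read("dialogue.wav")
if samples.ndim == 1:
    samples = samples[:, None]  # treat mono as a single channel

# Each channel runs through the same feature-extraction and prediction steps
for i in range(samples.shape[1]):
    features = extract_features(samples[:, i], sample_rate)  # from the earlier sketch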
Results and Impact
The DC Noise Predictor service significantly improved the accuracy of background noise classification compared to traditional SNR-only approaches. This enhanced classification enabled:
- Better quality assessment of dialogue recordings
- More accurate filtering of audio files for downstream processing
- Improved decision-making for audio enhancement processes
- Quantifiable confidence scores for classification results
Limitations and Future Work
While the service performs well for its intended purpose, there are some limitations and areas for future improvement:
- The service is specifically optimized for dialogue audio and may not perform optimally on other audio types
- Performance may vary depending on audio quality and recording conditions
Future improvements could include:
- Expanding the training dataset with more diverse audio samples
- Exploring additional acoustic features
- Testing alternative machine learning models
- Adding support for non-dialogue audio
- Implementing real-time processing capabilities
Conclusion
The DC Noise Predictor service demonstrates how combining traditional audio metrics with machine learning can create more accurate and useful audio classification systems. By focusing specifically on the unique characteristics of dialogue recordings, the service provides valuable insights that traditional metrics alone cannot capture.