Audio/Video Transcription Web App

A web application for transcribing audio and video files using faster-whisper, with support for YouTube videos and optional summarization via Ollama.

Features

  • Transcribe local audio/video files
  • Process YouTube videos (with subtitle extraction when available)
  • Automatic language detection
  • Multiple Whisper model options
  • Optional text summarization using Ollama
  • Modern web interface with Gradio
  • Docker support with CUDA
  • Configurable settings via config.ini

Requirements

  • Docker and Docker Compose
  • NVIDIA GPU with CUDA support
  • NVIDIA Container Toolkit
  • Ollama installed locally (optional, for summarization)

Installation

  1. Clone the repository:
git clone <repository-url>
cd whisperapp
  2. Install NVIDIA Container Toolkit (if not already installed):
# Add NVIDIA package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install nvidia-docker2 package
sudo apt-get update
sudo apt-get install -y nvidia-docker2

# Restart the Docker daemon
sudo systemctl restart docker
  3. Install Ollama locally (optional, for summarization):
curl https://ollama.ai/install.sh | sh
  4. Copy the example configuration file:
cp .env.example .env
  5. Edit the configuration files:
  • .env: Set your environment variables
  • config.ini: Configure Whisper, Ollama, and application settings

Running the Application

  1. Start Ollama locally (if you want to use summarization):
ollama serve
  2. Build and start the Whisper app container:
docker-compose up --build
  3. Open your web browser and navigate to:
http://localhost:7860

Configuration

Environment Variables (.env)

# Server configuration
SERVER_NAME=0.0.0.0
SERVER_PORT=7860
SHARE=true
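
These values presumably feed straight into Gradio's launch() call. A minimal sketch, assuming the app reads them from the environment and passes them through (the stub interface is only illustrative, not the app's actual UI):

import os

import gradio as gr

# Stand-in interface; the real app wires up file upload, model choice, etc.
demo = gr.Interface(fn=lambda path: "transcript goes here",
                    inputs=gr.Audio(type="filepath"),
                    outputs="text")

demo.launch(
    server_name=os.environ.get("SERVER_NAME", "0.0.0.0"),
    server_port=int(os.environ.get("SERVER_PORT", "7860")),
    share=os.environ.get("SHARE", "false").lower() == "true",
)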

Application Settings (config.ini)

[whisper]
default_model = base
device = cuda
compute_type = float16
beam_size = 5
vad_filter = true

[app]
max_duration = 3600
server_name = 0.0.0.0
server_port = 7860
share = true

[models]
available_models = tiny,base,small,medium,large-v1,large-v2,large-v3

[languages]
available_languages = en,es,fr,de,it,pt,nl,ja,ko,zh

[ollama]
enabled = false
url = http://host.docker.internal:11434
default_model = mistral
summarize_prompt = Please provide a comprehensive yet concise summary of the following text. Focus on the main points, key arguments, and important details while maintaining accuracy and completeness. Here's the text to summarize: 
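On the application side these options can be read with Python's standard configparser; a minimal sketch, assuming the section and option names shown above (the app's actual loading code may differ):

import configparser

config = configparser.ConfigParser()
config.read("config.ini")

model_name = config.get("whisper", "default_model", fallback="base")
device = config.get("whisper", "device", fallback="cuda")
compute_type = config.get("whisper", "compute_type", fallback="float16")
beam_size = config.getint("whisper", "beam_size", fallback=5)
vad_filter = config.getboolean("whisper", "vad_filter", fallback=True)
max_duration = config.getint("app", "max_duration", fallback=3600)
ollama_enabled = config.getboolean("ollama", "enabled", fallback=False)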

Features in Detail

Local File Transcription

  • Supports various audio and video formats
  • Automatic language detection
  • Multiple Whisper model options
  • Optional summarization with Ollama
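
For reference, the core transcription step boils down to a single faster-whisper call. A minimal sketch using the [whisper] defaults from config.ini (how the app wires this up internally is an assumption):

from faster_whisper import WhisperModel

# Load the model once; weights are downloaded and cached under models/
model = WhisperModel("base", device="cuda", compute_type="float16",
                     download_root="models")

# language=None enables automatic language detection
segments, info = model.transcribe("input.mp4", beam_size=5,
                                  vad_filter=True, language=None)

print(f"Detected language: {info.language} "
      f"(probability {info.language_probability:.2f})")
transcript = " ".join(segment.text.strip() for segment in segments)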

YouTube Video Processing

  • Supports youtube.com, youtu.be, and Invidious URLs
  • Automatically extracts subtitles if available
  • Falls back to transcription if no subtitles are found
  • Optional summarization with Ollama
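
A rough sketch of that subtitle-first flow with yt-dlp, the library commonly used for this kind of download; whether the app uses exactly these options is an assumption:

import yt_dlp

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL

# Probe the video without downloading to see whether subtitles exist
with yt_dlp.YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
    info = ydl.extract_info(url, download=False)

if info.get("subtitles"):
    # Uploader-provided subtitles: fetch only the subtitle track
    opts = {"skip_download": True, "writesubtitles": True,
            "subtitleslangs": ["en"], "quiet": True}
else:
    # No subtitles: download the audio and hand it to faster-whisper
    opts = {"format": "bestaudio/best", "outtmpl": "audio.%(ext)s",
            "quiet": True}

with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download([url])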

Summarization

  • Uses locally running Ollama for text summarization
  • Configurable model selection
  • Customizable prompt
  • Available for both local files and YouTube videos
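
Ollama exposes a small HTTP API; a minimal sketch of the summarization call, assuming the url, default_model, and summarize_prompt values from the [ollama] section above:

import requests

OLLAMA_URL = "http://host.docker.internal:11434"  # [ollama] url in config.ini
PROMPT = ("Please provide a comprehensive yet concise summary of the "
          "following text. Focus on the main points, key arguments, and "
          "important details while maintaining accuracy and completeness. "
          "Here's the text to summarize: ")

def summarize(text: str, model: str = "mistral") -> str:
    # With stream=False, /api/generate returns the whole completion at once
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json={"model": model,
                               "prompt": PROMPT + text,
                               "stream": False},
                         timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]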

Notes

  • For better accuracy, use larger models (medium or one of the large variants)
  • Processing time increases with model size
  • GPU is recommended for faster processing
  • Maximum audio duration is configurable (default: 60 minutes)
  • YouTube videos will first try to use available subtitles
  • If no subtitles are available, the video will be transcribed
  • Ollama summarization is optional and requires Ollama to be running locally
  • The application runs in a Docker container with CUDA support
  • Models are downloaded and cached in the models directory
  • The container connects to the local Ollama instance using host.docker.internal

License

This project is licensed under the MIT License - see the LICENSE file for details.
