From d52cc2bf12dafb41ace128d2f449b5748dbbeeb7 Mon Sep 17 00:00:00 2001
From: tcsenpai
Date: Fri, 23 May 2025 11:51:40 +0200
Subject: [PATCH] removed unused whisperx params

---
 README.md | 156 ++++++++++++++++++++++++++++++++----------------------
 app.py    |  12 +++-
 2 files changed, 102 insertions(+), 66 deletions(-)

diff --git a/README.md b/README.md
index 8da1a7f..74e548d 100644
--- a/README.md
+++ b/README.md
@@ -1,67 +1,96 @@
-# Whisper Transcription Web App
+# Audio/Video Transcription Web App
 
-A user-friendly web application for transcribing audio and video files using OpenAI's Whisper model, powered by Gradio and faster-whisper.
+A web application for transcribing audio and video files using WhisperX, with support for YouTube videos and optional summarization using Ollama.
 
 ## Features
 
-- 🎙️ Transcribe audio and video files
-- 🚀 GPU acceleration support
-- 🌐 Multiple language support
-- 📱 Responsive and modern UI
-- 🔄 Multiple model options (tiny to large-v3)
-- ⚙️ Configurable settings via config.ini
-- 📺 YouTube video support with subtitle extraction
+- Transcribe local audio/video files
+- Process YouTube videos (with subtitle extraction when available)
+- Automatic language detection
+- Multiple WhisperX model options
+- Optional text summarization using Ollama
+- Modern web interface with Gradio
+- Configurable settings via config.ini
 
 ## Requirements
 
-- Python 3.10+
-- CUDA-capable GPU (recommended)
-- FFmpeg (for audio/video processing)
-- uv package manager
+- Python 3.8+
+- CUDA-compatible GPU (recommended)
+- FFmpeg installed on your system
+- Ollama (optional, for summarization)
 
 ## Installation
 
-1. Clone this repository:
+1. Clone the repository:
 ```bash
 git clone
 cd whisperapp
 ```
 
-2. Install uv (if you just pip install you might break your environment):
+2. Install the required packages:
 ```bash
-curl -LsSf https://astral.sh/uv/install.sh | sh
+pip install -r requirements.txt
 ```
 
-3. Create a venv with uv:
+3. Install FFmpeg (if not already installed):
+- Ubuntu/Debian:
 ```bash
-uv venv --python=3.10
+sudo apt update && sudo apt install ffmpeg
+```
+- macOS:
+```bash
+brew install ffmpeg
+```
+- Windows: Download from [FFmpeg website](https://ffmpeg.org/download.html)
+
+4. Copy the example configuration file:
+```bash
+cp .env.example .env
 ```
 
-4. Install the required packages using uv:
-```bash
-uv pip install -r requirements.txt
-```
+5. Edit the configuration files:
+- `.env`: Set your environment variables
+- `config.ini`: Configure WhisperX, Ollama, and application settings
 
 ## Configuration
 
-The application can be configured through the `config.ini` file. Here are the available settings:
+### Environment Variables (.env)
 
-### Whisper Settings
-- `default_model`: Default Whisper model to use
-- `device`: Device to use (cuda/cpu)
-- `compute_type`: Computation type (float16/float32)
-- `beam_size`: Beam size for transcription
-- `vad_filter`: Enable/disable voice activity detection
+```ini
+# Server configuration
+SERVER_NAME=0.0.0.0
+SERVER_PORT=7860
+SHARE=true
+```
 
-### App Settings
-- `max_duration`: Maximum audio duration in seconds
-- `server_name`: Server hostname
-- `server_port`: Server port
-- `share`: Enable/disable public sharing
+### Application Settings (config.ini)
 
-### Models and Languages
-- `available_models`: Comma-separated list of available models
-- `available_languages`: Comma-separated list of supported languages
+```ini
+[whisper]
+default_model = base
+device = cuda
+compute_type = float32
+batch_size = 16
+vad = true
+
+[app]
+max_duration = 3600
+server_name = 0.0.0.0
+server_port = 7860
+share = true
+
+[models]
+available_models = tiny,base,small,medium,large-v1,large-v2,large-v3
+
+[languages]
+available_languages = en,es,fr,de,it,pt,nl,ja,ko,zh
+
+[ollama]
+enabled = false
+url = http://localhost:11434
+default_model = mistral
+summarize_prompt = Please provide a comprehensive yet concise summary of the following text. Focus on the main points, key arguments, and important details while maintaining accuracy and completeness. Here's the text to summarize:
+```
 
 ## Usage
 
@@ -70,43 +99,46 @@ The application can be configured through the `config.ini` file. Here are the av
 
 1. Start the application:
 ```bash
 python app.py
 ```
 
-2. Open your web browser and navigate to `http://localhost:7860`
+2. Open your web browser and navigate to:
+```
+http://localhost:7860
+```
 
-3. Choose between two tabs:
-   - **Local File**: Upload and transcribe audio/video files
-   - **YouTube**: Process YouTube videos with subtitle extraction
+3. Use the interface to:
+   - Upload and transcribe local audio/video files
+   - Process YouTube videos
+   - Generate summaries (if Ollama is configured)
 
-### Local File Tab
-1. Upload an audio or video file
-2. Select your preferred model and language settings
-3. Click "Transcribe" and wait for the results
+## Features in Detail
 
-### YouTube Tab
-1. Enter a YouTube URL (supports youtube.com, youtu.be, and invidious URLs)
-2. Select your preferred model and language settings
-3. Click "Process Video"
-4. The app will:
-   - First try to extract available subtitles
-   - If no subtitles are available, download and transcribe the video
+### Local File Transcription
+- Supports various audio and video formats
+- Automatic language detection
+- Multiple WhisperX model options
+- Optional summarization with Ollama
 
-## Model Options
+### YouTube Video Processing
+- Supports youtube.com, youtu.be, and invidious URLs
+- Automatically extracts subtitles if available
+- Falls back to transcription if no subtitles found
+- Optional summarization with Ollama
 
-- tiny: Fastest, lowest accuracy
-- base: Good balance of speed and accuracy
-- small: Better accuracy, moderate speed
-- medium: High accuracy, slower
-- large-v1/v2/v3: Highest accuracy, slowest
+### Summarization
+- Uses Ollama for text summarization
+- Configurable model selection
+- Customizable prompt
+- Available for both local files and YouTube videos
 
-## Tips
+## Notes
 
 - For better accuracy, use larger models (medium, large)
 - Processing time increases with model size
 - GPU is recommended for faster processing
-- Maximum audio duration is configurable in config.ini
-- Use uv for faster package installation and dependency resolution
+- Maximum audio duration is configurable (default: 60 minutes)
 - YouTube videos will first try to use available subtitles
 - If no subtitles are available, the video will be transcribed
+- Ollama summarization is optional and requires Ollama to be running
 
 ## License
 
-MIT License
\ No newline at end of file
+This project is licensed under the MIT License - see the LICENSE file for details.
\ No newline at end of file
diff --git a/app.py b/app.py
index 9bfaa08..18e9cd8 100644
--- a/app.py
+++ b/app.py
@@ -139,8 +139,10 @@ def transcribe_audio(
     result = model.transcribe(
         audio_file,
         language=language if language != "Auto-detect" else None,
-        beam_size=BEAM_SIZE,
-        vad_filter=VAD_FILTER,
+        batch_size=16,  # WhisperX uses batch_size instead of beam_size
+        vad=(
+            True if VAD_FILTER else False
+        ),  # WhisperX uses vad instead of vad_filter
     )
 
     # Get the full text with timestamps
@@ -459,8 +461,10 @@ def create_interface():
         result = model.transcribe(
             audio,
             language=lang if lang != "Auto-detect" else None,
-            beam_size=BEAM_SIZE,
-            vad_filter=VAD_FILTER,
+            batch_size=16,  # WhisperX uses batch_size instead of beam_size
+            vad=(
+                True if VAD_FILTER else False
+            ),  # WhisperX uses vad instead of vad_filter
         )
 
         # Get the full text with timestamps
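
Reviewer note on the app.py hunks: the patch swaps faster-whisper's `beam_size`/`vad_filter` keywords for WhisperX-style `batch_size`/`vad`. The mapping can be sketched without WhisperX installed; the values below come from the README's config.ini example, while the helper name `build_transcribe_kwargs` is a hypothetical illustration (the patch passes the kwargs inline to `model.transcribe`):

```python
import configparser

# Values mirror the [whisper] section the README adds to config.ini.
config = configparser.ConfigParser()
config.read_string("""
[whisper]
batch_size = 16
vad = true
""")

VAD_FILTER = config.getboolean("whisper", "vad")

def build_transcribe_kwargs(language: str) -> dict:
    """Kwargs as passed to model.transcribe() after this patch."""
    return {
        # "Auto-detect" in the UI maps to language=None (let the model detect).
        "language": language if language != "Auto-detect" else None,
        # WhisperX takes batch_size, not beam_size.
        "batch_size": config.getint("whisper", "batch_size"),
        # WhisperX takes vad, not vad_filter.
        "vad": bool(VAD_FILTER),
    }

print(build_transcribe_kwargs("Auto-detect"))
# {'language': None, 'batch_size': 16, 'vad': True}
```

Note that the patched code hardcodes `batch_size=16` even though the README adds a `batch_size` key to config.ini; reading it from the config, as sketched, would keep the two in sync.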
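
The `[ollama]` section the README introduces drives the optional summarization step. A minimal sketch of the request body such a call might build, using only the stdlib; the endpoint and payload fields (`model`, `prompt`, `stream`) follow Ollama's public `/api/generate` API, but the helper name and how app.py actually wires this up are assumptions:

```python
import json

# From the README's [ollama] config.ini section.
OLLAMA_URL = "http://localhost:11434"
SUMMARIZE_PROMPT = (
    "Please provide a comprehensive yet concise summary of the following text. "
    "Focus on the main points, key arguments, and important details while "
    "maintaining accuracy and completeness. Here's the text to summarize:"
)

def build_summarize_request(text: str, model: str = "mistral") -> bytes:
    """JSON body for POST {OLLAMA_URL}/api/generate (non-streaming)."""
    return json.dumps({
        "model": model,
        "prompt": f"{SUMMARIZE_PROMPT}\n\n{text}",
        "stream": False,  # return one complete response instead of a stream
    }).encode("utf-8")

payload = json.loads(build_summarize_request("Transcript text here."))
print(payload["model"])
# mistral
```

In the app itself this body would be POSTed to `OLLAMA_URL + "/api/generate"` (e.g. via `urllib.request`), which is why the README notes that summarization requires Ollama to be running.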