removed unused whisperx params

tcsenpai 2025-05-23 11:51:40 +02:00
parent 02f580d195
commit d52cc2bf12
2 changed files with 102 additions and 66 deletions

README.md (156 lines changed)

@@ -1,67 +1,96 @@
-# Whisper Transcription Web App
-A user-friendly web application for transcribing audio and video files using OpenAI's Whisper model, powered by Gradio and faster-whisper.
+# Audio/Video Transcription Web App
+A web application for transcribing audio and video files using WhisperX, with support for YouTube videos and optional summarization using Ollama.
 ## Features
-- 🎙️ Transcribe audio and video files
-- 🚀 GPU acceleration support
-- 🌐 Multiple language support
-- 📱 Responsive and modern UI
-- 🔄 Multiple model options (tiny to large-v3)
-- ⚙️ Configurable settings via config.ini
-- 📺 YouTube video support with subtitle extraction
+- Transcribe local audio/video files
+- Process YouTube videos (with subtitle extraction when available)
+- Automatic language detection
+- Multiple WhisperX model options
+- Optional text summarization using Ollama
+- Modern web interface with Gradio
+- Configurable settings via config.ini
 ## Requirements
-- Python 3.10+
-- CUDA-capable GPU (recommended)
-- FFmpeg (for audio/video processing)
-- uv package manager
+- Python 3.8+
+- CUDA-compatible GPU (recommended)
+- FFmpeg installed on your system
+- Ollama (optional, for summarization)
 ## Installation
-1. Clone this repository:
+1. Clone the repository:
 ```bash
 git clone <repository-url>
 cd whisperapp
 ```
-2. Install uv (if you just pip install you might break your environment):
+2. Install the required packages:
 ```bash
-curl -LsSf https://astral.sh/uv/install.sh | sh
+pip install -r requirements.txt
 ```
-3. Create a venv with uv:
-```bash
-uv venv --python=3.10
-```
-4. Install the required packages using uv:
-```bash
-uv pip install -r requirements.txt
-```
+3. Install FFmpeg (if not already installed):
+- Ubuntu/Debian:
+```bash
+sudo apt update && sudo apt install ffmpeg
+```
+- macOS:
+```bash
+brew install ffmpeg
+```
+- Windows: Download from [FFmpeg website](https://ffmpeg.org/download.html)
+4. Copy the example configuration file:
+```bash
+cp .env.example .env
+```
+5. Edit the configuration files:
+- `.env`: Set your environment variables
+- `config.ini`: Configure WhisperX, Ollama, and application settings
 ## Configuration
-The application can be configured through the `config.ini` file. Here are the available settings:
-### Whisper Settings
-- `default_model`: Default Whisper model to use
-- `device`: Device to use (cuda/cpu)
-- `compute_type`: Computation type (float16/float32)
-- `beam_size`: Beam size for transcription
-- `vad_filter`: Enable/disable voice activity detection
-### App Settings
-- `max_duration`: Maximum audio duration in seconds
-- `server_name`: Server hostname
-- `server_port`: Server port
-- `share`: Enable/disable public sharing
-### Models and Languages
-- `available_models`: Comma-separated list of available models
-- `available_languages`: Comma-separated list of supported languages
+### Environment Variables (.env)
+```ini
+# Server configuration
+SERVER_NAME=0.0.0.0
+SERVER_PORT=7860
+SHARE=true
+```
+### Application Settings (config.ini)
+```ini
+[whisper]
+default_model = base
+device = cuda
+compute_type = float32
+batch_size = 16
+vad = true
+[app]
+max_duration = 3600
+server_name = 0.0.0.0
+server_port = 7860
+share = true
+[models]
+available_models = tiny,base,small,medium,large-v1,large-v2,large-v3
+[languages]
+available_languages = en,es,fr,de,it,pt,nl,ja,ko,zh
+[ollama]
+enabled = false
+url = http://localhost:11434
+default_model = mistral
+summarize_prompt = Please provide a comprehensive yet concise summary of the following text. Focus on the main points, key arguments, and important details while maintaining accuracy and completeness. Here's the text to summarize:
+```
 ## Usage
@@ -70,43 +99,46 @@ The application can be configured through the `config.ini` file. Here are the av
 python app.py
 ```
-2. Open your web browser and navigate to `http://localhost:7860`
+2. Open your web browser and navigate to:
+```
+http://localhost:7860
+```
-3. Choose between two tabs:
-- **Local File**: Upload and transcribe audio/video files
-- **YouTube**: Process YouTube videos with subtitle extraction
+3. Use the interface to:
+- Upload and transcribe local audio/video files
+- Process YouTube videos
+- Generate summaries (if Ollama is configured)
-### Local File Tab
-1. Upload an audio or video file
-2. Select your preferred model and language settings
-3. Click "Transcribe" and wait for the results
-### YouTube Tab
-1. Enter a YouTube URL (supports youtube.com, youtu.be, and invidious URLs)
-2. Select your preferred model and language settings
-3. Click "Process Video"
-4. The app will:
-- First try to extract available subtitles
-- If no subtitles are available, download and transcribe the video
-## Model Options
-- tiny: Fastest, lowest accuracy
-- base: Good balance of speed and accuracy
-- small: Better accuracy, moderate speed
-- medium: High accuracy, slower
-- large-v1/v2/v3: Highest accuracy, slowest
+## Features in Detail
+### Local File Transcription
+- Supports various audio and video formats
+- Automatic language detection
+- Multiple WhisperX model options
+- Optional summarization with Ollama
+### YouTube Video Processing
+- Supports youtube.com, youtu.be, and invidious URLs
+- Automatically extracts subtitles if available
+- Falls back to transcription if no subtitles found
+- Optional summarization with Ollama
+### Summarization
+- Uses Ollama for text summarization
+- Configurable model selection
+- Customizable prompt
+- Available for both local files and YouTube videos
-## Tips
+## Notes
 - For better accuracy, use larger models (medium, large)
 - Processing time increases with model size
 - GPU is recommended for faster processing
-- Maximum audio duration is configurable in config.ini
+- Maximum audio duration is configurable (default: 60 minutes)
-- Use uv for faster package installation and dependency resolution
 - YouTube videos will first try to use available subtitles
 - If no subtitles are available, the video will be transcribed
+- Ollama summarization is optional and requires Ollama to be running
 ## License
-MIT License
+This project is licensed under the MIT License - see the LICENSE file for details.
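
The `[whisper]`, `[app]`, and `[ollama]` sections documented above map directly onto Python's standard `configparser`. A minimal sketch of how these settings might be loaded; the actual loader in `app.py` is not part of this diff, so the variable names below are illustrative:

```python
import configparser

# Load the settings documented in the README hunk above.
config = configparser.ConfigParser()
config.read("config.ini")

# [whisper] section — fallbacks mirror the README's example values.
default_model = config.get("whisper", "default_model", fallback="base")
device = config.get("whisper", "device", fallback="cuda")
compute_type = config.get("whisper", "compute_type", fallback="float32")
batch_size = config.getint("whisper", "batch_size", fallback=16)
vad = config.getboolean("whisper", "vad", fallback=True)

# [app] and [ollama] sections.
max_duration = config.getint("app", "max_duration", fallback=3600)
ollama_enabled = config.getboolean("ollama", "enabled", fallback=False)
ollama_url = config.get("ollama", "url", fallback="http://localhost:11434")
```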

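Similarly, the optional Ollama summarization the README describes amounts to one HTTP call against Ollama's `/api/generate` endpoint. A sketch under that assumption, using `requests`; the exact request `app.py` builds is not shown in this commit:

```python
import requests

def summarize(text: str, url: str, model: str, prompt: str) -> str:
    """Ask a local Ollama instance to summarize `text`.

    Assumes Ollama's standard /api/generate endpoint; url, model,
    and prompt come from the [ollama] section of config.ini.
    """
    response = requests.post(
        f"{url}/api/generate",
        json={
            "model": model,  # e.g. "mistral"
            "prompt": f"{prompt}\n\n{text}",
            "stream": False,  # return one JSON object rather than a stream
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]
```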
app.py (12 lines changed)

@@ -139,8 +139,10 @@ def transcribe_audio(
     result = model.transcribe(
         audio_file,
         language=language if language != "Auto-detect" else None,
-        beam_size=BEAM_SIZE,
-        vad_filter=VAD_FILTER,
+        batch_size=16,  # WhisperX uses batch_size instead of beam_size
+        vad=(
+            True if VAD_FILTER else False
+        ),  # WhisperX uses vad instead of vad_filter
     )

     # Get the full text with timestamps
@@ -459,8 +461,10 @@ def create_interface():
         result = model.transcribe(
             audio,
             language=lang if lang != "Auto-detect" else None,
-            beam_size=BEAM_SIZE,
-            vad_filter=VAD_FILTER,
+            batch_size=16,  # WhisperX uses batch_size instead of beam_size
+            vad=(
+                True if VAD_FILTER else False
+            ),  # WhisperX uses vad instead of vad_filter
         )

         # Get the full text with timestamps
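
For context on the swap above: faster-whisper's `transcribe()` takes decoding options such as `beam_size` and `vad_filter`, while WhisperX runs batched inference and takes `batch_size`. Note that upstream WhisperX configures VAD when the model is loaded rather than per call, so the `vad=` keyword passed here may be specific to this codebase. A minimal sketch of the documented WhisperX flow (model size and file name are placeholders):

```python
import whisperx

device = "cuda"  # or "cpu"; compute_type mirrors config.ini's [whisper] section
model = whisperx.load_model("base", device, compute_type="float32")

# WhisperX transcribes in batches; batch_size is a throughput knob,
# not a decoding parameter like faster-whisper's beam_size.
audio = whisperx.load_audio("example.mp3")  # placeholder input file
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text']}")
```

Two small review notes on the new code: `True if VAD_FILTER else False` is equivalent to `bool(VAD_FILTER)`, and the hardcoded `batch_size=16` bypasses the `batch_size` key the new `config.ini` documents.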