mirror of https://github.com/tcsenpai/youlama.git
synced 2025-06-06 11:15:38 +00:00
removed unused whisperx params
parent 02f580d195
commit d52cc2bf12
156
README.md
@@ -1,67 +1,96 @@
-# Whisper Transcription Web App
+# Audio/Video Transcription Web App

-A user-friendly web application for transcribing audio and video files using OpenAI's Whisper model, powered by Gradio and faster-whisper.
+A web application for transcribing audio and video files using WhisperX, with support for YouTube videos and optional summarization using Ollama.

 ## Features

-- 🎙️ Transcribe audio and video files
-- 🚀 GPU acceleration support
-- 🌐 Multiple language support
-- 📱 Responsive and modern UI
-- 🔄 Multiple model options (tiny to large-v3)
-- ⚙️ Configurable settings via config.ini
-- 📺 YouTube video support with subtitle extraction
+- Transcribe local audio/video files
+- Process YouTube videos (with subtitle extraction when available)
+- Automatic language detection
+- Multiple WhisperX model options
+- Optional text summarization using Ollama
+- Modern web interface with Gradio
+- Configurable settings via config.ini

 ## Requirements

-- Python 3.10+
-- CUDA-capable GPU (recommended)
-- FFmpeg (for audio/video processing)
-- uv package manager
+- Python 3.8+
+- CUDA-compatible GPU (recommended)
+- FFmpeg installed on your system
+- Ollama (optional, for summarization)

 ## Installation

-1. Clone this repository:
+1. Clone the repository:
 ```bash
 git clone <repository-url>
 cd whisperapp
 ```

-2. Install uv (if you just pip install you might break your environment):
+2. Install the required packages:
 ```bash
-curl -LsSf https://astral.sh/uv/install.sh | sh
+pip install -r requirements.txt
 ```

-3. Create a venv with uv:
+3. Install FFmpeg (if not already installed):
+- Ubuntu/Debian:
 ```bash
-uv venv --python=3.10
+sudo apt update && sudo apt install ffmpeg
+```
+- macOS:
+```bash
+brew install ffmpeg
+```
+- Windows: Download from [FFmpeg website](https://ffmpeg.org/download.html)
+
+4. Copy the example configuration file:
+```bash
+cp .env.example .env
 ```

-4. Install the required packages using uv:
-```bash
-uv pip install -r requirements.txt
-```
+5. Edit the configuration files:
+- `.env`: Set your environment variables
+- `config.ini`: Configure WhisperX, Ollama, and application settings

 ## Configuration

-The application can be configured through the `config.ini` file. Here are the available settings:
+### Environment Variables (.env)

-### Whisper Settings
-- `default_model`: Default Whisper model to use
-- `device`: Device to use (cuda/cpu)
-- `compute_type`: Computation type (float16/float32)
-- `beam_size`: Beam size for transcription
-- `vad_filter`: Enable/disable voice activity detection
+```ini
+# Server configuration
+SERVER_NAME=0.0.0.0
+SERVER_PORT=7860
+SHARE=true
+```

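The `.env` values shown in the new README could be consumed roughly as follows. This is a minimal sketch, not the project's actual loader: `load_server_settings` is a hypothetical helper, it assumes the variables have already been exported into the process environment (how the app loads `.env` is not shown in this diff), and the set of strings treated as true is an assumption.

```python
import os


def load_server_settings(env=None):
    """Read SERVER_NAME, SERVER_PORT and SHARE, falling back to the README defaults."""
    # Hypothetical helper: accepts any mapping so it can be exercised without touching os.environ.
    env = os.environ if env is None else env
    return {
        "server_name": env.get("SERVER_NAME", "0.0.0.0"),
        "server_port": int(env.get("SERVER_PORT", "7860")),
        # Assumption: "true", "1" and "yes" (any case) enable sharing; anything else disables it.
        "share": env.get("SHARE", "true").strip().lower() in {"true", "1", "yes"},
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the parsing rules easy to check in isolation.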
-### App Settings
-- `max_duration`: Maximum audio duration in seconds
-- `server_name`: Server hostname
-- `server_port`: Server port
-- `share`: Enable/disable public sharing
+### Application Settings (config.ini)

-### Models and Languages
-- `available_models`: Comma-separated list of available models
-- `available_languages`: Comma-separated list of supported languages
+```ini
+[whisper]
+default_model = base
+device = cuda
+compute_type = float32
+batch_size = 16
+vad = true
+
+[app]
+max_duration = 3600
+server_name = 0.0.0.0
+server_port = 7860
+share = true
+
+[models]
+available_models = tiny,base,small,medium,large-v1,large-v2,large-v3
+
+[languages]
+available_languages = en,es,fr,de,it,pt,nl,ja,ko,zh
+
+[ollama]
+enabled = false
+url = http://localhost:11434
+default_model = mistral
+summarize_prompt = Please provide a comprehensive yet concise summary of the following text. Focus on the main points, key arguments, and important details while maintaining accuracy and completeness. Here's the text to summarize:
+```

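For readers who want to see how a `config.ini` with these sections might be read, here is a minimal sketch using Python's standard `configparser`. The `parse_settings` helper and the `SAMPLE` string (a trimmed copy of the file above) are illustrative assumptions; the diff does not show how `app.py` actually loads its configuration.

```python
import configparser

# Trimmed copy of the config.ini shown in the README, used only for illustration.
SAMPLE = """
[whisper]
default_model = base
device = cuda
compute_type = float32
batch_size = 16
vad = true

[app]
max_duration = 3600
server_port = 7860
share = true

[models]
available_models = tiny,base,small,medium,large-v1,large-v2,large-v3
"""


def parse_settings(text):
    """Parse config text and return a few typed values (hypothetical helper)."""
    # interpolation=None so that a literal '%' in a value (e.g. in a prompt) is not interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        "default_model": parser.get("whisper", "default_model"),
        "batch_size": parser.getint("whisper", "batch_size"),
        "vad": parser.getboolean("whisper", "vad"),
        "max_duration": parser.getint("app", "max_duration"),
        # Comma-separated lists are split and stripped of surrounding whitespace.
        "models": [m.strip() for m in parser.get("models", "available_models").split(",")],
    }


settings = parse_settings(SAMPLE)
```

The typed getters (`getint`, `getboolean`) fail early on malformed values rather than letting a string such as `"true"` reach the transcription call.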
 ## Usage

@@ -70,43 +99,46 @@ The application can be configured through the `config.ini` file. Here are the av
 python app.py
 ```

-2. Open your web browser and navigate to `http://localhost:7860`
+2. Open your web browser and navigate to:
+```
+http://localhost:7860
+```

-3. Choose between two tabs:
-- **Local File**: Upload and transcribe audio/video files
-- **YouTube**: Process YouTube videos with subtitle extraction
+3. Use the interface to:
+- Upload and transcribe local audio/video files
+- Process YouTube videos
+- Generate summaries (if Ollama is configured)

-### Local File Tab
-1. Upload an audio or video file
-2. Select your preferred model and language settings
-3. Click "Transcribe" and wait for the results
+## Features in Detail

-### YouTube Tab
-1. Enter a YouTube URL (supports youtube.com, youtu.be, and invidious URLs)
-2. Select your preferred model and language settings
-3. Click "Process Video"
-4. The app will:
-- First try to extract available subtitles
-- If no subtitles are available, download and transcribe the video
+### Local File Transcription
+- Supports various audio and video formats
+- Automatic language detection
+- Multiple WhisperX model options
+- Optional summarization with Ollama

-## Model Options
+### YouTube Video Processing
+- Supports youtube.com, youtu.be, and invidious URLs
+- Automatically extracts subtitles if available
+- Falls back to transcription if no subtitles found
+- Optional summarization with Ollama

-- tiny: Fastest, lowest accuracy
-- base: Good balance of speed and accuracy
-- small: Better accuracy, moderate speed
-- medium: High accuracy, slower
-- large-v1/v2/v3: Highest accuracy, slowest
+### Summarization
+- Uses Ollama for text summarization
+- Configurable model selection
+- Customizable prompt
+- Available for both local files and YouTube videos

-## Tips
+## Notes

 - For better accuracy, use larger models (medium, large)
 - Processing time increases with model size
 - GPU is recommended for faster processing
-- Maximum audio duration is configurable in config.ini
-- Use uv for faster package installation and dependency resolution
+- Maximum audio duration is configurable (default: 60 minutes)
 - YouTube videos will first try to use available subtitles
 - If no subtitles are available, the video will be transcribed
+- Ollama summarization is optional and requires Ollama to be running

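The subtitle-first behaviour described in the notes can be sketched as a small function. This is only an illustration under stated assumptions: `get_video_text`, `fetch_subtitles` and `transcribe` are hypothetical names, and the real download and transcription steps are passed in as callables because they are not part of this diff.

```python
from typing import Callable, Optional, Tuple


def get_video_text(
    url: str,
    fetch_subtitles: Callable[[str], Optional[str]],
    transcribe: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is "subtitles" or "transcription"."""
    # Try the cheap path first: subtitles that already exist for the video.
    subtitles = fetch_subtitles(url)
    if subtitles:
        return subtitles, "subtitles"
    # No usable subtitles (None or empty string): fall back to transcription.
    return transcribe(url), "transcription"
```

Returning the source alongside the text lets the interface tell the user whether the result came from existing subtitles or from a fresh transcription.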
 ## License

-MIT License
+This project is licensed under the MIT License - see the LICENSE file for details.
12
app.py
@@ -139,8 +139,10 @@ def transcribe_audio(
 result = model.transcribe(
     audio_file,
     language=language if language != "Auto-detect" else None,
-    beam_size=BEAM_SIZE,
-    vad_filter=VAD_FILTER,
+    batch_size=16,  # WhisperX uses batch_size instead of beam_size
+    vad=(
+        True if VAD_FILTER else False
+    ),  # WhisperX uses vad instead of vad_filter
 )

 # Get the full text with timestamps
@@ -459,8 +461,10 @@ def create_interface():
 result = model.transcribe(
     audio,
     language=lang if lang != "Auto-detect" else None,
-    beam_size=BEAM_SIZE,
-    vad_filter=VAD_FILTER,
+    batch_size=16,  # WhisperX uses batch_size instead of beam_size
+    vad=(
+        True if VAD_FILTER else False
+    ),  # WhisperX uses vad instead of vad_filter
 )

 # Get the full text with timestamps
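Both hunks apply the same keyword mapping at two call sites, so one possible way to keep them in sync is a small helper. The sketch below is an assumption-laden illustration, not code from the commit: `build_transcribe_kwargs` is a hypothetical name, the keyword names `batch_size` and `vad` are copied from the diff as-is, and whether the installed WhisperX version's `transcribe` accepts them should be checked against its documentation. The default of 16 mirrors the hard-coded value in the diff.

```python
def build_transcribe_kwargs(language, vad_filter, batch_size=16):
    """Build the keyword arguments used by both transcribe call sites (hypothetical helper)."""
    return {
        # The UI label "Auto-detect" is mapped to None so the model detects the language itself.
        "language": None if language == "Auto-detect" else language,
        "batch_size": batch_size,
        # bool() is equivalent to the diff's `True if VAD_FILTER else False`.
        "vad": bool(vad_filter),
    }
```

A call site could then read `model.transcribe(audio_file, **build_transcribe_kwargs(language, VAD_FILTER))`, and the batch size could be taken from `config.ini` instead of being repeated as a literal.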