Show HN: Willow Inference Server: Optimized ASR/TTS/LLM for Willow/WebRTC/REST https://ift.tt/MgwLVuX

Hey HN! Willow Inference Server (WIS) is a focused and highly optimized language inference server implementation. Our goal is to "automagically" enable performant, cost-effective self-hosting of released state-of-the-art/best-of-breed models for speech and language tasks:

Primarily targeting CUDA (works on CPU too), with support for low-end (cheap) devices such as the Tesla P4, GTX 1060, and up. Don't worry - it screams on an RTX 4090 too! (See benchmarks on GitHub.)

Memory optimized. All three default Whisper models (base, medium, large-v2) load simultaneously, with TTS support, inside of 6GB VRAM. LLM support defaults to int4 quantization (conversion scripts included). ASR/STT + TTS + Vicuna 13B require roughly 18GB VRAM. Less for 7B, of course!

ASR. Heavy emphasis - Whisper optimized for very high quality, as-close-to-real-time-as-possible speech recognition via a variety of means (Willow, WebRTC, POST a file, integration with devices and client applications, etc). Results arrive in hundreds of milliseconds or less for most intended speech tasks. See the YouTube WebRTC demo[0].

TTS. Primarily provided for assistant tasks (like Willow!) and visually impaired users.

LLM. Optionally pass input through a provided/configured LLM for question answering, chatbot, and assistant tasks. Currently supports LLaMA derivatives, with a strong preference for Vicuna (I like 13B). Built-in support for quantization to int4 to conserve GPU memory.

Support for a variety of transports. REST, WebRTC, WebSockets (primarily for LLM).

Performance and memory optimized. Leverages CTranslate2 for Whisper support and AutoGPTQ for LLMs.

Willow support. WIS powers the Tovera-hosted best-effort example server Willow users enjoy.

Support for WebRTC. Stream audio in real time from browsers or WebRTC applications to optimize quality and response time. Heavily optimized for long-running sessions using WebRTC audio track management. Leave your session open for days at a time and have self-hosted ASR transcription within hundreds of milliseconds while conserving network bandwidth and CPU!

Support for custom TTS voices. With relatively small audio recordings, WIS can create and manage custom TTS voices. See the API documentation for more information.

Much like the release of Willow[1] last week, this is an early release, but we had a great response from HN and are looking forward to hearing what everyone thinks!

[0] - https://www.youtube.com/watch?v=PxCO5eONqSQ
[1] - https://ift.tt/PCAojtR

https://ift.tt/xaQ295e

May 23, 2023 at 07:22AM
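
For a rough sense of why the post defaults LLMs to int4, the weight footprint can be estimated with simple arithmetic. The sketch below is not taken from WIS; the helper name `weight_gib` is made up for illustration, and it treats "7B" and "13B" as nominal parameter counts. It counts weights only, ignoring quantization metadata, KV cache, activations, and framework overhead, which is presumably part of why the roughly 18GB total quoted above is larger than these numbers alone would suggest.

```python
def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GiB.

    Hypothetical helper for illustration; not part of WIS.
    Ignores quantization scales/zero-points, KV cache, and activations.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal parameter counts for the LLaMA-derived models named in the post.
for name, params in [("7B", 7.0), ("13B", 13.0)]:
    fp16 = weight_gib(params, 16)
    int4 = weight_gib(params, 4)
    print(f"{name}: fp16 ~{fp16:.1f} GiB, int4 ~{int4:.1f} GiB")
```

By this estimate, int4 cuts the weight footprint to a quarter of fp16, which is what makes a 13B model plausible on a single consumer GPU alongside the Whisper and TTS models.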

SouthendTV-206
