Streaming transcriber with whisper
Yuta Hayashibe 40dd471f6b Fix bugs
2022-10-17 21:29:12 +09:00

Whispering

MIT License Python Versions

Streaming transcriber with whisper. Transcribing in real time requires sufficient machine power.

Setup

pip install -U git+https://github.com/shirayu/whispering.git@v0.6.0

# If you use a GPU, install the torch and torchaudio builds that match it
# Check https://pytorch.org/get-started/locally/
# Example: torch for CUDA 11.6
pip install -U torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

If you get "OSError: PortAudio library not found" on Linux, install PortAudio.

sudo apt -y install portaudio19-dev

Example of microphone

# Run in English
whispering --language en --model tiny
  • --help shows all options
  • --model sets the name of the model to use. Larger models are more accurate, but may not be able to transcribe in real time.
  • --language sets the language to transcribe. The list of languages is shown with whispering -h
  • --no-progress disables the progress message
  • -t sets the temperatures used for decoding. You can set several, like -t 0.0 -t 0.1 -t 0.5, but too many temperatures increase decoding time
  • --debug outputs debug logs
  • --vad sets the VAD (Voice Activity Detection) threshold. The default is 0.5. A value of 0 disables VAD and forces whisper to analyze periods without voice activity as well
  • --output sets the output file (Default: Standard output)
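As an illustration of how the options above combine, here is a minimal sketch that assembles a whispering command line in Python. The helper name build_whispering_command and the file name transcript.txt are hypothetical; only the flags listed above are used, and the sketch only builds the argument list without running anything.

```python
from typing import List, Optional, Sequence


def build_whispering_command(
    language: str,
    model: str,
    temperatures: Sequence[float] = (),
    vad: Optional[float] = None,
    output: Optional[str] = None,
    no_progress: bool = False,
    debug: bool = False,
) -> List[str]:
    """Build an argument list for the whispering CLI (hypothetical helper)."""
    cmd = ["whispering", "--language", language, "--model", model]
    # -t may be given several times, once per temperature
    for temperature in temperatures:
        cmd += ["-t", str(temperature)]
    # --vad 0 disables VAD; leaving it out keeps the default threshold
    if vad is not None:
        cmd += ["--vad", str(vad)]
    # without --output, the transcript goes to standard output
    if output is not None:
        cmd += ["--output", output]
    if no_progress:
        cmd.append("--no-progress")
    if debug:
        cmd.append("--debug")
    return cmd


command = build_whispering_command(
    "en", "tiny", temperatures=[0.0, 0.1], output="transcript.txt", no_progress=True
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where whispering is installed.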

Parse interval

By default, whispering performs VAD every 3.75 seconds. This interval is determined by the value of -n, whose default is 20. When an interval is predicted to be "silence", it is not passed to whisper. To disable VAD, set the VAD threshold to 0 by adding --vad 0.
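The relationship between -n and the interval can be worked out with a little arithmetic. The sketch below assumes that the interval scales linearly with -n, calibrated only from the documented default (-n 20 gives 3.75 seconds); the function names are hypothetical and not part of whispering.

```python
import math

# Assumption: the VAD interval is proportional to -n.
# Calibrated from the documented default: -n 20 -> 3.75 seconds.
DEFAULT_N = 20
DEFAULT_INTERVAL_SEC = 3.75
SECONDS_PER_BLOCK = DEFAULT_INTERVAL_SEC / DEFAULT_N  # 0.1875 seconds per unit of -n


def vad_interval_seconds(n: int) -> float:
    """Length in seconds of one VAD interval for a given -n (assumed linear)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return n * SECONDS_PER_BLOCK


def intervals_to_exceed(total_seconds: float, n: int = DEFAULT_N) -> int:
    """Smallest number of intervals whose combined length exceeds total_seconds."""
    return math.floor(total_seconds / vad_interval_seconds(n)) + 1


for n in (10, 20, 40):
    print(n, vad_interval_seconds(n), intervals_to_exceed(30.0, n))
```

Under this assumption, a smaller -n gives shorter intervals and therefore finer-grained VAD decisions, at the cost of running VAD more often.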

By default, whisper does not run its analysis until the total length of the segments that VAD judged to contain speech exceeds 30 seconds. However, once speech has been detected, the analysis is also triggered when silent segments have appeared 16 times (the default value of --max_nospeech_skip).
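The buffering rule described above can be sketched as a small state machine. This is not whispering's actual implementation: the class SpeechBuffer and its fields are hypothetical, and it assumes that silent segments are counted consecutively (the count restarts whenever speech is detected), which the text above does not specify.

```python
from dataclasses import dataclass


@dataclass
class SpeechBuffer:
    """Hypothetical model of when buffered speech is handed to whisper."""

    interval_sec: float = 3.75     # length of one VAD interval (default -n 20)
    min_speech_sec: float = 30.0   # analysis starts once buffered speech exceeds this
    max_nospeech_skip: int = 16    # default value of --max_nospeech_skip
    buffered_sec: float = 0.0
    nospeech_count: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD result; return True when analysis should run now."""
        if is_speech:
            self.buffered_sec += self.interval_sec
            self.nospeech_count = 0  # assumption: only consecutive silences count
        elif self.buffered_sec > 0:
            self.nospeech_count += 1
        else:
            return False  # silence before any speech is simply skipped
        if (
            self.buffered_sec > self.min_speech_sec
            or self.nospeech_count >= self.max_nospeech_skip
        ):
            # Analysis runs on the buffered speech; start over afterwards
            self.buffered_sec = 0.0
            self.nospeech_count = 0
            return True
        return False


buf = SpeechBuffer()
results = [buf.push(True) for _ in range(9)]
print(results)
```

With the default values, eight speech intervals add up to exactly 30 seconds, so in this model the analysis fires on the ninth one. A short utterance followed by a long pause is flushed by the silence counter instead.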

Example of web socket

There is no security mechanism. Securing the connection is your own responsibility.

Run with --host and --port.

Host

whispering --language en --model tiny --host 0.0.0.0 --port 8000

Client

whispering --host ADDRESS_OF_HOST --port 8000 --mode client

You can set -n and other options.

License