
Whispering

Streaming transcriber with Whisper. Sufficient machine power is needed to transcribe in real time.

Setup

pip install -U git+https://github.com/shirayu/whispering.git@v0.5.1

# If you use a GPU, install the proper torch and torchaudio builds
# Check https://pytorch.org/get-started/locally/
# Example: torch for CUDA 11.6
pip install -U torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

If you get OSError: PortAudio library not found on Linux, install PortAudio.

sudo apt -y install portaudio19-dev
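
If you are unsure whether the install worked, here is a quick standard-library-only check (not part of whispering itself) that the PortAudio shared library is visible to Python:

```python
# Stdlib-only sanity check for the PortAudio shared library.
# Prints a library name such as "libportaudio.so.2" when installed,
# or None when the library cannot be found.
import ctypes.util

res = ctypes.util.find_library("portaudio")
print(res)
```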

Example of microphone

# Run in English
whispering --language en --model tiny
  • --help shows all options
  • --model sets the model name to use. Larger models are more accurate, but may not be able to transcribe in real time.
  • --language sets the language to transcribe. The list of available languages is shown with whispering -h
  • --no-progress disables the progress message
  • -t sets the temperatures for decoding. You can set several, like -t 0.0 -t 0.1 -t 0.5, but too many temperatures exhaust decoding time
  • --debug outputs debug logs
  • --no-vad disables VAD (Voice Activity Detection), forcing Whisper to analyze non-speech periods as well
  • --output sets the output file (default: standard output)
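
The -t option follows the temperature-fallback pattern Whisper uses: temperatures are tried in ascending order, and decoding falls back to the next temperature when a result fails a quality check. A minimal sketch of that control flow, with hypothetical stand-ins for the decode and quality-check functions:

```python
# Sketch of temperature-fallback decoding. The decode and
# is_acceptable callables are hypothetical stand-ins for Whisper's
# decoder and its compression-ratio / log-probability thresholds.
from typing import Callable, Sequence

def decode_with_fallback(
    decode: Callable[[float], str],
    is_acceptable: Callable[[str], bool],
    temperatures: Sequence[float] = (0.0, 0.1, 0.5),
) -> str:
    result = ""
    for t in temperatures:
        # Each extra temperature is another full decoding attempt,
        # which is why long -t lists exhaust decoding time.
        result = decode(t)
        if is_acceptable(result):
            return result
    return result  # keep the last attempt if none passed the check

# Toy usage: pretend only temperature 0.5 yields acceptable text.
out = decode_with_fallback(
    decode=lambda t: f"text@{t}",
    is_acceptable=lambda s: s.endswith("0.5"),
)
print(out)  # text@0.5
```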

Parse interval

By default, whispering performs VAD every 3.75 seconds. This interval is determined by the value of -n, whose default is 20. When an interval is predicted to be silence, it is not passed to Whisper. If you want to disable VAD, use the --no-vad option.
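
As a back-of-the-envelope sketch of that relationship: if the default of -n 20 corresponds to the stated 3.75-second interval, then each unit of -n adds 3.75 / 20 = 0.1875 seconds (an assumption inferred from the text above, not a documented constant):

```python
# Assumed per-unit chunk length, inferred from: -n 20 -> 3.75 s.
CHUNK_SECONDS = 3.75 / 20  # 0.1875 s per unit of -n

def vad_interval(n: int = 20) -> float:
    """VAD interval in seconds for a given -n value."""
    return n * CHUNK_SECONDS

print(vad_interval(20))  # 3.75
print(vad_interval(10))  # 1.875
```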

By default, Whisper does not perform analysis until the total length of the segments determined by VAD to contain speech exceeds 30 seconds. This is because Whisper is trained to make predictions over 30-second intervals. If you want to force Whisper to analyze segments shorter than 30 seconds, use the --allow-padding option like this:

whispering --language en --model tiny -n 20 --allow-padding

This forces Whisper to analyze every 3.75-second speech segment. Using --allow-padding may sacrifice accuracy in exchange for quicker responses; the smaller the value of -n used with --allow-padding, the worse the accuracy becomes.

Example of web socket

There is no security mechanism. Securing the connection is your own responsibility.

Run with --host and --port.

Host

whispering --language en --model tiny --host 0.0.0.0 --port 8000

Client

whispering --host ADDRESS_OF_HOST --port 8000 --mode client

You can also set -n, --allow-padding, and other options.

License