MLX - Whisper Tutorial - 1

Updated 2024/12/10. Published 2023/12/10. Reading time: about 4 minutes.

Before starting, clone the relevant repository from GitHub into your environment.

$ git clone https://github.com/ml-explore/mlx-examples.git

Once that is done, enter the folder:

$ cd ./mlx-examples/whisper

It contains the following resources:

# whisper
.
|-- whisper           # the whisper package for MLX
|-- README.md
|-- benchmark.py      # performance benchmark
|-- requirements.txt  # dependency list
`-- test.py           # tests

Following the README, set up a Python environment first (I used Python 3.9), then install the dependencies:

$ pip install -r requirements.txt

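Next, install the multimedia conversion library ffmpeg. I use Homebrew (brew) on macOS here. To install brew itself, copy the install command from its official site and run it in the terminal, and don't forget to also run the commands it prints when it finishes.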

$ brew install ffmpeg

That completes the environment setup.
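Before moving on, it may be worth checking that the ffmpeg installed above is actually visible from Python. The following is only a minimal sketch using the standard library; the helper name find_ffmpeg is my own and is not part of the repository.

```python
import shutil


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path of the executable found on PATH, or None if missing."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_ffmpeg()
    if location is None:
        print("ffmpeg not found; try `brew install ffmpeg` and reopen the terminal")
    else:
        print(f"ffmpeg found at {location}")
```

If this reports that ffmpeg is missing right after installation, the shell probably has not picked up the PATH changes that brew asked for.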

The documentation also provides a sample, as follows:

import whisper

text = whisper.transcribe(speech_file)["text"]

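Change speech_file to the path of an audio file (the function signature also accepts audio data as a NumPy or MLX array) and run it; transcription starts right away.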

However, the documentation doesn't mention any other usage, so I dug through the source code...

# ./whisper/whisper/transcribe.py

def transcribe(
    audio: Union[str, np.ndarray, mx.array],
    *,
    model: str = "tiny",
    verbose: Optional[bool] = None,
    temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
    condition_on_previous_text: bool = True,
    initial_prompt: Optional[str] = None,
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
    **decode_options,
):

...

It turns out there are other common options that can be adjusted, such as:

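model: tiny, base, small, medium, large. Selects the model size.
verbose: True, False. Whether to print status. With True, the transcribed text and its timestamps are printed continuously during transcription.
temperature: 0.0, 0.2, 0.4, 0.6, 0.8, 1.0. The sampling temperature. When a tuple is given, as in the default, the values are tried in order: a higher temperature is used only when decoding at the previous one fails the compression_ratio_threshold or logprob_threshold check.
initial_prompt: Starting text that gives the model context about what it is going to transcribe. If some proper nouns are not recognized correctly, add them here as hints.
decode_options: fp16, ... Lets you switch to 16-bit floating-point computation; the default is 32-bit.
The language appears to be detected automatically; at the start, detection works on roughly the first 30 seconds of audio.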
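The options above can be combined in a single call. This is only a sketch: the option values are arbitrary examples, the OPTIONS dictionary and the transcribe_file helper are names I made up, and fp16 is assumed to be forwarded through **decode_options as described above.

```python
# Example values for the keyword arguments listed above (arbitrary choices).
OPTIONS = {
    "model": "small",                          # one of tiny/base/small/medium/large
    "verbose": False,                          # do not print segments while decoding
    "temperature": 0.0,                        # a single value instead of the default tuple
    "initial_prompt": "MLX, Whisper, ffmpeg",  # hints for proper nouns
    "fp16": True,                              # assumed to pass through **decode_options
}


def transcribe_file(audio_path, options=None):
    """Transcribe one file with the given options and return only the text."""
    # Imported lazily: the MLX whisper package lives in mlx-examples/whisper,
    # so this call only works when run from inside that folder.
    import whisper

    chosen = OPTIONS if options is None else options
    return whisper.transcribe(audio_path, **chosen)["text"]
```

Keeping the options in a dictionary makes it easy to compare runs, for example swapping only the model between two transcriptions of the same file.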

Below is the code I used for my own experiment:

from pathlib import Path

import whisper

path = Path('./TEST.mp3')

# Transcribe with the medium model, printing each segment as it is decoded
text = whisper.transcribe(str(path), model='medium', verbose=True)["text"]

# Save next to the source file as <name>-whisper.txt
out_path = path.with_name(f'{path.stem}-whisper.txt')
out_path.write_text(text, encoding='utf-8')
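The experiment above saves only the plain text. If the timestamps shown by verbose=True are also wanted in the file, something like the sketch below might work. It assumes the dictionary returned by transcribe also contains a "segments" list whose items have "start", "end" and "text" keys, as in the original OpenAI Whisper; I have not confirmed this for the MLX version, and both helper names are my own.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def segments_to_lines(segments):
    """Turn segment dictionaries (assumed keys: start, end, text) into text lines."""
    return [
        f"[{format_timestamp(s['start'])} --> {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]


if __name__ == "__main__":
    # Hand-written sample data standing in for result["segments"]
    sample = [
        {"start": 0.0, "end": 2.5, "text": " Hello"},
        {"start": 2.5, "end": 5.0, "text": " world"},
    ]
    print("\n".join(segments_to_lines(sample)))
```

The resulting lines can be written to the output file in place of the plain text.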