Training OpenAI Whisper V2: Adding Subtitles to Your Videos

Updated 2024/10/22. Published 2023/10/23.

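Introduction:

Adding subtitles to videos is tedious and boring, and online subtitling services raise the worry that your script could leak and your copyright be infringed. To solve both pain points, there is now a very capable model you can train yourself, so you can focus your video production on content and presentation. I think you will like the approach in this article 💕. If you want to know more, you can sign up for a free consultation; hearing about everyday pain points helps me work out solutions step by step for everyone!

OpenAI has released Whisper large-v2, the second version of its large pre-trained model for speech recognition and speech translation. Whisper was proposed by Alec Radford et al. of OpenAI in the paper Robust Speech Recognition via Large-Scale Weak Supervision. The original code repository is openai/whisper on GitHub.

It can recognize many languages. If you are interested, try the demo first.

The demo includes an audio file and the translated text.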

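Example snippet for subtitling a video:

The sample code below takes the audio track of a video as input and outputs the transcribed text together with the start and end time of each speech segment.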

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-large-v2",
  chunk_length_s=30,
  device=device,
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

prediction = pipe(sample.copy(), batch_size=8)["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]

#[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
#  'timestamp': (0.0, 5.44)}]
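
The chunks returned with return_timestamps=True already carry everything a subtitle file needs. The following is a minimal sketch of turning them into SRT text, assuming each chunk is a dict with a "text" string and a "timestamp" pair of seconds, as in the sample output above. The helper names format_srt_time and chunks_to_srt are made up for this illustration and are not part of transformers.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Build SRT text from pipeline-style chunks (hypothetical helper)."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: a final chunk may come back without an end time;
            # fall back to the start time so the cue is still well-formed.
            end = start
        text = chunk["text"].strip()
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    # Cues are separated by one blank line.
    return "\n".join(blocks)


# The chunk shown in the sample output above.
example_chunks = [
    {
        "text": " Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.",
        "timestamp": (0.0, 5.44),
    }
]
srt_text = chunks_to_srt(example_chunks)
print(srt_text)
```

Writing srt_text to a file with the same base name as the video (for example video.srt next to video.mp4) is usually enough for players to pick up the subtitles.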

Example snippet for translating subtitles:

The example below automatically translates French speech into English text.

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

# load streaming dataset and read first audio sample
ds = load_dataset("common_voice", "fr", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]
input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# [' A very interesting work, we will finally be given on this subject.']

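Model overview:

Whisper is a Transformer-based encoder-decoder model, also called a sequence-to-sequence model. It was trained on 680,000 hours of labelled speech data annotated using large-scale weak supervision.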

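The models were trained on either English-only or multilingual data. The English-only models were trained on speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts a transcription in the same language as the audio. For speech translation, the model predicts a transcription in a different language from the audio (English, in Whisper's case).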
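
As far as I understand the Whisper paper, the task is chosen by a short sequence of special tokens placed at the start of the decoder input, which is what get_decoder_prompt_ids(language="french", task="translate") sets up in the translation example. The sketch below only assembles those tokens as strings to show the idea. The function name decoder_prompt is hypothetical, and the real tokenizer works with token ids rather than strings.

```python
from typing import List


def decoder_prompt(language_code: str, task: str, timestamps: bool = False) -> List[str]:
    """Illustrative only: the special tokens that steer a multilingual Whisper model."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = [
        "<|startoftranscript|>",
        f"<|{language_code}|>",  # language of the audio, e.g. fr or zh
        f"<|{task}|>",           # same-language text, or translation into English
    ]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


# Speech recognition: French audio in, French text out.
print(decoder_prompt("fr", "transcribe"))
# Speech translation: French audio in, English text out.
print(decoder_prompt("fr", "translate"))
```

The only difference between the two tasks is the third token, which is why one multilingual checkpoint can serve both.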

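Whisper checkpoints come in five model sizes. The four smallest are available trained on either English-only or multilingual data. The largest checkpoints are multilingual only. All ten pre-trained checkpoints are available on the Hugging Face Hub.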
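
To make the count of ten concrete, here is a small sketch that enumerates the checkpoint names. The naming pattern (openai/whisper-SIZE, with a .en suffix for English-only variants) is how I recall them appearing on the Hugging Face Hub, so check the Hub before relying on any single name.

```python
SIZES = ["tiny", "base", "small", "medium", "large"]


def checkpoint_names():
    """List the ten Hub model ids implied by the description above."""
    names = []
    for size in SIZES:
        if size != "large":
            # The four smallest sizes also have an English-only variant.
            names.append(f"openai/whisper-{size}.en")
        names.append(f"openai/whisper-{size}")
    # The large size has a second multilingual release, used in this article.
    names.append("openai/whisper-large-v2")
    return names


for name in checkpoint_names():
    print(name)
```

Smaller checkpoints are a reasonable starting point for fine-tuning on a home GPU, since they need far less memory than large-v2.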

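Fine-tuning the Whisper model:

Fine-tuning strengthens a model for a specific task. Because most of the pre-training audio is English, getting good Chinese subtitles calls for fine-tuning on a Chinese speech dataset. Fortunately, Mozilla's Common Voice dataset contains a large amount of Traditional Chinese speech with Taiwanese accents, so fine-tuning a model specifically for Taiwanese users is no problem.

A method for fine-tuning the model can be found at the following URL: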

https://github.com/openai/whisper/discussions/988

The training data is the large multilingual supervised-learning dataset provided by the non-profit organization Mozilla:

https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0
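
The data preparation step for fine-tuning can be sketched as below. This is not the exact procedure from the discussion linked above, only a minimal outline under several assumptions: the "zh-TW" configuration name, the "audio" and "sentence" column names, and the need to accept the dataset terms and log in to the Hub are as I remember them for Common Voice 11.0, and loading may behave differently depending on your datasets version. The function names are made up for this example.

```python
def load_common_voice_zh_tw(split: str = "train"):
    """Load the Taiwanese Mandarin subset, resampled to 16 kHz (needs network and Hub access)."""
    # Imported lazily so the rest of this sketch runs without the datasets package.
    from datasets import Audio, load_dataset

    ds = load_dataset("mozilla-foundation/common_voice_11_0", "zh-TW", split=split)
    # Whisper expects 16 kHz audio, as in the translation example earlier.
    return ds.cast_column("audio", Audio(sampling_rate=16_000))


def prepare_example(example, feature_extractor, tokenizer):
    """Turn one dataset row into model inputs and labels (hypothetical helper)."""
    audio = example["audio"]
    # Waveform -> log-Mel input features for the encoder.
    features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    # Reference sentence -> token ids used as decoder labels.
    labels = tokenizer(example["sentence"]).input_ids
    return {"input_features": features.input_features[0], "labels": labels}
```

With the real library, the two callables would typically come from a WhisperProcessor (its feature_extractor and tokenizer attributes), and prepare_example would be applied to every row with the dataset's map method.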

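Screenshot of a successful fine-tuning run on my own computer. Once fine-tuning is done, SRT subtitle files can be generated automatically for all videos.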

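Takeaways:

Large Transformer encoder-decoder models keep appearing, with high accuracy and a grasp of language beyond what most people imagine. Compared with traditional manual subtitling, OpenAI's publicly released Whisper V2 can generate subtitle files quickly, accurately, and around the clock, which is great news for content creators!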

Citation:

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}