Dot-Net
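What is the best way to translate a large amount of text data?
I have a lot of text data and want to translate it into different languages.
Possible ways I know of:
The problem is that all of these services limit text length, number of calls, and so on, which makes them inconvenient to use.
What services or approaches would you suggest in this case?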
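I had to solve the same problem when integrating language translation with an XMPP chat server. I divided my payload (the text I needed to translate) into smaller subsets of complete sentences.
I can't recall the exact number, but using Google's REST-based translation URL, I translated sets of complete sentences that totaled 1024 characters or fewer, so one large paragraph resulted in multiple calls to the translation service.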
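A minimal sketch of that batching step, written in Python for brevity rather than .NET. The 1024-character limit is only the figure recalled above and may not match the service's real limit; `split_sentences` and `pack_sentences` are illustrative names, and the splitter is a naive one that only looks at `.`, `!`, and `?`.

```python
import re

MAX_CHARS = 1024  # assumed limit; the exact figure was not recalled


def split_sentences(text):
    """Split text into sentences, keeping each sentence's end punctuation."""
    # A sentence is a run of non-terminator characters plus any trailing . ! ? marks.
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]


def pack_sentences(sentences, max_chars=MAX_CHARS):
    """Greedily group whole sentences into chunks of at most max_chars characters."""
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than the limit becomes its own oversized chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk returned by `pack_sentences(split_sentences(text))` would then be sent as one request, so a long paragraph turns into several calls without ever cutting a sentence in half.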
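Break your large text into tokenized strings, then pass each token to the translator in a loop. Store the translated output in an array, and once all the tokens have been translated and stored, put them back together in order and you will have a fully translated document.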
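As a rough sketch of that loop (in Python, with hypothetical names): `translate` stands in for whatever service call you use and is passed in as a function, and splitting on newlines is just an assumption about how the text is tokenized.

```python
def translate_document(text, translate, delimiter="\n"):
    """Translate text piece by piece and reassemble the pieces in order."""
    tokens = text.split(delimiter)
    translated = []
    for token in tokens:
        # Leave blank pieces alone so the layout survives and no call is wasted.
        translated.append(translate(token) if token.strip() else token)
    return delimiter.join(translated)
```

For example, `translate_document("a\n\nb", str.upper)` uses `str.upper` as a stand-in translator and keeps the blank line between the two pieces.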
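Just to prove a point, I threw this together :) It is rough around the edges, but it can handle a lot of text, and its translation accuracy is as good as Google's because it uses the Google API. I processed Apple's entire 2005 SEC 10-K filing with this code at the click of a button (it took about 45 minutes).
The results were basically the same as what you would get by copying and pasting one sentence at a time into Google Translate. It isn't perfect (the ending punctuation is not accurate, and I didn't write to the text file line by line), but it does show a proof of concept. With a bit more Regex work it could handle punctuation better.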
Imports System.IO
Imports System.Text.RegularExpressions

Public Class Form1

    ' Input file, looked up relative to the working directory.
    Dim file As String = "Translate Me.txt"
    Dim lineCount As Integer = countLines()

    ' Returns the number of lines in the input file (1 if the file is missing).
    Private Function countLines() As Integer
        If IO.File.Exists(file) Then
            Dim reader As New StreamReader(file)
            Dim lineCount As Integer = Split(reader.ReadToEnd.Trim(), Environment.NewLine).Length
            reader.Close()
            Return lineCount
        Else
            MsgBox(file + " cannot be found anywhere!", 0, "Oops!")
        End If
        Return 1
    End Function

    Private Sub translateText()
        Dim lineLoop As Integer = 0
        Dim currentLine As String
        Dim currentLineSplit() As String
        Dim input1 As New StreamReader(file)   ' whole-file pass, used to size the array
        Dim input2 As New StreamReader(file)   ' line-by-line pass, used for translation
        Dim filePunctuation As Integer = 1
        Dim linePunctuation As Integer = 1
        Dim delimiters() As Char = {"."c, "!"c, "?"c}

        ' Count sentence terminators in the whole file.
        Dim entireFile As String = input1.ReadToEnd()
        For Each ch As Char In entireFile
            If Array.IndexOf(delimiters, ch) >= 0 Then filePunctuation += 1
        Next

        ' Upper bound on the number of sentence fragments.
        Dim sentenceArraySize As Integer = filePunctuation + lineCount
        Dim sentenceArrayCount As Integer = 0
        Dim sentence(sentenceArraySize) As String
        Dim sentenceLoop As Integer

        While lineLoop < lineCount
            linePunctuation = 1
            currentLine = input2.ReadLine()
            If currentLine Is Nothing Then Exit While

            ' Terminators + 1 = number of fragments Split will return for this line.
            For Each ch As Char In currentLine
                If Array.IndexOf(delimiters, ch) >= 0 Then linePunctuation += 1
            Next

            currentLineSplit = currentLine.Split(delimiters)
            sentenceLoop = 0
            While linePunctuation > 0
                Try
                    Dim trans As New Google.API.Translate.TranslateClient("")
                    sentence(sentenceArrayCount) = trans.Translate( _
                        currentLineSplit(sentenceLoop), _
                        Google.API.Translate.Language.English, _
                        Google.API.Translate.Language.German, _
                        Google.API.Translate.TranslateFormat.Text)
                    sentenceLoop += 1
                    linePunctuation -= 1
                    sentenceArrayCount += 1
                Catch ex As Exception
                    ' Skip any fragment that fails to translate.
                    sentenceLoop += 1
                    linePunctuation -= 1
                End Try
            End While
            lineLoop += 1
        End While

        ' Write only the slots that were actually filled, each followed by a period.
        Dim newFile As String = "Translated Text.txt"
        Dim outputLoopCount As Integer = 0
        Using output As StreamWriter = New StreamWriter(newFile)
            While outputLoopCount < sentenceArrayCount
                output.Write(sentence(outputLoopCount) + ". ")
                outputLoopCount += 1
            End While
        End Using
        input1.Close()
        input2.Close()
    End Sub

    Private Sub translateButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles translateButton.Click
        translateText()
    End Sub

End Class
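
The punctuation problem mentioned above comes from splitting on `.`, `!`, and `?` and then writing a period after every fragment. One possible fix with a regular expression, sketched in Python rather than VB.NET, is to capture each sentence together with its own terminator and reattach that terminator after translation. `translate` is a placeholder for the real service call, and the pattern is an assumption that will mis-split abbreviations and decimals such as `3.5`.

```python
import re

# Sentence body followed by its run of end punctuation (empty at end of input).
SENTENCE = re.compile(r"\s*([^.!?]+)([.!?]*)")


def translate_keeping_punctuation(text, translate):
    """Translate sentence bodies and put each original terminator back."""
    parts = []
    for body, terminator in SENTENCE.findall(text):
        body = body.strip()
        if body:
            # Only the body is sent for translation; the terminator is reused as-is.
            parts.append(translate(body) + terminator)
    return " ".join(parts)
```

With `str.upper` as a stand-in translator, `translate_keeping_punctuation("Hi there! How are you? Fine.", str.upper)` keeps the `!`, `?`, and `.` where they were.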