
What is the best way to translate a large amount of text data?

  • July 18, 2021

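I have a lot of text data and want to translate it into different languages.

Possible ways I know of:

The problem is that all these services have limits on text length, number of calls, and so on, which makes them inconvenient to use.

What services or approaches would you suggest in this case?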

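I had to solve the same problem when integrating language translation with an XMPP chat server. I partitioned my payload (the text I needed to translate) into smaller subsets of complete sentences.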

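I don't remember the exact number, but using Google's REST-based translation URL, I translated one set of complete sentences at a time, each set totalling 1024 characters or fewer, so a large paragraph would result in multiple calls to the translation service.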
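The packing step described here can be sketched roughly as follows. This is a minimal Python illustration, not the answerer's code: the function name `chunk_sentences` and the `max_chars` parameter are hypothetical, the 1024 default is only the figure recalled in the answer, and sentences are detected naively by `.`, `!` or `?` followed by whitespace.

```python
import re


def chunk_sentences(text, max_chars=1024):
    """Group whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: 1024 is the limit the answer recalls, not a
    documented quota of any particular service.
    """
    # Split after ., ! or ? when followed by whitespace; punctuation is kept.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            # The sentence still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than the limit is kept whole here;
            # a real client would have to split it further.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be sent as one request, so a long paragraph turns into several calls, as the answer describes.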

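Break your large text into tokenized strings, then pass each token to the translator in a loop. Store the translated output in an array; once all the tokens have been translated and stored, put them back together in order, and you will have a fully translated document.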
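A minimal Python sketch of that loop, under stated assumptions: `translate_document` is a hypothetical name, and the translator is passed in as a plain callable so that no particular service API is implied. `fake_translate` is only a stand-in that upper-cases its input.

```python
from typing import Callable, List


def translate_document(tokens: List[str], translate: Callable[[str], str]) -> str:
    """Translate each token in order and join the results into one document."""
    translated = []
    for token in tokens:
        # One translator call per token; results are stored in input order.
        translated.append(translate(token))
    # Put the translated pieces back together in their original order.
    return " ".join(translated)


def fake_translate(text: str) -> str:
    """Stand-in for a real translation call (assumption: it just upper-cases)."""
    return text.upper()
```

For example, `translate_document(["Hello there.", "How are you?"], fake_translate)` returns `"HELLO THERE. HOW ARE YOU?"`. Swapping `fake_translate` for a function that calls a real service leaves the loop unchanged.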

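Just to prove the point, I threw this together :) It is rough around the edges, but it can handle a large amount of text, and its translation accuracy is as good as Google's because it uses the Google API. I processed Apple's entire 2005 SEC 10-K filing with this code at the click of a button (it took about 45 minutes).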

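The result was essentially the same as what you would get by copying and pasting one sentence at a time into Google Translate. It isn't perfect (the ending punctuation is not accurate, and I didn't write to the text file line by line), but it does show a proof of concept. It could have better punctuation if you worked with Regex some more.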

Imports System.IO
Imports System.Text.RegularExpressions

Public Class Form1

   ' Path of the text file to translate.
   Dim file As String = "Translate Me.txt"
   Dim lineCount As Integer = countLines()

   ' Returns the number of lines in the input file, or 1 if the file is missing.
   Private Function countLines() As Integer

       If IO.File.Exists(file) Then

           ' Using closes the reader even if reading fails.
           Using reader As New StreamReader(file)
               Return Split(reader.ReadToEnd().Trim(), Environment.NewLine).Length
           End Using

       Else

           MsgBox(file + " cannot be found anywhere!", 0, "Oops!")

       End If

       Return 1

   End Function

   Private Sub translateText()

       Dim lineLoop As Integer = 0
       Dim currentLine As String
       Dim currentLineSplit() As String
       Dim input1 As New StreamReader(file)
       Dim input2 As New StreamReader(file)
       Dim filePunctuation As Integer = 1
       Dim linePunctuation As Integer = 1

       ' Sentence delimiters. (Dim delimiters(3) would allocate four slots and leave a stray null character.)
       Dim delimiters() As Char = {"."c, "!"c, "?"c}

       Dim entireFile As String
       entireFile = (input1.ReadToEnd)

       ' Count every sentence-ending mark in the file in a single pass.
       For Each ch As Char In entireFile
           If ch = "."c OrElse ch = "!"c OrElse ch = "?"c Then filePunctuation += 1
       Next

       Dim sentenceArraySize = filePunctuation + lineCount

       Dim sentenceArrayCount = 0
       Dim sentence(sentenceArraySize) As String
       Dim sentenceLoop As Integer

       While lineLoop < lineCount

           linePunctuation = 1

           currentLine = input2.ReadLine()
           If currentLine Is Nothing Then Exit While

           ' Count the sentence-ending marks on this line in a single pass.
           For Each ch As Char In currentLine
               If ch = "."c OrElse ch = "!"c OrElse ch = "?"c Then linePunctuation += 1
           Next

           currentLineSplit = currentLine.Split(delimiters)
           sentenceLoop = 0

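           ' Split() yields one more fragment than there are delimiters on the line,
           ' which is why linePunctuation starts at 1.
           While linePunctuation > 0

               Dim fragment As String = currentLineSplit(sentenceLoop).Trim()

               Try

                   ' Skip empty fragments, such as the one Split() leaves after a final period.
                   If fragment.Length > 0 Then
                       Dim trans As New Google.API.Translate.TranslateClient("")
                       sentence(sentenceArrayCount) = trans.Translate(fragment, Google.API.Translate.Language.English, Google.API.Translate.Language.German, Google.API.Translate.TranslateFormat.Text)
                       sentenceArrayCount += 1
                   End If

               Catch ex As Exception

                   ' A failed request drops this sentence; the loop moves on to the next fragment.

               End Try

               sentenceLoop += 1
               linePunctuation -= 1

           End While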

           lineLoop += 1

       End While

       Dim newFile As String = "Translated Text.txt"
       Dim outputLoopCount As Integer = 0

       Using output As StreamWriter = New StreamWriter(newFile)

           ' Write only the slots that were actually filled; looping to
           ' sentenceArraySize would emit a bare ". " for every unused slot.
           While outputLoopCount < sentenceArrayCount

               output.Write(sentence(outputLoopCount) + ". ")

               outputLoopCount += 1

           End While

       End Using

       input1.Close()
       input2.Close()

   End Sub

   Private Sub translateButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles translateButton.Click

       translateText()

   End Sub

End Class
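The answer notes that the ending punctuation comes out inaccurate, since every sentence is written back with a period, and that more work with Regex could improve it. One possible approach, sketched here in Python rather than VB.NET, is to capture each sentence together with its own terminator and re-attach that terminator after translation. The names `split_with_terminators` and `rejoin` are hypothetical, and the pattern is deliberately naive: it would also split on the dot in a number such as 3.5.

```python
import re

# A run of non-terminator characters followed by its terminator(s), if any.
SENTENCE = re.compile(r"\s*([^.!?]+)([.!?]*)")


def split_with_terminators(text):
    """Return (sentence, terminator) pairs, e.g. ("Is it done", "?")."""
    pairs = []
    for body, end in SENTENCE.findall(text):
        body = body.strip()
        if body:  # ignore whitespace-only matches at the end of the text
            pairs.append((body, end))
    return pairs


def rejoin(pairs, translate):
    """Translate each sentence body and restore its original terminator."""
    return " ".join(translate(body) + end for body, end in pairs)
```

With `str.upper` standing in for a real translator, `rejoin(split_with_terminators("Is it done? Yes! Fine."), str.upper)` gives `"IS IT DONE? YES! FINE."`, so question marks and exclamation marks survive the round trip.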

Source: https://stackoverflow.com/questions/2448097