
Why is a compiled RegEx slower than an interpreted RegEx?

  • May 14, 2011

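I came across this article:

Performance: Compiled vs. Interpreted Regular Expressions. I modified the sample code to compile 1000 regular expressions and then run each one 500 times to take advantage of precompilation, but even in that case the interpreted regular expressions run 4 times faster!

This means the RegexOptions.Compiled option is completely useless, and actually worse: it is slower! Most of the difference turned out to be due to JIT. After accounting for JIT, the compiled regular expressions in the following code still run slightly slower, which doesn't make sense to me, but @Jim's answer provides a much cleaner version that works as expected.

Can anyone explain why this is the case?

Code, taken and modified from the blog post: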

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace RegExTester
{
   class Program
   {
       static void Main(string[] args)
       {
           DateTime startTime = DateTime.Now;

           for (int i = 0; i < 1000; i++)
           {
               CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());    
           }


           double msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
           Console.WriteLine("Full Run: " + msTaken);


           startTime = DateTime.Now;

           for (int i = 0; i < 1000; i++)
           {
               CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());
           }


           msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
           Console.WriteLine("Full Run: " + msTaken);

           Console.ReadLine();

       }


       private static List<Regex> _expressions;
       private static object _SyncRoot = new object();

       private static List<Regex> GetExpressions()
       {
           if (_expressions != null)
               return _expressions;

           lock (_SyncRoot)
           {
               if (_expressions == null)
               {
                   DateTime startTime = DateTime.Now;

                   List<Regex> tempExpressions = new List<Regex>();
                   string regExPattern =
                       @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@{0}$";

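                   // Note: "." followed by ".com"/".net" produces a double dot
                   // (e.g. "domain0..com"); kept as posted.
                   for (int i = 0; i < 2000; i++)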
                   {
                       tempExpressions.Add(new Regex(
                           string.Format(regExPattern,
                           Regex.Escape("domain" + i.ToString() + "." +
                           (i % 3 == 0 ? ".com" : ".net"))),
                           RegexOptions.IgnoreCase));//  | RegexOptions.Compiled
                   }

                   _expressions = new List<Regex>(tempExpressions);
                   DateTime endTime = DateTime.Now;
                   double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
                   Console.WriteLine("Init:" + msTaken);
               }
           }

           return _expressions;
       }

       static  List<Regex> expressions = GetExpressions();

       private static void CheckForMatches(string text)
       {

           DateTime startTime = DateTime.Now;


               foreach (Regex e in expressions)
               {
                   bool isMatch = e.IsMatch(text);
               }


           DateTime endTime = DateTime.Now;
           //double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
           //Console.WriteLine("Run: " + msTaken);

       }
   }
}

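Compiled regular expressions match faster when used as intended. As others have pointed out, the idea is to compile them once and use them many times. The construction and initialization time is amortized over those many runs.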
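As a minimal sketch of the "compile once, use many times" pattern described above: the class and method names (DomainMatcher, IsDomainAddress) and the domain200 pattern are illustrative assumptions, not code from the question or the blog post. The regex is built once in a static field, so the compilation cost is paid a single time and every later call reuses the same instance.

```csharp
using System;
using System.Text.RegularExpressions;

static class DomainMatcher
{
    // Hypothetical example pattern. The Regex is constructed once per process,
    // so the RegexOptions.Compiled cost is paid here and not on each call.
    private static readonly Regex Address = new Regex(
        @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain200\.com$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Every call reuses the same compiled instance.
    public static bool IsDomainAddress(string text)
    {
        return Address.IsMatch(text);
    }
}

static class Program
{
    static void Main()
    {
        // Prints True: the whole string is an address at the expected domain.
        Console.WriteLine(DomainMatcher.IsDomainAddress("address@domain200.com"));
        // Prints False: the pattern is anchored with ^ and $, so leading text prevents a match.
        Console.WriteLine(DomainMatcher.IsDomainAddress("some random text, address@domain200.com"));
    }
}
```

Constructing a new compiled Regex inside a hot loop would pay the compilation cost on every iteration, which is the opposite of the intended use.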

I created a much simpler test that will show you that compiled regular expressions are unquestionably faster than uncompiled ones.

   // Requires: using System; using System.Diagnostics; using System.Text.RegularExpressions;
   const int NumIterations = 1000;
   const string TestString = "some random text with email address, address@domain200.com";
   const string Pattern = "^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain0\\.\\.com$";
   private static Regex NormalRegex = new Regex(Pattern, RegexOptions.IgnoreCase);
   private static Regex CompiledRegex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
   private static Regex DummyRegex = new Regex("^.$");

   static void Main(string[] args)
   {
       var DoTest = new Action<string, Regex, int>((s, r, count) =>
           {
               Console.Write("Testing {0} ... ", s);
               Stopwatch sw = Stopwatch.StartNew();
               for (int i = 0; i < count; ++i)
               {
                   bool isMatch = r.IsMatch(TestString + i.ToString());
               }
               sw.Stop();
               Console.WriteLine("{0:N0} ms", sw.ElapsedMilliseconds);
           });

       // Make sure that DoTest is JITed
       DoTest("Dummy", DummyRegex, 1);
       DoTest("Normal first time", NormalRegex, 1);
       DoTest("Normal Regex", NormalRegex, NumIterations);
       DoTest("Compiled first time", CompiledRegex, 1);
       DoTest("Compiled", CompiledRegex, NumIterations);

       Console.WriteLine();
       Console.Write("Done. Press Enter:");
       Console.ReadLine();
   }

Setting NumIterations to 500 gave me this:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 1 ms
Testing Compiled first time ... 13 ms
Testing Compiled ... 1 ms

With 5 million iterations, I get:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 17,232 ms
Testing Compiled first time ... 17 ms
Testing Compiled ... 15,299 ms

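Here you can see that the compiled regular expression is at least 10% faster than the uncompiled version.

Interestingly, if RegexOptions.IgnoreCase is removed from the regular expression, the results for 5 million iterations are even more striking: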

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 12,869 ms
Testing Compiled first time ... 14 ms
Testing Compiled ... 8,332 ms

Here, the compiled regular expression is 35% faster than the uncompiled one.
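To make the amortization concrete, here is a rough break-even estimate computed from the timings above. It is a sketch under a stated assumption: the "Compiled first time" figure is treated as the one-time compilation cost, although it also includes one match and is limited by millisecond resolution. The helper name BreakEven.Iterations is hypothetical.

```csharp
using System;

static class BreakEven
{
    // Estimates how many matches are needed before the one-time compile cost
    // is repaid by the per-match time saved. All times are in milliseconds.
    public static double Iterations(double compileMs, double interpretedMs,
                                    double compiledMs, double iterations)
    {
        double savedPerMatchMs = (interpretedMs - compiledMs) / iterations;
        return compileMs / savedPerMatchMs;
    }
}

static class BreakEvenDemo
{
    static void Main()
    {
        // With IgnoreCase: 17 ms first run, 17,232 ms vs 15,299 ms over 5 million matches.
        Console.WriteLine(Math.Round(BreakEven.Iterations(17, 17232, 15299, 5000000)));
        // Without IgnoreCase: 14 ms first run, 12,869 ms vs 8,332 ms over 5 million matches.
        Console.WriteLine(Math.Round(BreakEven.Iterations(14, 12869, 8332, 5000000)));
    }
}
```

Under that assumption the break-even point is roughly 44,000 matches with IgnoreCase and roughly 15,000 without it, which is consistent with the 500-iteration run showing no benefit from compilation.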

In my opinion, the blog post you reference is simply a flawed test.

Source: https://stackoverflow.com/questions/6004819