[ZBX-15428] PCRE regexp seems use much more CPU resources than POSIX Created: 2019 Jan 09  Updated: 2024 Apr 10  Resolved: 2019 Jan 18

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 4.0.3
Fix Version/s: 4.0.4rc1, 4.2.0alpha3, 4.2 (plan)

Type: Problem report Priority: Critical
Reporter: Oleksii Zagorskyi Assignee: Vladislavs Sokurenko
Resolution: Fixed Votes: 1
Labels: pcre, regexps
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 8.42_with_study_but_jit_not_available.png     File ZBX-15428-2.diff     File ZBX-15428-3.4.diff     PNG File pcre.png     PNG File pcre_8.0_with_study.png     PNG File pcre_8.42_from_package_study_without_compile.png     PNG File posix.png     PNG File posix_compile_each_time.png     PNG File regexp_performance.png     PNG File regexp_posix.png     PNG File results.png     PNG File study_and_compile_each_execution.png     PNG File with_pcre_study.png     PNG File with_pcre_study_and_jit_exec.png     PNG File without_pcre_study.png    
Issue Links:
Duplicate
is duplicated by ZBX-15572 history syncer 100% CPU Closed
Team: Team A
Sprint: Sprint 48, Jan 2019
Story Points: 2

 Description   

Looks like PCRE regexps use much more CPU resources than regexp in previous zabbix versions!

We use regexps like

"some|one|other|another"

in trigger and its a trouble.
When we change the trigger expression to evaluate value 4 times by str() function, it helped, but it's a workaround and still cannot be used for all cases.

This should be investigated and addressed.

 

There is also suggestion to move to pcre2  in ZBXNEXT-4922, it potentially might help.



 Comments   
Comment by Glebs Ivanovskis [ 2019 Jan 09 ]

Title says that "PCRE is slower than POSIX ERE", but Decsription compares PCRE to multiple str() calls. I'm pretty sure that POSIX ERE will be slower than equivalent str() calls as well. Where is the proof that PCRE is slower than POSIX ERE?

Comment by Oleksii Zagorskyi [ 2019 Jan 09 ]

Glebs, we did some quick tests, which proved that.
But we need to perform more precise and clean tests, describe them, evaluate and then decide what to do.

Comment by Glebs Ivanovskis [ 2019 Jan 09 ]

Slightly better workaround would be to use global regular expression with Expression type "Any character string included":

Any character string included - match any substring from a delimited list. The delimited list includes a comma (,), a dot (.) or a forward slash (/).

Underneath there is the same strcmp() function as in str().

Comment by Vladislavs Sokurenko [ 2019 Jan 09 ]

Here is CPU usage of all execution when using PCRE:

And POSIX:

Comment by Glebs Ivanovskis [ 2019 Jan 09 ]

What was 100%?

Comment by Vladislavs Sokurenko [ 2019 Jan 09 ]

What was 100%?

I have another picture that might provide some insights about 100% (reading lots of values from cache), but it was long time ago and I might need to repeat tests again to refresh in my memory, main point is that changing to 4 regular expressions or str solved the issue for the user.

Comment by Glebs Ivanovskis [ 2019 Jan 09 ]

Maybe some performance can be gained using pcre_study()?

Maybe Zabbix should optimize trigger expressions like

{host:item.count(1m,<pattern>,regexp)}>3

and stop evaluation of count() early because exact value (100500 or 100501) does not matter?

Maybe Zabbix server should at least log slow triggers?

Comment by Vladislavs Sokurenko [ 2019 Jan 09 ]

Use case here was to find if there are one of 4 strings in line received from log item and it was done for each incoming value with range of 5 minutes, this causes lots of pcre_exec.
You suggestion about using global expression type "Any character string included" should have best possible performance, even using 4 regular expressions would probably(probably because it would involve lots of compilation) be much more faster than using "|" operator but have not tested this yet.

I think we should try pcre2 and see if it solves this issue first and then take it from there.

Comment by Vladislavs Sokurenko [ 2019 Jan 09 ]

Here is with study (actually study took 0% as it is done only when compiling but same compiled expression is reused):

Here is with study and jit_exec

Here is PCRE 8.42 without study:

POSIX:

PCRE 8.0 with study but JIT not available:

PCRE 8.42 with study but JIT not available:

PCRE 8.42 from package with only extra requested by study:

 

Study and compile each time when executing PCRE:

Study and compile each time when executing POSIX:

Actual costs can be provided on demand

Some info:
https://www.pcre.org/original/doc/html/pcrejit.html

https://www.pcre.org/original/doc/html/pcrepartial.html

Comment by Andris Zeila [ 2019 Jan 09 ]

Another option would be to make patterns from the start of the input data, for example with syslog you can try

^[A-Z][a-z]+ +?[0-9]+ [0-9]+:[0-9]+:[0-9]+ [a-z]+ (some|one|other|another)

or even

^.\{21}(some|one|other|another)

Quick tests showed more than 10x performance increase.

 

Comment by Andris Mednis [ 2019 Jan 09 ]

Five invaluable techniques to improve regex performance might help.

Comment by Vladislavs Sokurenko [ 2019 Jan 10 ]

Please try attached patch ZBX-15428-2.diff it will request the JIT compiler to generate code for complete matches. This should be faster if compiled regular expression is used for lots of values for example trigger:

{Zabbix server:foo.regexp(some|one|other|another,5m)}=0

And foo item has 10000 values of length 300 bytes each

zalex_ua Thanks, trying to test on affected zabbix server.

Comment by Glebs Ivanovskis [ 2019 Jan 10 ]

PCRE 8.0 without study:

Do you mean PCRE2? (10.x)

Comment by Vladislavs Sokurenko [ 2019 Jan 10 ]

Sorry, this was with study but without JIT, so using study without JIT also gives performance improvement.

Comment by Vladislavs Sokurenko [ 2019 Jan 10 ]

Fixed in development branch:
svn://svn.zabbix.com/branches/dev/ZBX-15428

Comment by Glebs Ivanovskis [ 2019 Jan 12 ]

Call graphs look cool, but are difficult to read information from. A summary table with numbers like here would be perfect, I think.

Comment by Vladislavs Sokurenko [ 2019 Jan 13 ]

Dear cyclone, thanks allot for your suggestions, pcre_study() does perform better based on graphs, regarding summary tables, I think there are allot of regex engine comparisons available online already, not sure if there will be anything more to add.

Comment by Andris Zeila [ 2019 Jan 15 ]

Tested the following patterns:

  1. one|two|three|four
  2. ^[A-Z][a-z]+ [0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [a-z]+ one|two|three|four
  3. ^.{21}(one|two|three|four)
  4. ^exact match$
  5. substring
  6. number [0-9]+
  7. ^[A-Z][a-z]+ [0-9] [0-9]:[0-9]:[0-9]+ [a-z]+ log information
  8. ORACLE initialization or shutdown in progress|ORACLE not available|TNS:connection closed|TNS:Connect timeout occurred

Tested 50k iterations with the following data

Jan 15 09:36:30 wbox acvpnagent[1703]: Function: getProfilePath File: ../../vpn/CommonCrypt/Certificates/NSSCertUtils.cpp Line: 567 Invoked Function: getpwnam Return Code: -31457275 (0xFE200005) Description : CERTSTORE_ERROR_NULL_POINTER
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: InitNSS File: ../../vpn/CommonCrypt/Certificates/NSSCertUtils.cpp Line: 397 Invoked Function: CNSSCertUtils::getProfilePath Return Code: -31457275 (0xFE200005) Description: CERTSTORE_ERROR_NULL_POINTER
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: CNSSCertStore File: ../../vpn/CommonCrypt/Certificates/NSSCertStore.cpp Line: 76 Invoked Function: CNSSCertUtils::InitNSS Return Code: -31457275 (0xFE200005) Description: CERTSTORE_ERROR_NULL_POINTER
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: addNSSStore File: ../../vpn/CommonCrypt/Certificates/CollectiveCertStore.cpp Line: 1766 Invoked Function: CNSSCertStore::CNSSCertStore Return Code: -31457275  (0xFE200005) Description: CERTSTORE_ERROR_NULL_POINTER
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: OpenStores File: ../../vpn/CommonCrypt/Certificates/CollectiveCertStore.cpp Line: 423 Invoked Function: CCollectiveCertStore::addNSSStore Return Code: -31457275 (0xFE200005) Description: CERTSTORE_ERROR_NULL_POINTER
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: VerifyCertName File: ../../vpn/CommonCrypt/Certificates/VerifyServerName.cpp Line: 150 Certificate name verification has failed.  Server Name: 129.33.247.60 Subject Alternate Name(s): vpnisv.isv.ihost.com
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: VerifyCertName File: ../../vpn/CommonCrypt/Certificates/OpenSSLCertUtils.cpp Line: 1421 Invoked Function: CVerifyServerName::VerifyCertName Return Code: -313 91725 (0xFE210013) Description: CERTIFICATE_ERROR_VERIFY_NAME_FAILED
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: analyzeHttpResponse File: ../../vpn/Agent/NetEnvironment.cpp Line: 1596 Invoked Function: CCertHelper::VerifyServerCertificate Return Code: -31391725 (0xFE210013) Description: CERTIFICATE_ERROR_VERIFY_NAME_FAILED server name: 1.1.1.1
Jan 15 09:36:30 wbox acvpnagent[1703]: Server certificate validation failed with the following errors:  #011Certificate does not match the server name.
Jan 15 09:36:30 wbox acvpnagent[1703]: Function: TestAccessToSG File: ../../vpn/Agent/NetEnvironment.cpp Line: 1298 Invoked Function: CNetEnvironment::analyzeHttpResponse Return Code: -28966897 (0xFE46000F) Description: NETENVIRONMENT_ERROR_CERT_VERIFICATION_FAILED:The server cert verification performed after the HTTPS probe has failed
Jan 15 09:36:32 wbox acvpnagent[1703]: Function: OnOpenRequestComplete File: ../../vpn/Common/IP/HttpProbeAsync.cpp Line: 304 Invoked Function: CHttpSessionAsync::OnOpenRequestComplete Return Code: -31588307 (0xFE1E002D) Description: SOCKETTRANSPORT_ERROR_CONNECT_CANCELED:An asynchronous connection has been canceled during its initiation.
Jan 15 09:36:32 wbox acvpnagent[1703]: Function: logProbeFailure File: ../../vpn/Agent/NetEnvironment.cpp Line: 1437 Invoked Function: CHttpProbeAsync::SendProbe Return Code: -27066354 (0xFE63000E) Description: HTTP_PROBE_ASYNC_ERROR_CANNOT_CONNECT HTTP (host: 1.1.1.1)
Jan 15 09:36:32 wbox acvpnagent[1703]: Function: TestAccessToSG File: ../../vpn/Agent/NetEnvironment.cpp Line: 1390 Invoked Function: CNetEnvironment::analyzeHttpResponse Return Code: -28966899 (0xFE46000D) Description: NETENVIRONMENT_ERROR_PROBE_INCOMPLETE:Network Probe could not contact target
Jan 15 09:36:32 wbox acvpnagent[1703]: Function: testNetwork File: ../../vpn/Agent/NetEnvironment.cpp Line: 783 Invoked Function: CNetEnvironment::IsSGAccessible Return Code: -28966899 (0xFE46000D) Description: NETENVIRONMENT_ERROR_PROBE_INCOMPLETE:Network Probe could not contact target
Jan 15 09:36:32 wbox acvpnagent[1703]: Function: TestNetEnv File: ../../vpn/Agent/NetEnvironment.cpp Line: 268 Incomplete probe count is 3
Jan 15 09:36:32 wbox acvpnagent[1703]: Function: TestNetEnv File: ../../vpn/Agent/NetEnvironment.cpp Line: 370 Network accessible. Retesting connectivity to the secure gateway in 30 seconds.
Jan 15 09:37:03 wbox acvpnagent[1703]: Function: getDnsConfiguration File: ../../vpn/Common/Utility/NetInterface_unix.cpp Line: 1253 nm_ip4_config_get_domains: Unable to get domain list.
Jan 15 09:37:03 wbox acvpnagent[1703]: Function: getDnsConfiguration File: ../../vpn/Common/Utility/NetInterface_unix.cpp Line: 1266 nm_ip6_config_get_nameservers: Unable to get name server list.
Jan 15 09:37:03 wbox acvpnagent[1703]: Function: EnumerateInterfaces File: ../../vpn/Common/Utility/NetInterface_unix.cpp Line: 1385 Invoked Function: CNetInterface::getDnsConfiguration Return Code: -32636923 (0xFE0E0005) Description: NETINTERFACE_ERROR_NULL_POINTER eno1
Jan 15 09:37:03 wbox acvpnagent[1703]: Function: getDnsConfiguration File: ../../vpn/Common/Utility/NetInterface_unix.cpp Line: 1198 nm_ip4_config_get_domains: Unable to get domain list.

Total 1m matches performed, batch size is the number of matches without regular expression recompilation.

Comment by Glebs Ivanovskis [ 2019 Jan 15 ]

Looks like users will have to choose themselves which strategy to use... Which is somewhat logical if users can direct access to regexp engine and can use any pattern imaginable.

Comment by Andris Zeila [ 2019 Jan 15 ]

Study seems to have minimal perormance impact in simple cases and gives real improvement in some corner cases. While JIT seem to be better with batches of 8-32 values, depending on pattern. For trigger expressions a future optimization could be switching to JIT if regexp match has to be performed on 30+ values.

wiper: Just to clarify, we are using pcre_study without jit now.

Comment by Vladislavs Sokurenko [ 2019 Jan 15 ]

Fixed in:

  • pre-4.0.4rc1 r88673
  • pre-4.2.0alpha3 (trunk) r88674
Comment by Glebs Ivanovskis [ 2019 Feb 01 ]

Similar issue: ZBX-15572.

Generated at Fri Apr 26 17:50:36 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.