[#ZBX-15899] Inconsistent behavior of regex for UTF-8 input

[ZBX-15899] Inconsistent behavior of regex for UTF-8 input Created: 2019 Mar 29 Updated: 2024 Apr 10 Resolved: 2019 Jun 08
Status:	Closed
Project:	ZABBIX BUGS AND ISSUES
Component/s:	Agent (G), Frontend (F), Proxy (P), Server (S)
Affects Version/s:	4.2.0rc2
Fix Version/s:	4.0.10rc1, 4.2.4rc1, 4.4.0alpha1, 4.4 (plan)

Type:

Problem report

Priority:

Minor

Reporter:

Vjaceslavs Bogdanovs

Assignee:

Andris Mednis

Resolution:

Fixed

Votes:

Labels:

None

Remaining Estimate:

Not Specified

Time Spent:

Not Specified

Original Estimate:

Not Specified

Attachments:

test0.png

test1.png

test2.png

test3.png

test5.png

Team:

Team A

Sprint:

Sprint 50 (Mar 2019), Sprint 51 (Apr 2019), Sprint 52 (May 2019), Sprint 53 (Jun 2019)

Story Points:

Description

Currently Regex matching is not consistent between input strings and Zabbix components.
For the test we can use the same pattern "(.)(.)(.)(.)" (without quotes).

This pattern works with "abcd" on both Server and Frontend:
This pattern works with "абвг" on Frontend (well, it still splits UTF sequences into separate bytes, but it doesn't produce the error), but fails on Server:

Currently, in order for regex to work correctly on Server (as well as on Agent and Proxy), utf flag should be set like so "(*UTF)(.)(.)(.)(.)" (without quotes):

It is expected for Server to work with UTF-8 by default.

Comments

Comment by Andris Mednis [ 2019 Apr 30 ]

Another observation - If preprocessing described above is used for log item with "абвг" then it results in error when inserting into DB:

 21479:20190430:161232.568 query [txnlev:1] [insert into history_log (itemid,clock,ns,timestamp,source,severity,value,logeventid) values (28566,1556629952,74650973,0,'',0,'<D0>, ',0);
]
 21479:20190430:161232.569 [Z3005] query failed: [0] PGRES_FATAL_ERROR:ERROR:  invalid byte sequence for encoding "UTF8": 0xd0 0x2c
 [insert into history_log (itemid,clock,ns,timestamp,source,severity,value,logeventid) values (28566,1556629952,74650973,0,'',0,'<D0>, ',0);
]
 21479:20190430:161232.569 query [insert into history_log (itemid,clock,ns,timestamp,source,severity,value,logeventid) values (28566,1556629952,74650973,0,'',0,'<D0>, ',0);
] failed, setting transaction as failed
 21479:20190430:161232.569 commit called on failed transaction, doing a rollback instead

Comment by Vladislavs Sokurenko [ 2019 May 14 ]

Description of the task seem a bit misleading as it mentions that Zabbix server work differently from frontend, while it actually don't.
Expression:

(.)(.)(.)(.)

Test string:

абвг

It says result is TRUE

Yes that is correct but so is such expression:
Expression:

(.)(.)(.)(.)(.)(.)(.)(.)

Test string:

абвг

So basically it is matching 4 bytes but if you supply 8 it will match them too.

Please confirm ?

vjaceslavs Double checked everything. Looks like PHP treats regex the same way as server (breaks string in the midle of UTF sequences).

vso this solve the issue for the frontend part:

diff --git a/frontends/php/include/classes/regexp/CGlobalRegexp.php b/frontends/php/include/classes/regexp/CGlobalRegexp.php
index 1cb115b2d4..417605e1d4 100644
--- a/frontends/php/include/classes/regexp/CGlobalRegexp.php
+++ b/frontends/php/include/classes/regexp/CGlobalRegexp.php
@@ -165,7 +165,7 @@ class CGlobalRegexp {
        private static function buildRegularExpression(array $expression) {
                $expression['expression'] = str_replace('/', '\/', $expression['expression']);
 
-               $pattern = '/'.$expression['expression'].'/m';
+               $pattern = '/'.$expression['expression'].'/mu';
                if (!$expression['case_sensitive']) {
                        $pattern .= 'i';
                }

As andris mentioned this:

 u (PCRE_UTF8)
    This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

Comment by Vladislavs Sokurenko [ 2019 May 14 ]

(3) [D] There are performance considerations if we are forcing UTF-8 see

SECURITY CONSIDERATIONS

If you are using PCRE in a non-UTF application that permits users to
supply arbitrary patterns for compilation, you should be aware of a
feature that allows users to turn on UTF support from within a pattern,
provided that PCRE was built with UTF support. For example, an 8-bit
pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
which interprets patterns and subjects as strings of UTF-8 characters
instead of individual 8-bit characters. This causes both the pattern
and any data against which it is matched to be checked for UTF-8 valid-
ity. If the data string is very long, such a check might use suffi-
ciently many resources as to cause your application to lose perfor-
mance.

One way of guarding against this possibility is to use the
pcre_fullinfo() function to check the compiled pattern's options for
UTF. Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF
option at compile time. This causes an compile time error if a pattern
contains a UTF-setting sequence.

If your application is one that supports UTF, be aware that validity
checking can take time. If the same data string is to be matched many
times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
and subsequent matches to save redundant checks.

Another way that performance can be hit is by running a pattern that
has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE pro-
vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
ture in the pcreapi page.

vjaceslavs We should force UTF-8 because Zabbix is an UTF application. It is like taking into consideration that supporting both upper case and lower case letters is slower that supporting only one set. This could be true, but our solution should work for everyone.

vso updated to documentation sub issue, this should be mentioned in upgrade notes in case anyone runs into issues.

andris It was decided to not enforce UTF-8. CLOSED

Comment by Andris Mednis [ 2019 Jun 05 ]

Available in:

4.0.9rc1 cdd1f4b98ff
4.2.3rc1 95d2a2b9b72
4.4.0alpha1(master) 3149a40f810

Generated at Fri Apr 26 04:30:50 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.

[ZBX-15899] Inconsistent behavior of regex for UTF-8 input Created: 2019 Mar 29 Updated: 2024 Apr 10 Resolved: 2019 Jun 08