[#ZBX-15721] Log text garbled problem non UTF-8 "encoding" parameter in log[] key on HP-UX11

[ZBX-15721] Log text garbled problem non UTF-8 "encoding" parameter in log[] key on HP-UX11 Created: 2019 Feb 25 Updated: 2024 Apr 10 Resolved: 2019 Apr 09
Status:	Closed
Project:	ZABBIX BUGS AND ISSUES
Component/s:	Agent (G), Server (S)
Affects Version/s:	4.0.4
Fix Version/s:	3.0.27rc1, 4.0.7rc1, 4.2.1rc1, 4.4.0alpha1, 4.4 (plan)

Type:	Problem report
Priority:	Blocker
Reporter:	Kim Jongkwon
Assignee:	Viktors Tjarve
Resolution:	Fixed
Votes:
Labels:	encoding
Remaining Estimate:	Not Specified
Time Spent:	Not Specified
Original Estimate:	Not Specified

Issue Links:
Duplicate
Sub-task
depends on	~~ZBX-18635~~	Log text garbled problem non UTF-8 "e...	Closed

Team:	Team A
Sprint:	Sprint 49 (Feb 2019), Sprint 50 (Mar 2019), Sprint 51 (Apr 2019)
Story Points:

Description

Zabbix Server 4.0.4 (CentOS7/MariaDB)
Log history text from HP-UX agents (encoding: eucJP) is garbled.

1. Agent 2.2.19 (HP-UX11) => Server 2.2.19 (CentOS7) was OK.
2. Agent 2.2.19 => Server 4.0.4 (after server update) - Japanese characters were garbled.
3. Agent 4.0.3 (after agent update) => Server 4.0.4 - same problem, so updating the agent does not solve it.

We are investigating this problem now.

Comments

Comment by Kodai Terashima [ 2019 Feb 25 ]

Moved from ZBX-15692. I set the reporter, assignee and fix version(s) to match the original ticket, and moved some comments.

Other status on original issue was:

Status: In progress
Team: Team A
Sprint: Sprint 49 (Feb 2019)
Story points: 3

Comment by Kodai Terashima [ 2019 Feb 25 ]

JKKim

Story:
This problem was first discovered with the Zabbix 2.2.19 agent on HP-UX11 after the CentOS7 server was updated to Zabbix 4.0.4. (It does not happen when the server is also Zabbix 2.2.19.)
So we tested again with Agent 4.0.3 (HP-UX 11) => Server 4.0.4 (CentOS7) and found the same problem.
Also, 4.0.3 -> 2.2.19 was fine. I think this problem occurs only with Zabbix Server 4.0.4, which is the strange point.

FYI:
This issue has been identified only with HP-UX11 and eucJP, so I think some behavior of the iconv library might be the problem. I don't know why it does not occur on CentOS 7 even when the same "eucJP" log file is used: Zabbix Agent 4.0.3 (CentOS7) -> Zabbix Server 4.0.4 (CentOS7) is OK.

Comment by Kodai Terashima [ 2019 Feb 25 ]

cyclone

This is not a server issue, I'm 100% sure. Agent is sending identical (identically invalid) data in all cases, both 2.2 and 4.0 agent are affected. The issue came up after server upgrade probably because ~~MariaDB is involved. I've already seen several encoding-related tickets where MariaDB and its lack of UTF-8 validation was the root cause.~~ changes were made to JSON decoding in ~~ZBX-13782~~.

I think that there is something wrong in how the agent converts valid UTF-8 data into a JSON string. Have a look at the 6th Japanese character after "Normal:" → "ピ" (U+30D4). In UTF-8 it is represented as a 3-byte sequence: 0xe3 0x83 0x94. This is a valid Unicode character and not a control character, so it is perfectly valid in a JSON string. But instead of copying it as is, Zabbix agent decides to encode 0x94 as if it was a separate byte... And inserts "\u0094" in JSON...

Instead of

... e3 83 94 ...

there is a byte sequence

... e3 83 5c 75 30 30 39 34 ...
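
The two byte sequences above can be reproduced with a toy encoder. This is only a sketch: escape_bytewise() and looks_like_control() are hypothetical names, not Zabbix functions, and the predicate hard-codes the single observation from the dump (0x94 treated as control, 0xe3 and 0x83 not), because the thread does not establish how iscntrl() classified bytes on the HP-UX box.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: stands in for a misbehaving iscntrl(). It flags the ASCII
 * control range, DEL, and the one byte the dump shows being escaped. */
static int looks_like_control(unsigned char c)
{
    return c <= 0x1f || 0x7f == c || 0x94 == c;
}

/* Escape `in` byte by byte into `out`, ignoring UTF-8 sequence boundaries.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t escape_bytewise(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    assert(NULL != in && NULL != out && 0 < out_size);

    for (; '\0' != *in; in++)
    {
        unsigned char c = (unsigned char)*in;

        if (looks_like_control(c))
        {
            /* six characters: backslash, 'u', four hex digits */
            assert(n + 6 < out_size);
            n += (size_t)snprintf(out + n, out_size - n, "\\u%04x", (unsigned)c);
        }
        else
        {
            assert(n + 1 < out_size);
            out[n++] = (char)c;
        }
    }

    out[n] = '\0';
    return n;
}
```

Feeding it the three bytes e3 83 94 yields the eight bytes e3 83 5c 75 30 30 39 34 shown above: the last byte of the character is replaced by the six ASCII characters of "\u0094".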

Problem is in __zbx_json_stringsize() and __zbx_json_insstring() which encode strings byte by byte not paying attention to UTF-8 sequences and using iscntrl() function to detect control characters. Unicode defines two ranges of control characters:

U+0000 - U+001F
U+007F - U+009F

I don't know what iscntrl() means by "control characters", I suspect it may be locale-dependent (seen in ~~ZBX-13186~~) and it may classify some of valid UTF-8 bytes (e.g. 0x80 - 0x9F are valid non-first UTF-8 bytes) as control.

Attached [^test.c] prints ranges of character codes which are considered control:

$ LC_ALL=en_US.utf8 ./a.out 
0 - 31
127 - 127
$ LC_ALL=en_US ./a.out 
0 - 31
127 - 159

It is a mistake to encode such valid UTF-8 bytes as control characters using "\u..." JSON string escapes.
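
The attached test.c is not reproduced in this export. A helper along the following lines would produce that kind of listing; it is a guess at the approach, not the attachment itself, and format_cntrl_ranges() is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Write the inclusive ranges of byte values 0..255 that iscntrl() accepts
 * under the currently active LC_CTYPE locale into `buf`, one
 * "first - last" line per range. Returns the number of characters written. */
size_t format_cntrl_ranges(char *buf, size_t buf_size)
{
    size_t used = 0;
    int c, start = -1;

    assert(NULL != buf && 0 < buf_size);
    buf[0] = '\0';

    /* c == 256 is a sentinel step that closes a range ending at 255 */
    for (c = 0; c <= 256; c++)
    {
        int is_ctl = (c < 256 && 0 != iscntrl(c));

        if (is_ctl && -1 == start)
        {
            start = c;
        }
        else if (!is_ctl && -1 != start)
        {
            int w = snprintf(buf + used, buf_size - used, "%d - %d\n", start, c - 1);

            if (w < 0 || (size_t)w >= buf_size - used)
            {
                buf[used] = '\0';  /* out of space: drop the partial line */
                break;
            }

            used += (size_t)w;
            start = -1;
        }
    }

    return used;
}
```

To mimic the runs above, a main() would have to call setlocale(LC_ALL, "") before this helper so that LC_ALL from the environment takes effect. Without that call a C program starts in the "C" locale regardless of environment variables.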

Comment by Kodai Terashima [ 2019 Feb 25 ]

JKKim

Thanks Glebs!
Your advice is very helpful in tracking the issues.
So now, simply, let's check the "Zabbix server update problem" (from the user's point of view).

First I checked Zabbix-Server 3.4.9 => 3.4.10

zabbix-server 3.4.9 : OK
zabbix-server 3.4.10 : text cut-off

Yes, so I have now found one more case (text is cut off starting with zabbix-server 3.4.10),
and it is most likely related to ~~ZBX-13782~~.

Second, I checked Zabbix-Server 4.0.3 => 4.0.4

zabbix-server 4.0.3 : still text cut-off
zabbix-server 4.0.4 : text garbled

This is probably closely related to ~~ZBX-15224~~.

Comment by Kodai Terashima [ 2019 Feb 25 ]

JKKim

/src/libs/zbxjson/json.c
zbx_json_decode_character() was changed in 3.4.10 (~~ZBX-13782~~)

Okay, now it's clear that this fix is related to the issue.
As a simple check, I put the 3.4.9 source into 4.0.4 like this:

                case 't':
                        bytes[0] = '\t';
                        break;
                case 'u': // JUST CHECK - return to source code from 3.4.9 - added by kim
                        *p += 3; /* "u00" */
                        bytes[0] = zbx_hex2num(**p) << 4;
                        bytes[0] += zbx_hex2num(*(++*p));
                        break;
                default:
                        break;

After recompiling the server, the garbled text problem disappears.
Please check this issue in detail.
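
For reference, the restored 3.4.9 branch can be read as the standalone helper below. This is a sketch, not Zabbix code: hex_digit_value() is a hypothetical stand-in for zbx_hex2num(), and decode_u00xx_to_byte() is an invented name that takes a pointer to the 'u' of the escape instead of advancing a cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for zbx_hex2num(): value of one hexadecimal digit.
 * Non-hex input yields 0 here; the real function may behave differently. */
static int hex_digit_value(char c)
{
    if ('0' <= c && c <= '9')
        return c - '0';
    if ('a' <= c && c <= 'f')
        return c - 'a' + 10;
    if ('A' <= c && c <= 'F')
        return c - 'A' + 10;
    return 0;
}

/* `esc` points at the 'u' of a "\u00XX" escape. Like the quoted 3.4.9
 * logic, skip "u00" and fold the last two hex digits into a single byte. */
unsigned char decode_u00xx_to_byte(const char *esc)
{
    assert(NULL != esc && 'u' == esc[0]);

    return (unsigned char)((hex_digit_value(esc[3]) << 4) + hex_digit_value(esc[4]));
}
```

With this decoding, the agent's stray "\u0094" comes back as the raw byte 0x94, so e3 83 94 is reassembled on the server side. That may be why older servers masked the agent-side problem.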

Comment by Kodai Terashima [ 2019 Feb 25 ]

cyclone

Judging by ChangeLog, ~~ZBX-15224~~ is the most likely candidate to explain the difference between 4.0.3 and 4.0.4. By the way, "text cut-off" was what user initially complained about in ~~ZBX-15224~~ and as I understand it was caused by specifics of MariaDB.

As a workaround, try to set locale for agent to <something>.UTF-8, <something>.utf8, POSIX or C, whatever is available on that box.

Comment by Viktors Tjarve [ 2019 Mar 13 ]

I was able to reproduce this issue upgrading 3.0.26rc1 to 4.0.4 without making any changes to agent - I kept agent 3.0.26rc1.

Comment by Vladislavs Sokurenko [ 2019 Mar 20 ]

Looks like issue is due to char being signed

Possible fix:

Index: src/libs/zbxjson/json.c
===================================================================
--- src/libs/zbxjson/json.c	(revision 91289)
+++ src/libs/zbxjson/json.c	(working copy)
@@ -162,9 +162,9 @@
 
 static size_t	__zbx_json_stringsize(const char *string, zbx_json_type_t type)
 {
-	size_t		len = 0;
-	const char	*sptr;
-	char		buffer[] = {"null"};
+	size_t			len = 0;
+	const unsigned char	*sptr;
+	char			buffer[] = {"null"};
 
 	for (sptr = (NULL != string ? string : buffer); '\0' != *sptr; sptr++)
 	{

Comment by Glebs Ivanovskis [ 2019 Mar 20 ]

vso, why? I believe the issue is not platform-dependent. Has viktors.tjarve reproduced the issue on HP-UX?

Comment by Vladislavs Sokurenko [ 2019 Mar 20 ]

I corrected the comment, cyclone; please see the possible fix. char is signed on both HP-UX and Linux, so the code must be changed.

Comment by Glebs Ivanovskis [ 2019 Mar 20 ]

Why do you call it "possible fix"? If it works, it is not just "possible", but "actual". If it doesn't, it's not a "fix" at all.

Could you please elaborate how unsigned solves the issue?

Comment by Vladislavs Sokurenko [ 2019 Mar 20 ]

The issue was reproduced and changing to unsigned seemed to help; viktors.tjarve will retest tomorrow. I call it "possible" because I did not test it myself, and because this value is later passed to the iscntrl() function, which works differently between platforms. My guess is that it does not like negative values and treats them as being in the control range because they are less than 0x1f. We will get back to you with the results tomorrow, thanks!

Comment by Glebs Ivanovskis [ 2019 Mar 20 ]

I'm not buying this explanation because, as I said earlier, the UTF-8 sequence causing troubles is 0xe3 0x83 0x94, but only 0x94 gets replaced with \u0094 despite 0x83 being "negative" as well.

Comment by Vladislavs Sokurenko [ 2019 Mar 20 ]

I don't think we should delve into the reasons for the wrong behavior; we should simply pass unsigned char, as mentioned here:

These functions check whether c, which must have the value of an unsigned char or EOF, falls into a certain character class according to the current locale.

Comment by Glebs Ivanovskis [ 2019 Mar 20 ]

My opinion is that passing any char to iscntrl() is incorrect, because

this function is locale-dependent and
wasn't meant to be used to check for Unicode control characters.

Some of Unicode control characters are not even single-byte, take U+008f for example - it is encoded as 0xc2 0x8f in UTF-8 and iscntrl() returns 0 for both of them (at least on my machine).
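
The byte values quoted in this thread can be checked with a small UTF-8 encoder. This is a minimal sketch under the usual UTF-8 rules; utf8_encode() is a hypothetical helper and it does not reject surrogate code points.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point `cp` as UTF-8 into `out` (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(NULL != out);

    if (cp <= 0x7f)
    {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7ff)
    {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp <= 0xffff)
    {
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff)
    {
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }

    return 0;
}
```

U+008F comes out as c2 8f and U+30D4 as e3 83 94, so every continuation byte lies in 0x80 - 0xBF and a per-byte check cannot tell the control character U+008F from the tail of an ordinary character.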

Comment by Andris Mednis [ 2019 Mar 26 ]

I think we have to find all places where simple 'char c' is passed to <ctype.h>-family functions (like iscntrl(c), isalnum(c) etc.) and fix them with type cast iscntrl((unsigned char)c), isalnum((unsigned char)c) etc.

On Linux it happens to work correctly even without casting to 'unsigned char', but on HP-UX it does not.
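
A minimal sketch of the proposed cast, assuming nothing about where it would be applied in the Zabbix sources; ctype_arg() and is_cntrl_char() are hypothetical wrappers. The C standard requires the argument of the <ctype.h> functions to be representable as unsigned char or equal to EOF, so passing a negative plain char is undefined behavior.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Convert a plain char (possibly negative where char is signed) to the
 * 0..UCHAR_MAX value that <ctype.h> functions are defined for. */
int ctype_arg(char c)
{
    int v = (unsigned char)c;

    assert(0 <= v && v <= UCHAR_MAX);
    return v;
}

/* iscntrl() with the cast applied; returns 1 or 0. */
int is_cntrl_char(char c)
{
    return 0 != iscntrl(ctype_arg(c));
}
```

The cast only makes the call well defined. The result still depends on the active locale, which is the separate problem raised earlier in the thread.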

This is separate from Gleb's point, which concerns another problem: <ctype.h> functions cannot work with multibyte UTF-8 characters.

Comment by Andris Mednis [ 2019 Mar 30 ]

cyclone wrote:

My opinion is that passing any char to iscntrl() is incorrect, because

this function is locale-dependent and

wasn't meant to be used to check for Unicode control characters.

Some of Unicode control characters are not even single-byte, take U+008f for example - it is encoded as 0xc2 0x8f in UTF-8 and iscntrl() returns 0 for both of them (at least on my machine).

I agree. Colleagues have already pointed out that "The JavaScript Object Notation (JSON) Data Interchange Format", when it talks about control characters to be escaped, mentions only U+0000 through U+001F. So even 7F is not required to be escaped, not to mention "Unicode control characters". It seems that iscntrl() should be replaced with a direct comparison against the 00...1F interval in both our encoder and decoder.
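
A minimal sketch of an encoder built on that direct comparison. It is not the code committed to the ZBX-15721 branch; json_escape() is a hypothetical name, and for brevity it writes every control character as \u00XX rather than using short forms such as \n or \t.

```c
#include <assert.h>
#include <stdio.h>

/* Escape `in` for use inside a JSON string. Only '"', '\\' and bytes below
 * 0x20 are escaped; everything else, including all bytes of multibyte
 * UTF-8 sequences and 0x7f, is copied unchanged. No locale is consulted.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t json_escape(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    assert(NULL != in && NULL != out && 0 < out_size);

    for (; '\0' != *in; in++)
    {
        unsigned char c = (unsigned char)*in;

        if ('"' == c || '\\' == c)
        {
            assert(n + 2 < out_size);
            out[n++] = '\\';
            out[n++] = (char)c;
        }
        else if (c < 0x20)
        {
            assert(n + 6 < out_size);
            n += (size_t)snprintf(out + n, out_size - n, "\\u%04x", (unsigned)c);
        }
        else
        {
            assert(n + 1 < out_size);
            out[n++] = (char)c;
        }
    }

    out[n] = '\0';
    return n;
}
```

With this check the bytes e3 83 94 pass through untouched, while a byte such as 0x01 still becomes "\u0001".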

Comment by Vladislavs Sokurenko [ 2019 Mar 30 ]

I agree, please comment

But we would need unsigned char anyway.

Comment by Glebs Ivanovskis [ 2019 Mar 30 ]

Definition from http://json.org seems to be in agreement that only U+0000 — U+001F need special treatment:

character
    '0020' . '10ffff' - '"' - '\'
    '\' escape

Comment by Andris Mednis [ 2019 Apr 01 ]

Created new development branch svn://svn.zabbix.com/branches/dev/ZBX-15721-30 (to fix starting from 3.0).

Comment by Andris Mednis [ 2019 Apr 02 ]

cyclone wrote:

My opinion is that passing any char to iscntrl() is incorrect, because

this function is locale-dependent and

As far as I understand, yes, iscntrl() is locale-dependent, but Zabbix does not use setlocale(), so iscntrl() operates in the standard "C" locale and classifies only 0-1F and 7F as control characters, regardless of the environment settings LANG and LC_*. So the damage from iscntrl() seems not as large as initially estimated.

Comment by Glebs Ivanovskis [ 2019 Apr 02 ]

Zabbix does not use setlocale()

Zabbix doesn't, but there are libraries, ODBC drivers, loadable modules and whatnot... operating in the same context. See ~~ZBX-11512~~.

Comment by Andris Mednis [ 2019 Apr 02 ]

Thanks, Gleb!

I forgot about libraries...

Comment by Andris Mednis [ 2019 Apr 03 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-15721-30

Comment by Viktors Tjarve [ 2019 Apr 04 ]

Successfully tested branch svn://svn.zabbix.com/branches/dev/ZBX-15721-30

Comment by Viktors Tjarve [ 2019 Apr 08 ]

Released in:

3.0.27rc1 r92205
4.0.7rc1 r92206
4.2.1rc1 r92207
4.4.0alpha1 r92209

Generated at Thu Apr 25 06:58:48 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.
