[ZBX-7936] wonky connection to agent due to "Interrupted system call" (cloned - reopened) Created: 2014 Mar 14  Updated: 2019 Dec 10

Status: Confirmed
Project: ZABBIX BUGS AND ISSUES
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Incident report Priority: Trivial
Reporter: Lorenzo Bagni Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: timeout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Fedora 17 x86_64, selinux in permissive mode, kernel 3.5.2-1.fc17.x86_64


Attachments: File ZBX-7936.tar.gz     PNG File mysqlperf.png     PNG File sysperf.png    

 Description   

Server log file is full of following:

 30907:20120821:225740.503 Zabbix agent item [net.tcp.service[smtp]] on host [localhost] failed: another network error, wait for 15 seconds
 30907:20120821:225755.505 resuming Zabbix agent checks on host [localhost]: connection restored
 30904:20120821:225822.907 Zabbix agent item [net.tcp.service[smtp]] on host [localhost] failed: first network error, wait for 15 seconds
 30907:20120821:225840.602 Zabbix agent item [net.tcp.service[smtp]] on host [localhost] failed: another network error, wait for 15 seconds
 30907:20120821:225855.605 resuming Zabbix agent checks on host [localhost]: connection restored

There are problems collecting items (there are random gaps in graphs).

There's nothing relevant in agent log.
smtp server is running:

$ sudo -u zabbix telnet localhost smtp
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

Selinux is in permissive mode.

Running in debug more adds more information:

Server log:

 30679:20120821:225421.743 sleeping for 3 seconds
 30702:20120821:225421.947 In collect_selfmon_stats()
 30702:20120821:225421.948 End of collect_selfmon_stats()
 30702:20120821:225421.948 sleeping for 1 seconds
 30682:20120821:225422.239 Item [localhost:net.tcp.service[smtp]] error: Get value from agent failed: ZBX_TCP_READ() failed: [4] Interrupted system call
 30682:20120821:225422.239 In zabbix_log()
 30682:20120821:225422.239 In DCconfig_get_items() hostid:0 key:'zabbix[log]'
 30682:20120821:225422.239 End of DCconfig_get_items():0
 30682:20120821:225422.239 End of zabbix_log()
 30682:20120821:225422.239 End of get_value():NETWORK_ERROR
 30682:20120821:225422.239 In deactivate_host() hostid:10017 itemid:79 type:0
 30682:20120821:225422.239 deactivate_host() errors_from:0 available:1
 30682:20120821:225422.239 query [txnlev:1] [begin;]
 30682:20120821:225422.239 query [txnlev:1] [update hosts set errors_from=1345586062,disable_until=1345586077 where hostid=10017]
 30682:20120821:225422.240 query [txnlev:1] [commit;]
 30682:20120821:225422.306 Zabbix agent item [net.tcp.service[smtp]] on host [localhost] failed: first network error, wait for 15 seconds
 30682:20120821:225422.306 In zabbix_log()
 30682:20120821:225422.306 In DCconfig_get_items() hostid:0 key:'zabbix[log]'
 30682:20120821:225422.306 End of DCconfig_get_items():0
 30682:20120821:225422.306 End of zabbix_log()
 30682:20120821:225422.306 End of deactivate_host()
 30682:20120821:225422.306 End of get_values()
 30682:20120821:225422.307 poller #4 spent 3.068150 seconds while updating 1 values
 30682:20120821:225422.307 In DCconfig_get_poller_nextcheck() poller_type:0
 30682:20120821:225422.307 End of DCconfig_get_poller_nextcheck():1345586064
 30682:20120821:225422.307 sleeping for 2 seconds

agent log:

29584:20120821:225421.275 Processing request.
 29584:20120821:225421.275 Requested [proc.num[,,run]]
 29584:20120821:225421.295 Sending back [1]
 29585:20120821:225421.615 In send_buffer() host:'127.0.0.1' port:10051 values:0/100
 29585:20120821:225421.615 End of send_buffer():SUCCEED
 29585:20120821:225421.615 Sleeping for 1 second(s)
 29581:20120821:225421.713 In update_cpustats()
 29581:20120821:225421.713 End of update_cpustats()
 29583:20120821:225422.239 TCP expect network error: ZBX_TCP_READ() failed: [4] Interrupted system call
 29583:20120821:225422.239 Sending back [0]
 29583:20120821:225422.239 Got signal [signal:13(SIGPIPE),sender_pid:29583]. Ignoring ...
 29583:20120821:225422.239 Process listener error: ZBX_TCP_WRITE() failed: [32] Broken pipe
 29585:20120821:225422.615 In send_buffer() host:'127.0.0.1' port:10051 values:0/100
 29585:20120821:225422.615 End of send_buffer():SUCCEED
 29585:20120821:225422.615 Sleeping for 1 second(s)
 29581:20120821:225422.713 In update_cpustats()
 29581:20120821:225422.713 End of update_cpustats()
 29585:20120821:225423.615 In send_buffer() host:'127.0.0.1' port:10051 values:0/100
 29585:20120821:225423.615 End of send_buffer():SUCCEED
 29585:20120821:225423.615 Sleeping for 1 second(s)
 29581:20120821:225423.714 In update_cpustats()
 29581:20120821:225423.714 End of update_cpustats()
 29585:20120821:225424.615 In send_buffer() host:'127.0.0.1' port:10051 values:0/100
 29585:20120821:225424.615 End of send_buffer():SUCCEED


 Comments   
Comment by Lorenzo Bagni [ 2014 Mar 14 ]

Hi,
I've faced the same issue on zabbix 2.2.2 compiled and running on Centos 6.5.
The agentd logfile often report a broken pipe

645:20140311:142753.583 collector [processing data]
645:20140311:142753.583 In update_cpustats()
645:20140311:142753.583 End of update_cpustats()
645:20140311:142753.583 collector [idle 1 sec]
647:20140311:142753.918 Requested [system.cpu.intr]
647:20140311:142753.918 Sending back [0]
647:20140311:142753.918 Got signal [signal:13(SIGPIPE),sender_pid:647]. Ignoring ...
647:20140311:142753.918 Process listener error: ZBX_TCP_WRITE() failed: [32] Broken pipe
647:20140311:142753.918 listener #2 [waiting for connection]
647:20140311:142753.918 listener #2 [processing request]
649:20140311:142754.102 In send_buffer() host:'127.0.0.1' port:10051 values:0/100
649:20140311:142754.102 End of send_buffer():SUCCEED
649:20140311:142754.102 active checks #1 [idle 1 sec]

This issue looks like a problem on zabbix_agentd binded on 127.0.0.1 with this server directive:

Server=127.0.0.1,zbx01.mydomain.local,zbx02.mydomain.local,192.168.8.9,192.168.8.11

the system resolve the domain correctly, but if I switch the FQDN to ip all come to work correctly (see attach).

Comment by Aleksandrs Saveljevs [ 2014 Mar 14 ]

Is the timeout on the agent lower than the timeout on the server? Obviously, if the two are equal or server's timeout is less, then this behavior is expected.

Also, the reason that the connection from the agent to SMTP server might timeout is because SMTP server might be doing reverse DNS resolution on the connection. If it takes too long, the agent will timeout, even though SMTP server is running. Similar problem with DNS might be with the "Server=" directive you mentioned in the end.

Comment by Lorenzo Bagni [ 2014 Mar 14 ]

Hi,
the server timeout is 30, the agent timeout 15, so I'm not in expected behaviour.

I've cloned this issue because I've faced a similar problem of ZBX-5476, but in my case the issue is targered on agent, it seems to be in difficult when a direct DNS resolution is required.
The check (passive) requested from server to agent are inherited by 'Template App MySQL' and 'Template OS Linux'.

The DNS servers on my environment are internal on private network, so there aren't possible problems like network connectivity or availability.

Comment by Aleksandrs Saveljevs [ 2014 Mar 14 ]

Oh, so the issue description comes from the original issue and does not describe your particular installation and your problem has nothing to do with SMTP servers. Graphs are yours, though.

To clarify, do I understand correctly that the problem is with the agent receiving SIGPIPE when it tries to send data back? Do I also understand correctly that it only occurs if you have domain names in "Server" directive and if there are no domain names there, then there is no such problem? Is your agent compiled with IPv6 enabled?

Comment by Lorenzo Bagni [ 2014 Mar 14 ]

Yes, sorry for misunderstanding, but I've found the ZBX-5476 and I thought that was similar problem.

My environment is a zabbix server ( 2.2.2 on Centos 6.5) with mysql, apache, snmp, snmptrapd, snmptt, postfix and I've compiled on the machine the source with config:

./configure --enable-server --with-mysql --with-net-snmp --with-jabber --with-libcurl --enable-agent --enable-ipv6 --with-ldap --with-unixodbc --with-openipmi --with-ssh2

I've used the templates 'Template App MySQL' and 'Template OS Linux' to monitor itself, and when I specify on client:

Server=127.0.0.1,zbx01.mydomain.local,zbx02.mydomain.local,192.168.8.9,192.168.8.11

I've got the issue

but if I use the ip:

Server=127.0.0.1,192.168.8.8,192.168.8.10,192.168.8.9,192.168.8.11

I haven't got the issue.

zabbix_server and zabbix_agentd runs on: zbx02.mydomain.local 192.168.8.10
so logically 127.0.0.1 zbx02.mydomain.local 192.168.8.10 are the same thing.

Comment by Lorenzo Bagni [ 2014 Mar 14 ]

Info sent

Comment by Aleksandrs Saveljevs [ 2014 Mar 17 ]

The issue does not reproduce itself locally. Could you please run agent listener under strace?

In order to find the PID of the listener process, you can either take a look at the top of agent log file and find a line like " 21449:20140317:125635.703 agent #1 started [listener #1]" (here, PID is 21449) or run "ps aux | grep zabbix_agentd.*listener". After that run "strace -p <PID> -s 128 -ttt -v" and attach both agent log (with DebugLevel=4) and strace output here.

Comment by Lorenzo Bagni [ 2014 Mar 17 ]

Hi,
my issue was on production environment, but I have a testing env, let me try to reproduce the issue on test
and I'll send all information.

If I fail in testing environment I'll switch the configuration in production.

Comment by Lorenzo Bagni [ 2014 Mar 26 ]

Agent and strace logs

Comment by Lorenzo Bagni [ 2014 Mar 26 ]

Hi, in attachment there are the zabbix_agentd.log (DebugLevel=4) and the starce log as requested.

[root@zbx ~]# grep 'Broken pipe' /tmp/zabbix_agentd.log
 17839:20140326:110234.102 Process listener error: ZBX_TCP_WRITE() failed: [32] Broken pipe
 17838:20140326:110237.208 Process listener error: ZBX_TCP_WRITE() failed: [32] Broken pipe
 17840:20140326:110240.159 Process listener error: ZBX_TCP_WRITE() failed: [32] Broken pipe
 17839:20140326:110244.112 Process listener error: ZBX_TCP_WRITE() failed: [32] Broken pipe
 17838:20140326:110247.226 Process listener error: ZBX_TCP_WRITE() failed: [32] Broken pipe

these are the latest 5 Broken pipe, also grabbed from strace.

Generated at Tue Apr 16 13:24:28 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.