  1. ZABBIX BUGS AND ISSUES
  2. ZBX-4640

another network error retrying to get a value

    • Incident report
    • Resolution: Cannot Reproduce
    • Critical
    • None
    • 1.9.7 (beta), 1.9.8 (beta)
    • Server (S)
    • None
    • Debian x32 & x64

      I have problems with retrying to get a value.
      First found in version 1.9.7 (fresh install); upgrading to 1.9.9 didn't fix it. Tested on 2 servers with lots of clients.
      I found some fixed issues about similar errors, but it seems they are not completely fixed.

      Logs are populated with the following:
      17808:20120210:161916.259 resuming Zabbix agent checks on host [lari-casino]: connection restored
      17821:20120210:161923.164 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
      17796:20120210:161925.241 Zabbix agent item [system.swap.size[,pfree]] on host [lari-poker] failed: first network error, wait for 20 seconds
      17751:20120210:161929.219 Zabbix agent item [system.cpu.load[,avg15]] on host [lari-casino] failed: first network error, wait for 20 seconds
      17812:20120210:161949.182 resuming Zabbix agent checks on host [lari-casino]: connection restored
      17782:20120210:161958.749 Zabbix agent item [vm.memory.size[total]] on host [gw.viaden.com] failed: first network error, wait for 20 seconds
      17782:20120210:162005.730 Zabbix agent item [vfs.fs.size[/,pfree]] on host [lari-casino] failed: first network error, wait for 20 seconds
      17819:20120210:162018.302 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
      17816:20120210:162025.407 resuming Zabbix agent checks on host [lari-casino]: connection restored
      17785:20120210:162102.411 Zabbix agent item [vfs.fs.inode[/home,pfree]] on host [lari-casino] failed: first network error, wait for 20 seconds
      17806:20120210:162122.346 resuming Zabbix agent checks on host [lari-casino]: connection restored
      17714:20120210:162124.508 Zabbix agent item [system.cpu.util[,idle,avg1]] on host [gw.viaden.com] failed: first network error, wait for 20 seconds
      17793:20120210:162126.288 Zabbix agent item [vm.memory.inactive] on host [gw.viaden.com] failed: another network error, wait for 20 seconds
      17726:20120210:162140.805 Zabbix agent item [system.cpu.load[,avg1]] on host [lari-casino] failed: first network error, wait for 20 seconds
      17805:20120210:162146.459 resuming Zabbix agent checks on host [gw.viaden.com]: connection restored
      17805:20120210:162200.672 resuming Zabbix agent checks on host [lari-casino]: connection restored

      Note that the keys and servers are different.
      Tested different UnreachableDelay values (from 5 to 20).
      This is not a connectivity issue: at the same time I tested with multiple zabbix_get runs and got no errors at all.

      The agent log with debug enabled shows no errors; it always sends data back.
      tcpdump shows a lot of RST flags from the server. That doesn't look like a proper TCP session end.

      I tried to disable checks on a host, wait until the queue is cleared, then start monitoring again. It doesn't help.
      Agent and server restarts sometimes help, sometimes not. The issue occurs randomly and can disappear after some time (usually a few hours), or stay for a long time.
      There are no strange spikes on the internal Zabbix monitoring graphs (except housekeeping tasks); network activity and polling are stable.
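To see which hosts and error stages dominate a log like the one above, the failure lines can be tallied per host. The following is only a minimal sketch: the regular expression is derived from the log lines quoted in this report, and the function name `count_network_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the server log lines quoted in this report, e.g.:
# 17796:20120210:161925.241 Zabbix agent item [key] on host [name] failed: first network error, wait for 20 seconds
ERROR_RE = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3}) "
    r"Zabbix agent item \[(?P<key>.+)\] on host \[(?P<host>[^\]]+)\] "
    r"failed: (?P<kind>first|another) network error"
)


def count_network_errors(lines):
    """Count 'first' and 'another' network errors per host.

    Returns a Counter keyed by (host, kind). Lines that are not
    network-error lines (for example 'resuming ... connection restored')
    are ignored.
    """
    counts = Counter()
    for line in lines:
        match = ERROR_RE.match(line)
        if match:
            counts[(match.group("host"), match.group("kind"))] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the log excerpt in this report.
    sample = [
        "17796:20120210:161925.241 Zabbix agent item [system.swap.size[,pfree]] on host [lari-poker] failed: first network error, wait for 20 seconds",
        "17793:20120210:162126.288 Zabbix agent item [vm.memory.inactive] on host [gw.viaden.com] failed: another network error, wait for 20 seconds",
        "17808:20120210:161916.259 resuming Zabbix agent checks on host [lari-casino]: connection restored",
    ]
    for (host, kind), n in count_network_errors(sample).items():
        print(host, kind, n)
```

Feeding the whole server log through this function would show whether the errors cluster on a few hosts or are spread evenly, which helps tell a per-host configuration problem from a general network one.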

          Alexei Vladishev added a comment - - edited

          That's interesting. Could it be related to some Linux kernel limits on the TCP stack? Do you see anything suspicious in syslog or kern.log?

          Oleksii Zagorskyi added a comment - - edited

          Anton, which error do you see for the host when it's in an error state in the GUI (temporarily disabling - host unavailable)? Maybe the error is "Invalid port number[]"?
          Are those items inherited from a template?
          If yes, try "Delete and clear" for the template on one host and then link it again.

          I saw very similar host behavior when some of their items did not have interfaceid defined (they had NULL for interfaceid) <- because we are using trunk

          Here is the behavior when one agent item has NULL for interfaceid:
          44813:20120211:133759.283 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: first network error, wait for 15 seconds
          44816:20120211:133814.187 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: another network error, wait for 15 seconds
          44816:20120211:133829.215 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: another network error, wait for 15 seconds
          44816:20120211:133844.235 Zabbix agent item [vfs.fs.size[D:\,pfree]] on host [it5] failed: another network error, wait for 15 seconds
          44816:20120211:133859.302 temporarily disabling Zabbix agent checks on host [it5]: host unavailable
          44816:20120211:134000.520 enabling Zabbix agent checks on host [it5]: host became available

          Try to execute this SQL statement to find zabbix agent items with NULL interfaceid on real hosts:
          SELECT DISTINCT h.host,i.itemid,i.name,i.key_,i.interfaceid FROM items i, hosts h WHERE i.type=0 AND i.interfaceid IS NULL AND h.status=0 AND i.hostid=h.hostid;
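The same check can be tried out without touching a live database. The sketch below is an illustration only: an in-memory SQLite database stands in for the real Zabbix database, the two tables contain just the columns the query uses, and the sample rows and the function name `find_items_without_interface` are invented. The query is the one above, rewritten with an explicit JOIN.

```python
import sqlite3


def find_items_without_interface(conn):
    """Return Zabbix agent items on monitored hosts whose interfaceid is NULL."""
    query = """
        SELECT DISTINCT h.host, i.itemid, i.name, i.key_, i.interfaceid
        FROM items i
        JOIN hosts h ON h.hostid = i.hostid
        WHERE i.type = 0              -- Zabbix agent items
          AND i.interfaceid IS NULL
          AND h.status = 0            -- real (monitored) hosts, not templates
    """
    return conn.execute(query).fetchall()


# Simplified stand-in schema: only the columns referenced by the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hosts (hostid INTEGER PRIMARY KEY, host TEXT, status INTEGER);
    CREATE TABLE items (itemid INTEGER PRIMARY KEY, hostid INTEGER, type INTEGER,
                        name TEXT, key_ TEXT, interfaceid INTEGER);
    -- Invented sample data: one real host and one template (status 3).
    INSERT INTO hosts VALUES (1, 'it5', 0), (2, 'Template_Example', 3);
    INSERT INTO items VALUES
        (10, 1, 0, 'Free disk space', 'vfs.fs.size[/,pfree]', NULL),
        (11, 1, 0, 'CPU load', 'system.cpu.load[,avg1]', 7),
        (12, 2, 0, 'CPU load', 'system.cpu.load[,avg1]', NULL);
""")

for row in find_items_without_interface(conn):
    print(row)
```

Only item 10 is reported here: item 11 has an interface, and item 12 belongs to a template, where a NULL interfaceid is expected.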

          Anton Ryabchenko added a comment - - edited

          I found a NULL interfaceid on one of the hosts on one of the servers. There were items that were added directly to the host, then I copied them to the template, but interfaceid is NULL.
          Even after 'Unlink and clear' and linking again, interfaceid is NULL, and the errors continue.
          But the second server has no items with NULL interfaceid.

          All the items I use are inherited from templates.
          I see no errors in the general Linux logs and, as I mentioned before, it's definitely not a network/OS/limit issue - I checked these things first.
          Sorry, I cannot figure out where I should see the state in the GUI (Monitoring -> Hosts shows an availability icon only)

          Oleksii Zagorskyi added a comment -

          > Sorry, cannot figure out where should I see the state in GUI (Monitoring -> Hosts shows availability icon only)
          Yes, that's the place. When a host (agent, SNMP, etc.) is in an error state, you can move the mouse over the RED icon and you will see a tooltip with the error description.
          Which error do you see?

          Note that interfaceid=NULL for items in a template is OK.

          Try to delete this item (the copied and re-copied one) and create it manually again.

          Anton Ryabchenko added a comment -

          In the GUI for one server I see the following 2 errors:
          "...error (111): connection refused" (see attachment for full error)
          "Invalid port number[]"
          Both errors appear on the JMX hosts (we use the zapcat agent and port 10052 for monitoring).
          We have used zapcat for weeks without any problem, but sometimes we had white spaces in our graphs, I thought it was caused by performance. But it's not.

          Trying to reproduce it on another server to see the error.

          Oleksii Zagorskyi added a comment -

          > We have used zapcat for weeks without any problem, but sometimes we had white spaces in our graphs, I thought it was caused by performance. But it's not.
          No, that happened because the hosts (zabbix-agent type I mean) were periodically disabled because of even a single problematic item (without interfaceid in the DB)

          So you have to fix all the problems that are generating the error "Invalid port number[]"
          Maybe the interface is not empty, but the port is empty?

          Yes, I can generate the error "Invalid port number []" when "interface.port" is empty (deleted manually in the DB)

          Try this:
          mysql> SELECT * FROM interface WHERE port="";
          and its result should be empty.

          Anton Ryabchenko added a comment -

          I have
          SELECT * FROM interface WHERE port="";
          Empty set (0.00 sec)

          Oleksii Zagorskyi added a comment -

          I have no more ideas, sorry.

          Anton Ryabchenko added a comment -

          Wow, it seems the issue has disappeared from one of the servers!
          I have unlinked the templates with cleanup and linked them back on the problematic hosts.
          I have no more NULLs in the DB and no errors in the logs.
          Thanks a lot!

          P.S. Problems on the other server may indeed be network issues; there are some hosts monitored over the Internet from the US to Asia.

          Oleksii Zagorskyi added a comment -

          Be careful with trunk next time, but thanks for using it in production

          Issue closed.

            Assignee: Unassigned
            Reporter: Anton Ryabchenko (sineex)
            Votes: 0
            Watchers: 2