  1. ZABBIX BUGS AND ISSUES
  2. ZBX-346

Problem during history update causes child to crash, master to hang

    • Type: Incident report
    • Resolution: Cannot Reproduce
    • Priority: Major
    • 1.8
    • None
    • None
    • None
    • Master node: Zabbix 1.4.5 running on Ubuntu 7. Separate PostgreSQL 8 database server.
      Child node: Zabbix 1.4.5 running on RHEL 4. MySQL 4 running on the child node.

      http://www.zabbix.com/forum/showthread.php?t=9277

      Below is an excerpt from the linked forum posts, reflecting my current situation.

      The slave node that stops is running RHEL 4 ES with MySQL 4.1.20 (the standard MySQL version for RHEL 4).
      Apparently it sometimes thinks 'MySQL has gone away' during a history update from the slave (node 3) to the master (node 1).
      This in itself is bad, I think, BUT the master also goes into a problematic state, in which it:

      • Doesn't receive data from active agents
      • Doesn't do agent checks
      • Doesn't do SNMP traps
      • Still does trigger evaluation
      • Still does trigger action processing

      ------------8<-------------- Slave node log directly after crash
      7848:20080328:105617 NODE 3: Sending new history_uint of node 3 to node 1 datalen 817
      7848:20080328:105626 NODE 3: Sending new history of node 3 to node 1 datalen 191
      7848:20080328:105627 NODE 3: Sending new history_uint of node 3 to node 1 datalen 197
      7848:20080328:105636 NODE 3: Sending new history of node 3 to node 1 datalen 47
      7848:20080328:105815 Error while receiving answer from Node [1]
      7848:20080328:105815 Query::select id,itemid,clock,value from history_uint_sync where nodeid=3 order by id limit 10000
      7848:20080328:105815 Query failed:MySQL server has gone away [2006]
      7826:20080328:105815 One child process died. Exiting ...
      7826:20080328:105817 ZABBIX Server stopped
      ------------8<-------------- Slave node log directly after crash

      -----------8<-------------- Master node log directly after crash
      105478 9450:20080328:155631 Active parameter [windows.ssm[Disk.SmartUsed,G,Percent]] is not supported by agent on host [nlyehvgdc1ms171]
      105479 9451:20080328:155631 NODE 1: Received history from node 2 for node 2 datalen 414
      105480 9452:20080328:155631 Active parameter [system.cpu.util[,system,avg1]] is not supported by agent on host [nlvocl6]
      105481 9452:20080328:155636 Active parameter [system.cpu.util[,idle,avg1]] is not supported by agent on host [nlvu009]
      105482 9451:20080328:155641 Active parameter [net.if.in[eth3,bytes]] is not supported by agent on host [nlvud138]
      105483 9456:20080328:155654 Expression [({100100000030421}#1)|({100100000030420}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030421]]
      105484 9456:20080328:155654 Expression [({100100000030459}#1)|({100100000030458}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030459]]
      105485 9456:20080328:155654 Expression [({100100000030425}#1)|({100100000030424}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030425]]
      105486 9456:20080328:155654 Expression [({100100000030309}#1)|({100100000030308}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030309]]
      105487 9456:20080328:155654 Expression [({100100000030313}#1)|({100100000030312}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030313]]
      105488 9456:20080328:155654 Expression [({100100000030315}#1)|({100100000030314}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030315]]
      105489 9456:20080328:155655 Expression [({100100000030641}#1)|({100100000030640}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030641]]
      105490 9456:20080328:155655 Expression [({100100000030415}#1)|({100100000030414}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030415]]
      105491 9456:20080328:155655 Expression [({100100000030463}#1)|({100100000030462}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030463]]
      105492 9457:20080328:155655 Timeout while answering request
      105493 9457:20080328:155656 Timeout while connecting to [nlvg153:161]
      105494 9457:20080328:155656 Host [nlvg153] will be checked after 60 seconds
      105495 9452:20080328:155656 Active parameter [system.cpu.util[,nice,avg1]] is not supported by agent on host [nlxcips36]
      105496 9456:20080328:155657 Expression [({100100000035985}#1)|({100100000035984}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000035985]]
      105497 9456:20080328:155728 Expression [({100100000030421}#1)|({100100000030420}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030421]]
      105498 9456:20080328:155728 Expression [({100100000030459}#1)|({100100000030458}=1)] cannot be evaluated [Unable to get value for functionid [1001 00000030459]]
      -----------8<-------------- Master node log directly after crash

      Apart from the bad function ID relation messages, it does nothing.

      I think the following log line is the one where the connection is lost:
      105492 9457:20080328:155655 Timeout while answering request
      The timestamps in the two logs differ because the nodes are located in different time zones.
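
To line up the two excerpts, the log prefixes can be parsed and compared with a small script. This is only a minimal sketch: it assumes the `PID:YYYYMMDD:HHMMSS` prefix format visible in the logs above, and the names `parse_stamp`, `largest_gap` and the `utc_offset_hours` parameter are hypothetical helpers, not part of Zabbix. The report does not state the actual time zones, so the offset has to be supplied by whoever runs it.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the "PID:YYYYMMDD:HHMMSS" prefix seen in the log excerpts.
# A leading line number (as in the master log) is skipped automatically.
STAMP = re.compile(r"\b(\d+):(\d{8}):(\d{6})\b")


def parse_stamp(line, utc_offset_hours=0):
    """Return (pid, timezone-aware datetime) for a log line, or None if no prefix is found."""
    m = STAMP.search(line)
    if m is None:
        return None
    pid, day, clock = m.groups()
    tz = timezone(timedelta(hours=utc_offset_hours))  # assumed offset of the node
    when = datetime.strptime(day + clock, "%Y%m%d%H%M%S").replace(tzinfo=tz)
    return int(pid), when


def largest_gap(lines, utc_offset_hours=0):
    """Return (seconds, line_before, line_after) for the longest silence between log lines."""
    stamped = [(parse_stamp(l, utc_offset_hours), l) for l in lines]
    stamped = [(p[1], l) for p, l in stamped if p is not None]
    best = (0.0, None, None)
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, l0, l1)
    return best


# Three consecutive lines copied from the slave node log above.
slave_log = [
    "7848:20080328:105627 NODE 3: Sending new history_uint of node 3 to node 1 datalen 197",
    "7848:20080328:105636 NODE 3: Sending new history of node 3 to node 1 datalen 47",
    "7848:20080328:105815 Error while receiving answer from Node [1]",
]
gap, before, after = largest_gap(slave_log)
print(gap)  # 99.0 seconds between the last send and the error
```

On the slave excerpt this shows a 99-second silence between the last history send and the "Error while receiving answer" line. Passing each node's real UTC offset to `parse_stamp` would make timestamps from the master and slave logs directly comparable.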

      @developers
      I can accept the fact that the slave node stops (although I am 90% certain the MySQL server does not 'go away'). But the fact that the master node also goes blank is a big problem.
      Any ideas?

      Update:
      I suspect the connection between the master and slave nodes has dropped or is unstable. I can imagine that if the connection is dropped during a history update, MySQL gets the blame (since the data is read from MySQL, sent through a parser and off to the master node), but I am not sure of this.
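
One way to test this suspicion would be to probe the master from the slave at regular intervals and log failures next to the Zabbix log. The following is a rough sketch under stated assumptions: `probe` is a hypothetical helper, and the host name and port in the usage note are placeholders (10051 is the default Zabbix server listen port, but the port actually configured on the master is not given in this report).

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection to host:port.

    Returns (ok, seconds): ok is True if the connection was established
    within the timeout, seconds is how long the attempt took.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


# Example usage (hypothetical host name, assumed default port):
#   ok, seconds = probe("master.example.com", 10051)
#   print(time.strftime("%Y%m%d:%H%M%S"), "ok" if ok else "FAILED", round(seconds, 3))
```

Run from cron or a loop on the slave node, failed or slow probes that coincide with the "Error while receiving answer from Node [1]" lines would support the network explanation. A clean record would point back at the database connection.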

      For further information, feel free to email me.

            Assignee: Unassigned
            Reporter: Tom Duijf (xs-)