Uploaded image for project: 'ZABBIX BUGS AND ISSUES'
  1. ZABBIX BUGS AND ISSUES
  2. ZBX-16927

Pollers stuck waiting for response without timeout

    XMLWordPrintable

    Details

    • Type: Problem report
    • Status: Need info
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 4.2.8
    • Fix Version/s: None
    • Component/s: Proxy (P), Server (S)
    • Environment:
      Ubuntu 18.04
    • Team:
      Team A
    • Sprint:
      Sprint 58 (Nov 2019), Sprint 59 (Dec 2019), Sprint 60 (Jan 2020), Sprint 61 (Feb 2020), Sprint 62 (Mar 2020), Sprint 63 (Apr 2020), Sprint 64 (May 2020), Sprint 65 (Jun 2020), Sprint 66 (Jul 2020), Sprint 67 (Aug 2020)

      Description

      It acted the same in version 3.4 too. Tried to upgrade but it did not help.

      Steps to reproduce:

      1. Setup few SNMPv3 hosts
      2. After some time (few to several hours) notice all pollers (both unreachable and regular ones) are busy. Hosts are supposedly down.

      Result:

      see "Annotation 2019-11-15 133726.jpg"

      see "graph_pollers_busy.jpg"

      Please bear with me...

      I checked few things and this is what I found out.

      First I went to see PS output and noticed that poller and unreachable pollers descriptions do not update at all. See 'ps-pollers-getting-values.jpg'.

      Tried strace main zabbix process (with child processes), but there was no action there too. See 'strace-main-zabbix-server-with-child-processes-stuck.jpg'

      Then went to strace the pollers. All of them were stuck on select call (tried waiting for a bit) without timeout reading from descriptor 10 - a UDP socket. See 'strace-poller-process-stuck-on-select.jpg' and 'lsof-udp-fd-10.jpg'

      What's it doing? Here's a backtrace from gdb - see "gdb-poller-process-bt.jpg"

      In zabbix sources it said NETSNMP has its own timeout values, I checked there and saw this piece of code (version 5.4.4) - notice / block without timeout / comment:

      int
      snmp_synch_response_cb(netsnmp_session * ss,
                             netsnmp_pdu *pdu,
                             netsnmp_pdu **response, snmp_callback pcb)
      {
          struct synch_state lstate, *state;
          snmp_callback   cbsav;
          void           *cbmagsav;
          int             numfds, count;
          fd_set          fdset;
          struct timeval  timeout, *tvp;
          int             block;
      
          memset((void *) &lstate, 0, sizeof(lstate));
          state = &lstate;
          cbsav = ss->callback;
          cbmagsav = ss->callback_magic;
          ss->callback = pcb;
          ss->callback_magic = (void *) state;
      
          if ((state->reqid = snmp_send(ss, pdu)) == 0) {
              snmp_free_pdu(pdu);
              state->status = STAT_ERROR;
          } else
              state->waiting = 1;
      
          while (state->waiting) {
              numfds = 0;
              FD_ZERO(&fdset);
              block = NETSNMP_SNMPBLOCK;
              tvp = &timeout;
              timerclear(tvp);
              snmp_select_info(&numfds, &fdset, tvp, &block);
              if (block == 1)
                  tvp = NULL;         /* block without timeout */
              count = select(numfds, &fdset, 0, 0, tvp);
              if (count > 0) {
                  snmp_read(&fdset);
              } else {
                  switch (count) {
                  case 0:
                      snmp_timeout();
                      break;
                  case -1:
                      if (errno == EINTR) {
                          continue;
                      } else {
                          snmp_errno = SNMPERR_GENERR;    /*MTCRITICAL_RESOURCE */
                          /*
                           * CAUTION! if another thread closed the socket(s)
                           * waited on here, the session structure was freed.
                           * It would be nice, but we can't rely on the pointer.
                           * ss->s_snmp_errno = SNMPERR_GENERR;
                           * ss->s_errno = errno;
                           */
                          snmp_set_detail(strerror(errno));
                      }
                      /*
                       * FALLTHRU 
                       */
                  default:
                      state->status = STAT_ERROR;
                      state->waiting = 0;
                  }
              }
      
              if ( ss->flags & SNMP_FLAGS_RESP_CALLBACK ) {
                  void (*cb)(void);
                  cb = ss->myvoid;
                  cb();        /* Used to invoke 'netsnmp_check_outstanding_agent_requests();'
                                  on internal AgentX queries.  */
              }
          }
          *response = state->pdu;
          ss->callback = cbsav;
          ss->callback_magic = cbmagsav;
          return state->status;
      }
      

      So all my pollers seem to be stuck waiting forever for a response from UDP socket.

      After server restart it goes back to normal.

       

        Attachments

          Activity

            People

            Assignee:
            zabbix.dev Zabbix Development Team
            Reporter:
            gregolsky Grzegorz Lachowski
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Dates

              Created:
              Updated: