
Type: Problem report
Resolution: Cannot Reproduce
Priority: Trivial
Fix Version/s: None
Affects Version/s: 4.2.8
Component/s: Proxy (P), Server (S)
Labels:
- PROXY
- Poller
- SNMPv3
- Server
Environment:
Ubuntu 18.04

Sprint:
S2401-2

The behavior was the same in version 3.4. I tried upgrading, but it did not help.

Steps to reproduce:

Set up a few SNMPv3 hosts.
After some time (a few to several hours), notice that all pollers (both unreachable and regular ones) are busy. The hosts are reported as down.

Result:

see "Annotation 2019-11-15 133726.jpg"

see "graph_pollers_busy.jpg"

Please bear with me...

I checked a few things, and this is what I found.

First I looked at the ps output and noticed that the process descriptions of the pollers and unreachable pollers do not update at all. See 'ps-pollers-getting-values.jpg'.

I ran strace on the main Zabbix process (with child processes), but there was no activity there either. See 'strace-main-zabbix-server-with-child-processes-stuck.jpg'.

Then I ran strace on the pollers. All of them were stuck in a select() call with no timeout (I waited a while to confirm), reading from descriptor 10, a UDP socket. See 'strace-poller-process-stuck-on-select.jpg' and 'lsof-udp-fd-10.jpg'.

What is it doing? Here is a backtrace from gdb: see "gdb-poller-process-bt.jpg".

The Zabbix sources say that Net-SNMP has its own timeout values, so I checked there and found this piece of code (version 5.4.4). Notice the /* block without timeout */ comment:

int
snmp_synch_response_cb(netsnmp_session * ss,
                       netsnmp_pdu *pdu,
                       netsnmp_pdu **response, snmp_callback pcb)
{
    struct synch_state lstate, *state;
    snmp_callback   cbsav;
    void           *cbmagsav;
    int             numfds, count;
    fd_set          fdset;
    struct timeval  timeout, *tvp;
    int             block;

    memset((void *) &lstate, 0, sizeof(lstate));
    state = &lstate;
    cbsav = ss->callback;
    cbmagsav = ss->callback_magic;
    ss->callback = pcb;
    ss->callback_magic = (void *) state;

    if ((state->reqid = snmp_send(ss, pdu)) == 0) {
        snmp_free_pdu(pdu);
        state->status = STAT_ERROR;
    } else
        state->waiting = 1;

    while (state->waiting) {
        numfds = 0;
        FD_ZERO(&fdset);
        block = NETSNMP_SNMPBLOCK;
        tvp = &timeout;
        timerclear(tvp);
        snmp_select_info(&numfds, &fdset, tvp, &block);
        if (block == 1)
            tvp = NULL;         /* block without timeout */
        count = select(numfds, &fdset, 0, 0, tvp);
        if (count > 0) {
            snmp_read(&fdset);
        } else {
            switch (count) {
            case 0:
                snmp_timeout();
                break;
            case -1:
                if (errno == EINTR) {
                    continue;
                } else {
                    snmp_errno = SNMPERR_GENERR;    /*MTCRITICAL_RESOURCE */
                    /*
                     * CAUTION! if another thread closed the socket(s)
                     * waited on here, the session structure was freed.
                     * It would be nice, but we can't rely on the pointer.
                     * ss->s_snmp_errno = SNMPERR_GENERR;
                     * ss->s_errno = errno;
                     */
                    snmp_set_detail(strerror(errno));
                }
                /*
                 * FALLTHRU 
                 */
            default:
                state->status = STAT_ERROR;
                state->waiting = 0;
            }
        }

        if ( ss->flags & SNMP_FLAGS_RESP_CALLBACK ) {
            void (*cb)(void);
            cb = ss->myvoid;
            cb();        /* Used to invoke 'netsnmp_check_outstanding_agent_requests();'
                            on internal AgentX queries.  */
        }
    }
    *response = state->pdu;
    ss->callback = cbsav;
    ss->callback_magic = cbmagsav;
    return state->status;
}

So all my pollers seem to be stuck waiting forever for a response on a UDP socket.

After a server restart everything goes back to normal.
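
To illustrate the two select() modes involved, here is a minimal sketch. It is not Net-SNMP or Zabbix code; the helper names open_loopback_udp and wait_readable are made up for this example. A NULL timeout pointer makes select() block until data arrives, which matches what strace showed for the pollers. A non-NULL timeval bounds the wait, and select() returns 0 on expiry.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: open a UDP socket bound to an ephemeral port on
 * the loopback interface. Returns the descriptor, or -1 on error. */
int open_loopback_udp(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;          /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: wait until fd becomes readable.
 * timeout_ms < 0 passes NULL to select(), i.e. block without timeout;
 * any other value bounds the wait.
 * Returns select()'s result: 1 = readable, 0 = timed out, -1 = error. */
int wait_readable(int fd, long timeout_ms)
{
    fd_set fdset;
    struct timeval tv, *tvp = NULL;

    FD_ZERO(&fdset);
    FD_SET(fd, &fdset);

    if (timeout_ms >= 0) {
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }
    return select(fd + 1, &fdset, NULL, NULL, tvp);
}
```

Calling wait_readable(fd, 100) on a socket that nobody writes to returns 0 after about 100 ms, while wait_readable(fd, -1) on the same socket would never return. That second case is the state the pollers appear to be in.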

Attachments:
Annotation 2019-11-15 133726.jpg, 2019 Nov 15 16:45, 175 kB, Grzegorz Lachowski
ps-pollers-getting-values.jpg, 2019 Nov 15 16:48, 527 kB, Grzegorz Lachowski
strace-main-zabbix-server-with-child-processes-stuck.jpg, 2019 Nov 15 16:50, 13 kB, Grzegorz Lachowski
strace-poller-process-stuck-on-select.jpg, 2019 Nov 15 16:51, 13 kB, Grzegorz Lachowski
lsof-udp-fd-10.jpg, 2019 Nov 15 16:52, 55 kB, Grzegorz Lachowski
gdb-poller-process-bt.jpg, 2019 Nov 15 16:54, 93 kB, Grzegorz Lachowski
graph_pollers_busy.jpg, 2019 Nov 15 16:59, 198 kB, Grzegorz Lachowski
0001-.PS.-ZBX-16927-added-verbose-debug-for-snmp.patch, 2024 Jan 18 13:38, 8 kB, Dmitrijs Goloscapovs
