[ZBX-10033] On Windows agent fails to acquire lock in case lock was not released properly (possibly by other agent) Created: 2015 Oct 30  Updated: 2017 May 30  Resolved: 2015 Dec 13

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.2.10
Fix Version/s: 3.0.0alpha5

Type: Incident report Priority: Major
Reporter: Sandis Neilands (Inactive) Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: semaphores
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows Server 2008.


Attachments: abandoned.patch

 Description   

See sub-issue (133) of ZBXNEXT-1263, ZBX-10034.

Description

Agents will not start when all of the following hold:

  • multiple agents are running on the same Windows host;
  • they run in the same session (e.g. as the same user, or as services);
  • one of them fails or is closed while holding a lock (for example, the logging lock).

Workarounds

  • full system reboot;
  • closing all Zabbix agents running in the affected session, waiting for Windows to remove the unused lock, then starting agents again.

Analysis

__zbx_mutex_lock() does not explicitly handle the WAIT_ABANDONED result from WaitForSingleObject() and exits instead.
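
For illustration only, here is a minimal sketch of what handling WAIT_ABANDONED explicitly could look like. It is not the attached patch and not the actual Zabbix code: lock_result_t, classify_wait_result() and lock_mutex() are hypothetical names introduced here. The sketch relies on the documented Win32 behaviour that a WAIT_ABANDONED return still grants ownership of the mutex to the calling thread. The constants are redefined outside Windows (an assumption made purely so the classification logic compiles anywhere).

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins so the sketch compiles off Windows; values match the documented Win32 constants. */
#define WAIT_OBJECT_0	0x00000000UL
#define WAIT_ABANDONED	0x00000080UL
#endif

/* Hypothetical result type, not part of Zabbix. */
typedef enum
{
	LOCK_OK,		/* mutex acquired normally */
	LOCK_OK_ABANDONED,	/* mutex acquired, but the previous owner ended without releasing it */
	LOCK_ERROR		/* mutex not acquired (timeout, failure, unexpected value) */
}
lock_result_t;

/* Map a WaitForSingleObject() return value to a lock result. */
lock_result_t	classify_wait_result(unsigned long wait_result)
{
	switch (wait_result)
	{
		case WAIT_OBJECT_0:
			return LOCK_OK;
		case WAIT_ABANDONED:
			/* ownership is granted to the caller; the protected data may be inconsistent */
			return LOCK_OK_ABANDONED;
		default:
			return LOCK_ERROR;
	}
}

#ifdef _WIN32
/* Hypothetical wrapper: wait for the mutex and report an abandoned lock instead of exiting. */
lock_result_t	lock_mutex(HANDLE mutex)
{
	lock_result_t	res = classify_wait_result(WaitForSingleObject(mutex, INFINITE));

	if (LOCK_OK_ABANDONED == res)
		fprintf(stderr, "mutex was abandoned by its previous owner, continuing\n");

	return res;
}
#endif
```

A caller would treat LOCK_OK_ABANDONED as a successful acquisition, optionally followed by sanity checks on the protected resource, and would exit only on LOCK_ERROR.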

More info on mutexes in Windows.

Important:

  • lock namespaces are per-session (session 0 for services, other sessions for users);
  • unlike SysV locks, Windows keeps track of a lock's users and removes the lock once no thread in the session references it.


 Comments   
Comment by Sandis Neilands (Inactive) [ 2015 Oct 30 ]

The attached patch resolves the issue. Its safety still needs investigation: if ZBX_MUTEX_PERFSTAT was abandoned, we might need to do some sanity checking before continuing.

Comment by Sandis Neilands (Inactive) [ 2015 Nov 23 ]

On Windows the agent has two locks: ZBX_MUTEX_LOG and ZBX_MUTEX_PERFSTAT. A lock can end up in the abandoned state in two cases:

  1. another agent within the session has quit, aborted or been killed while holding a lock;
  2. one of the agent's own threads has quit or been killed while holding a lock.

If another agent has quit, we can safely continue using the protected resource even though the lock was abandoned, because agents do not share resources.

If one of the agent's own threads has quit, continuing is slightly riskier: the log file might be corrupted (nothing can be done about that) and/or the collector's internal Windows Performance Statistics data structures might be inconsistent. The agent quits as a whole if one of its threads has quit, so these problems should not propagate.

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-10033 .

Comment by Sandis Neilands (Inactive) [ 2015 Nov 25 ]

(1) The fix for ZBX-10034 eliminates the first issue. The second issue should never occur; the agent should exit() if it does.

sandis.neilands RESOLVED in r56883.

sasha CLOSED

Comment by Andris Zeila [ 2015 Dec 07 ]

Successfully tested

Comment by Sandis Neilands (Inactive) [ 2015 Dec 08 ]

Released in:

  • 3.0.0alpha5 r57065.