[ZBXNEXT-2354] Separate VMware statistics processing from retrieval of VMware contents Created: 2014 Jun 25  Updated: 2019 Feb 22  Resolved: 2015 Feb 12

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Proxy (P), Server (S)
Affects Version/s: None
Fix Version/s: 2.2.9, 2.4.4, 2.5.0

Type: Change Request Priority: Trivial
Reporter: Andris Zeila Assignee: Unassigned
Resolution: Fixed Votes: 6
Labels: vmware
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: JPEG File vmware-cpu-statistics.jpg     PNG File vmware-current.png     PNG File vmware-new.png     File vmware-stats.diff    
Issue Links:
Duplicate
is duplicated by ZBXNEXT-2999 The key vmware.vm.memory.size.compres... Open
is duplicated by ZBX-8261 Getting values for VMware items from ... Closed

 Description   

Currently VMware statistics (performance counters) are retrieved and stored together with the rest of VMware data - events, hypervisors and virtual machines (see vmware-current.png). On large installations this can take a lot of time (10+ minutes). As a result the statistics are refreshed only at intervals of 10 minutes or more.

This can be solved by separating performance counter processing and storage from the retrieval of the VMware event, hypervisor and virtual machine data (see vmware-new.png). It would also allow retrieving performance counters for all monitored entities (hypervisors, virtual machines) with a single request. To further improve statistics gathering, some data (like cpu usage) must also be monitored with performance counters (currently performance counters are used to monitor only network, disk and datastore statistics).

It would also make it easy to implement user-defined items that monitor hypervisor and virtual machine performance counters - for example:
vmware.vm.perfcounter[{$URL},{HOST.HOST},"cpu/usagemhz[average]"] key could be used to monitor virtual machine cpu usage (see the patch).
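
The last key parameter in the example above has the shape group/counter[rollup]. A minimal sketch of splitting such a path into its parts is shown below. It assumes that shape purely from the example key; perf_path_split() and copy_part() are hypothetical helpers and are not taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Zabbix function): copies src[0..len) into dst */
/* as a NUL-terminated string, failing if it is empty or does not fit.      */
static int	copy_part(char *dst, size_t size, const char *src, size_t len)
{
	if (0 == len || len >= size)
		return -1;

	memcpy(dst, src, len);
	dst[len] = '\0';

	return 0;
}

/* Splits "group/counter[rollup]" into three buffers of 'size' bytes each. */
/* Returns 0 on success and -1 if the path does not have that shape.       */
int	perf_path_split(const char *path, char *group, char *counter, char *rollup, size_t size)
{
	const char	*slash, *lbracket, *rbracket;

	if (NULL == (slash = strchr(path, '/')))
		return -1;

	if (NULL == (lbracket = strchr(slash + 1, '[')))
		return -1;

	/* the closing bracket must be the last character of the path */
	if (NULL == (rbracket = strchr(lbracket + 1, ']')) || '\0' != rbracket[1])
		return -1;

	if (0 != copy_part(group, size, path, (size_t)(slash - path)) ||
			0 != copy_part(counter, size, slash + 1, (size_t)(lbracket - slash - 1)) ||
			0 != copy_part(rollup, size, lbracket + 1, (size_t)(rbracket - lbracket - 1)))
	{
		return -1;
	}

	return 0;
}
```

With "cpu/usagemhz[average]" as input this yields the parts "cpu", "usagemhz" and "average"; a path without a slash or brackets is rejected.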

The attached graph (vmware-cpu-statistics.jpg) illustrates the difference between the current cpu usage statistics (retrieved with 5 minute interval) and performance counter cpu usage statistics (retrieved with 30 second interval).



 Comments   
Comment by Andris Zeila [ 2014 Jul 16 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-2354

Comment by Andris Zeila [ 2014 Aug 20 ]

Reopening to merge latest 2.2 changes. Virtual machine disk/network device discovery was rewritten in ZBX-7621 - support for the new discovery code must be added.

Comment by Andris Zeila [ 2014 Aug 25 ]

Latest 2.2 changes merged, data parsing optimizations added in r48371

Comment by Yuya Kusakabe [ 2014 Sep 07 ]

Which version will include this fix?

Comment by Andris Zeila [ 2014 Sep 11 ]

(1) vmware.vm.powerstate[] is trying to find powerState data in wrong XML location.

RESOLVED in r48943

<dimir> CLOSED

Comment by Andris Zeila [ 2014 Nov 09 ]

(2) After increasing the number of monitored virtual machines to 6000 it was apparent that XML parsing slows things down. For example vm discovery locked vmware collector for ~55 seconds because it was parsing virtual machine and hypervisor names from xml string for each of 6000 virtual machines.

At least we should parse things used in discoveries during data collection so the data retrieval does not lock vmware collector for a long time.

And because usually vmware data collection frequency is less than request frequency, it would make sense to parse all required data during data collection - so we don't have to parse the same value from xml multiple times. Also this would reduce the size requirements for vmware shared memory cache.

wiper Changed performance entity storage from vector to hashset. This gives a performance boost when handling a large number of performance entities (hypervisors/virtual machines). The value pre-parsing is out of scope and should be handled by ZBX-9038.

RESOLVED in r50594
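
The benefit of moving from a vector to a hashset is that finding one entity among thousands no longer needs a scan of all of them. A minimal sketch of the idea follows, assuming entities are identified by a (type, id) pair. All names here (perf_entity_t, entity_find(), entity_add(), PERF_SLOTS) are illustrative, and the sketch uses plain malloc() rather than the shared memory and hashset code of the actual change.

```c
#include <stdlib.h>
#include <string.h>

#define PERF_SLOTS	1024	/* illustrative bucket count, not a Zabbix constant */

typedef struct perf_entity
{
	char			*type;	/* e.g. "VirtualMachine" or "HostSystem" */
	char			*id;
	struct perf_entity	*next;	/* chain of entities sharing one bucket */
}
perf_entity_t;

typedef struct
{
	perf_entity_t	*slots[PERF_SLOTS];
}
perf_entity_set_t;

/* portable string duplicate (strdup() is not part of older C standards) */
static char	*str_dup(const char *s)
{
	size_t	len = strlen(s) + 1;
	char	*copy = malloc(len);

	if (NULL != copy)
		memcpy(copy, s, len);

	return copy;
}

/* djb2-style hash over both key parts, reduced to a bucket index */
static size_t	entity_slot(const char *type, const char *id)
{
	unsigned long	hash = 5381;

	while ('\0' != *type)
		hash = hash * 33 + (unsigned char)*type++;
	while ('\0' != *id)
		hash = hash * 33 + (unsigned char)*id++;

	return (size_t)(hash % PERF_SLOTS);
}

/* looks only at the entities in one bucket instead of scanning all of them */
perf_entity_t	*entity_find(const perf_entity_set_t *set, const char *type, const char *id)
{
	perf_entity_t	*entity;

	for (entity = set->slots[entity_slot(type, id)]; NULL != entity; entity = entity->next)
	{
		if (0 == strcmp(entity->type, type) && 0 == strcmp(entity->id, id))
			return entity;
	}

	return NULL;
}

/* returns the existing entity or inserts a new one; NULL on allocation failure */
perf_entity_t	*entity_add(perf_entity_set_t *set, const char *type, const char *id)
{
	perf_entity_t	*entity;
	size_t		slot;

	if (NULL != (entity = entity_find(set, type, id)))
		return entity;

	if (NULL == (entity = malloc(sizeof(perf_entity_t))))
		return NULL;

	entity->type = str_dup(type);
	entity->id = str_dup(id);

	if (NULL == entity->type || NULL == entity->id)
	{
		free(entity->type);
		free(entity->id);
		free(entity);
		return NULL;
	}

	slot = entity_slot(type, id);
	entity->next = set->slots[slot];
	set->slots[slot] = entity;

	return entity;
}
```

The set must be zero-initialized before use (for example by declaring it static). Adding the same (type, id) pair twice returns the entity that is already stored.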

<dimir> CLOSED

Comment by Andris Zeila [ 2014 Nov 14 ]

(3) When there is no performance data available (either not yet gathered or the performance entity is offline) the item should not become 'not supported'.

wiper RESOLVED in r50699

<dimir> CLOSED

Comment by Andris Zeila [ 2014 Nov 26 ]

(4) Shared memory leak when performance entity is being removed during vmware service update. This is especially noticeable if vmware service update failed (for example with network timeout). In this case all performance entities will be removed, resulting in a huge shared memory leak.

RESOLVED in r50851

<dimir> CLOSED

Comment by Andris Zeila [ 2014 Dec 02 ]

(5) Compilation error when trying to build without vmware (libcurl + libxml2) support

RESOLVED in r50969

<dimir> CLOSED

Comment by Andris Zeila [ 2014 Dec 17 ]

(6) Memory leak during performance entity cleanup.

RESOLVED in r51237

<dimir> CLOSED

Comment by dimir [ 2015 Jan 20 ]

(7) [PS] In check_vcenter_hv_perfcounter() and check_vcenter_vm_perfcounter() we get counterid and then call vmware_service_counter_get(), where the first thing we do is get counterid again. Perhaps we could pass counterid as an argument to vmware_service_counter_get()?

wiper RESOLVED in r51877

<dimir> CLOSED

Comment by dimir [ 2015 Jan 20 ]

(8) [PS] Suggestion to rename function zbx_vmware_service_start_monitoring() to something like zbx_vmware_service_add_perfcounter().

wiper Also decided to keep performance entity counters in a sorted vector, which allows binary searches.
RESOLVED in r51880
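
Keeping the counters sorted by id makes lookups logarithmic instead of linear. A minimal sketch using the standard qsort() and bsearch() functions is shown below. The perf_counter_t layout and the function names are assumptions made for illustration and do not reflect the actual Zabbix structures.

```c
#include <stdlib.h>

/* Illustrative counter record; the real structure in vmware.c differs. */
typedef struct
{
	unsigned long	counterid;
	double		value;
}
perf_counter_t;

/* orders counters by id; used for both sorting and searching */
static int	counter_compare(const void *a, const void *b)
{
	unsigned long	ida = ((const perf_counter_t *)a)->counterid;
	unsigned long	idb = ((const perf_counter_t *)b)->counterid;

	return (ida > idb) - (ida < idb);
}

/* must be called after counters are added and before any lookup */
void	counters_sort(perf_counter_t *counters, size_t num)
{
	qsort(counters, num, sizeof(perf_counter_t), counter_compare);
}

/* binary search by counter id; returns NULL if the id is not present */
perf_counter_t	*counters_find(perf_counter_t *counters, size_t num, unsigned long counterid)
{
	perf_counter_t	key;

	key.counterid = counterid;
	key.value = 0;

	return bsearch(&key, counters, num, sizeof(perf_counter_t), counter_compare);
}
```

The trade-off is that the vector has to be re-sorted (or the new element inserted in order) whenever a counter is added, which is acceptable when lookups are far more frequent than additions.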

<dimir> CLOSED

Comment by Andris Zeila [ 2015 Jan 20 ]

(9) If all performance entity counter values are -1, then those values are ignored (cleared). However a single counter value also could return -1 which also should be ignored rather than generating value conversion error.

wiper RESOLVED in r51744
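
One way to treat a -1 counter value as "no data" instead of a conversion error is to check for it before converting to an unsigned number. The sketch below illustrates that approach; perf_value_parse() and the status names are hypothetical and are not the code from r51744.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

typedef enum
{
	PERF_VALUE_OK,		/* a valid unsigned value was parsed */
	PERF_VALUE_NODATA,	/* the counter returned -1: skip it silently */
	PERF_VALUE_INVALID	/* anything else that is not an unsigned number */
}
perf_value_status_t;

/* Hypothetical parser for one counter value received as text. */
perf_value_status_t	perf_value_parse(const char *text, unsigned long long *value)
{
	char	*end;

	/* -1 marks a missing sample, so it is not reported as an error */
	if (0 == strcmp(text, "-1"))
		return PERF_VALUE_NODATA;

	/* reject empty input, signs and leading whitespace up front */
	if (0 == isdigit((unsigned char)*text))
		return PERF_VALUE_INVALID;

	errno = 0;
	*value = strtoull(text, &end, 10);

	if (0 != errno || '\0' != *end)
		return PERF_VALUE_INVALID;

	return PERF_VALUE_OK;
}
```

A caller would store the value only for PERF_VALUE_OK, skip the sample for PERF_VALUE_NODATA, and report a conversion error only for PERF_VALUE_INVALID.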

<dimir> CLOSED

Comment by dimir [ 2015 Jan 22 ]

(10) [PS] I'm a bit worried about multiple xmlCleanupParser() calls in the code:

vl@dimir:ZBXNEXT-2354:ZBXNEXT-2354$ egrep -rI --exclude-dir=.svn xmlCleanupParser src/zabbix_server/
src/zabbix_server/vmware/vmware.c:      xmlCleanupParser();
src/zabbix_server/vmware/vmware.c:      xmlCleanupParser();
src/zabbix_server/vmware/vmware.c:      xmlCleanupParser();
src/zabbix_server/vmware/vmware.c:      xmlCleanupParser();
src/zabbix_server/vmware/vmware.c:      xmlCleanupParser();
src/zabbix_server/vmware/vmware.c:      xmlCleanupParser();
src/zabbix_server/vmware/vmware.c:      xmlCleanupParser();
src/zabbix_server/poller/checks_simple_vmware.c:        xmlCleanupParser();

Here's what comment says about calling that function in libxml2 code:

/**
 * xmlCleanupParser:
 *
 * This function name is somewhat misleading. It does not clean up
 * parser state, it cleans up memory allocated by the library itself.
 * It is a cleanup function for the XML library. It tries to reclaim all
 * related global memory allocated for the library processing.
 * It doesn't deallocate any document related memory. One should
 * call xmlCleanupParser() only when the process has finished using
 * the library and all XML/HTML documents built with it.
 * See also xmlInitParser() which has the opposite function of preparing
 * the library for operations.
 *
 * WARNING: if your application is multithreaded or has plugin support
 *          calling this may crash the application if another thread or
 *          a plugin is still using libxml2. It's sometimes very hard to
 *          guess if libxml2 is in use in the application, some libraries
 *          or plugins may use it without notice. In case of doubt abstain
 *          from calling this function or do it just before calling exit()
 *          to avoid leak reports from valgrind !
 */

Related links:

https://git.gnome.org/browse/libxml2/tree/parser.c#n14002
https://lists.fedoraproject.org/pipermail/devel/2010-January/129117.html

wiper removed xmlCleanupParser() calls, no leaks or other problems appeared.
RESOLVED in r51884

<dimir> CLOSED

Comment by dimir [ 2015 Jan 23 ]

(11) [PS] In src/zabbix_server/vmware/vmware.c:vmware_service_process_perf_entity_data() there is a check on values if (NULL != value && NULL != counter) and it doesn't contain the check against empty instance. This may result in NULL value here perfvalue.first = vmware_shared_strdup(instance);. This situation (instance pointed to NULL) is handled however perhaps it could be worth allocating empty string for first in this case. Please check.

wiper Storing as empty strings rather than NULL values would simplify processing later.
RESOLVED in r51885
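
Storing an empty string instead of NULL means later code can compare instances with strcmp() without a NULL check. A minimal sketch of such a wrapper is shown below. instance_dup() is a hypothetical name, and the sketch allocates with malloc() where the real code uses vmware_shared_strdup() on shared memory.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of 'instance', or a copy of "" when it is NULL. */
/* Returns NULL only if the allocation itself fails.                    */
char	*instance_dup(const char *instance)
{
	const char	*src = (NULL != instance ? instance : "");
	size_t		len = strlen(src) + 1;
	char		*copy = malloc(len);

	if (NULL != copy)
		memcpy(copy, src, len);

	return copy;
}
```

With this in place a lookup such as 0 == strcmp(perfvalue.first, "") is always safe, whether or not the counter had an instance.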

<dimir> CLOSED

Comment by dimir [ 2015 Jan 23 ]

(12) [PS] Another thing to check in src/zabbix_server/vmware/vmware.c:vmware_service_update_perf():

        /* get refresh rates */
        for (i = 0; i < entities.values_num; i++)
        {
                local_entity = entities.values[i];

                if (SUCCEED != vmware_service_get_perfcounter_refreshrate(service, easyhandle, local_entity->type,
                                local_entity->id, &local_entity->refresh, &error))
                {
                        zabbix_log(LOG_LEVEL_DEBUG, "cannot get refresh rate for performance entity (type:%s id:%s): %s",
                                        local_entity->type, local_entity->id, error);
                        zbx_free(error);
                }
        }

If we cannot get the refresh rate we just issue a DEBUG message. Perhaps we should issue a WARNING here.

wiper RESOLVED in r51886

<dimir> CLOSED

Comment by dimir [ 2015 Jan 26 ]

(13) [D] There are internal coefficients used in the code to translate KBytes to Bytes, as we like to always deal with bytes in history. In 2.2 these coefficients are not used:

checks_simple_vmware.c:vmware_counter_get(...int coeff, ...) <-- coeff is not used

while in this issue it started to affect:

checks_simple_vmware.c:vmware_service_counter_get(...int coeff, ...) <-- coeff is used

This will result in spikes of network and disk (bps) monitoring data. I guess it was kind of a bug which will be fixed in this issue, but it will also bring a regression. At least it must be mentioned in whatsnew and perhaps other places.
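
The size of the jump can be illustrated with a short sketch. It assumes a coefficient of 1024 (kilobytes to bytes); counter_to_bytes() and PERF_KIBIBYTE are illustrative names and are not the Zabbix functions or macros mentioned above.

```c
#define PERF_KIBIBYTE	1024	/* assumed KBytes-to-Bytes coefficient */

/* Applies the unit coefficient to a raw counter value. */
unsigned long long	counter_to_bytes(unsigned long long raw, int coeff)
{
	return raw * (unsigned long long)coeff;
}
```

Under that assumption, a counter reporting 120 was stored as 120 while the coefficient was ignored and is stored as 122880 once it is applied, so existing graphs and triggers see a roughly thousandfold step at upgrade time.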

wiper Note that VMware guest template has kilobytes in disk read/write item names (Average number of kilobytes read from the disk ...), which is not correct. It should be bytes (like in network interface read/write item names). So we will have to update templates too.

wiper RESOLVED in r51887

<dimir> Agree (will be CLOSED after documenting that)

wiper RESOLVED - see (16)

<dimir> CLOSED

Comment by dimir [ 2015 Jan 27 ]

(14) [D] Let's not forget to document that after this fix it is strongly recommended to enable at least 2 vmware collectors for vmware monitoring.

<richlv> probably worth mentioning in upgrade notes, vmware monitoring pages and whatsnew (including the reason why) ?

wiper RESOLVED, see (16)

<dimir> CLOSED

Comment by dimir [ 2015 Jan 27 ]

(15) [PS] Let's rename the following macros to reflect delay, these are not actually meant for TTL:

src/zabbix_server/vmware/vmware.c:ZBX_VMWARE_CACHE_TTL
src/zabbix_server/vmware/vmware.c:ZBX_VMWARE_PERF_TTL

wiper RESOLVED in r51888

<dimir> CLOSED

Comment by dimir [ 2015 Jan 27 ]

Performance improved greatly. Tested with 1 service with 1 hypervisor with 12 virtual machines (8 up, 4 down):

Before the fix (vmware structural data and performance counters data are collected in one process, 1 minute delay by default):

vmware collector #1 [updated 1, removed 0 VMware services in 12.473922 sec, idle 5 sec]
vmware collector #1 [updated 1, removed 0 VMware services in 12.473922 sec, querying VMware services]
vmware collector #1 [updated 1, removed 0 VMware services in 17.624473 sec, idle 5 sec]
vmware collector #1 [updated 1, removed 0 VMware services in 17.624473 sec, querying VMware services]
vmware collector #1 [updated 1, removed 0 VMware services in 13.411039 sec, idle 5 sec]

After the fix (vmware structural data and performance counters data are collected in separate processes, 1 minute delay by default):

vmware collector #2 [updated 0, removed 0 VMware services in 0.550486 sec, querying VMware services]
vmware collector #1 [updated 1, removed 0 VMware services in 1.090372 sec, querying VMware services]
vmware collector #1 [updated 1, removed 0 VMware services in 0.255816 sec, idle 5 sec]
vmware collector #1 [updated 1, removed 0 VMware services in 0.255816 sec, querying VMware services]
vmware collector #2 [updated 1, removed 0 VMware services in 1.092574 sec, idle 5 sec]
vmware collector #1 [updated 0, removed 0 VMware services in 0.530225 sec, idle 5 sec]
vmware collector #2 [updated 1, removed 0 VMware services in 0.260374 sec, idle 5 sec]
Comment by richlv [ 2015 Jan 29 ]

(16) documentation :

  • performance improvements in whatsnew
  • template changes in template change page & upgrade notes
  • fixed template uploaded to templates page
  • logging changes in whatsnew (anywhere else ?) - r51886, maybe others

wiper For now only 2.2 documentation was updated. The changes in r51886 affected the new code, so no logging changes were made there.

Virtual machine monitoring:

VMware monitoring item keys:

What's new:

Template changes:

Upgrade notes:

martins-v I fixed some typos, so it looks ok to me now for copying to 2.4.4 documentation.

<dimir>

  • added note about recommended value for StartVMwareCollectors in configuration
  • added bit more information in details
  • fixed vmware.hv.fullname key in vmware keys (was vmware.hv.full.name)
  • added link to vmware configuration in whatsnew
  • added link to vmware configuration in upgrade notes

wiper, Could you please add units for vmware keys description where there are Integer return values?

Other than that looks good, please review my changes.

wiper Added units, fixed datastore read description, marked default modes

<dimir> Looks good except for vmware.vm.vfs.fs.size[<url>,<uuid>,<fsname>,<mode>]. According to http://pubs.vmware.com/vi30/sdk/ReferenceGuide/vim.Datastore.Summary.html this is bytes/% right? Perhaps we could add (bytes/%)?

Looks good to me, CLOSED

<richlv> looks like updated templates have not been uploaded, REOPENED
additionally, upgrade notes page does not explain how the issue in an existing template could be fixed

wiper RESOLVED

<richlv> 2.2.9 upgrade notes now mention this change twice;
2.4.4 upgrade notes still do not explain how to fix the issue

wiper Yes, I think it's worth mentioning this change together with the rest of vmware changes and we have to write about it also in Template changes.

<richlv>

  • it seemed highly confusing that we had less information in one section, more in another - i unified them in the "template changes" section, as that change did not actually change the way vmware monitoring itself worked. i also changed plural to singular in the "import" suggestion.
  • regarding the templates on zabbix.org, there is also the same template for 2.2.6, which implied that it got changed in 2.2.6. which issue changed it for 2.2.6 ?
  • the changelog entry probably should have the template component set ?

wiper Thanks, ZBX-7621, right - forgot.

wiper Changelog updated for 2.2, 2.4 and trunk

<richlv> as discussed on irc, apparently some items changed from kb to bytes, too. we should separately list items that changed in the templates, and items that will now return different data

<richlv> this subissue has grown a bit too large, so it will be split up :
19 - uploading of the templates
20 - changelog entries
21 - changed items in templates and items that will return different data now

<richlv> i bolded item keys in whatsnew entries. besides the issues that have been split out (see above), i believe this subissue is now CLOSED

Comment by Andris Zeila [ 2015 Jan 30 ]

Released in:

  • pre-2.2.9rc1 r51913
  • pre-2.4.4rc1 r51914
  • pre-2.5.0 r51916
Comment by Andris Zeila [ 2015 Feb 05 ]

(17) [F] The vmware.hv.perfcounter and vmware.vm.perfcounter keys should be added to help items for simple checks.

wiper RESOLVED in r52028

sasha CLOSED

Comment by Alexander Vladishev [ 2015 Feb 06 ]

(18) String changes:

New translation strings:

  • VMware hypervisor performance counter, <url> - VMware service URL, <uuid> - VMware hypervisor host name, <path> - performance counter path, <instance> - performance counter instance
  • VMware virtual machine performance counter, <url> - VMware service URL, <uuid> - VMware virtual machine host name, <path> - performance counter path, <instance> - performance counter instance

sasha CLOSED

Comment by Andris Zeila [ 2015 Feb 06 ]

Released in:

  • pre-2.2.9rc1 r52046
  • pre-2.4.4rc1 r52047
  • pre-2.5.0 r52048
Comment by richlv [ 2015 Feb 18 ]

(19) uploading of the changed templates to https://www.zabbix.org/wiki/Zabbix_Templates/Official_Templates
(split out from 16)

template Template_Virt_VMware_Guest has been uploaded for 2.2.9 and 2.4.4 versions, as well as for 2.2.6 (this was forgotten in ZBX-7621) by wiper

this looks ok to me, CLOSED

Comment by richlv [ 2015 Feb 18 ]

(20) changelog entries did not have template component set
(split out from 16)

some changes from this issue and ZBX-7621 did not have template component set in the changelog entries

wiper RESOLVED

<richlv> thanks, CLOSED

Comment by richlv [ 2015 Feb 18 ]

(21) documenting changed template items and items that will now return different data
(split out from 16)

as discussed on irc, apparently some items changed from kb to bytes, and some only changed in the template. we should :

  • in the template changes, list exact item changes
  • in the upgrade notes, list exact item keys that will now return different data
  • in the listing of the keys, note which keys (and in which version) changed from what to what

wiper updated documentation:

wiper RESOLVED (2.2)

<richlv> minor change in upgrade notes (it -> they) - other than that looks good to me for adding to 2.4 docs. it looks like there's nothing to change for 3.0 documentation, is that correct ?

wiper Copied changes to 2.4, and you are right - there is nothing to update in 3.0 documentation.
RESOLVED

<richlv> looks good to me, CLOSED

Comment by Oleksii Zagorskyi [ 2015 Jun 26 ]

Looks like the changes here introduce some unclear drawbacks, described in ZBX-9466.

Generated at Sat Apr 20 00:25:49 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.