Resource Utilization Overview

What is Resource Utilization?

Device42 now offers “Resource Utilization” features (also known as the ‘monitoring discovered devices’ setting) along with a new license type. This module lets you examine server resource usage data to support capacity planning, migration planning, move-group selection, cloud target rightsizing, and other digital transformation projects.

Once enabled, a ‘monitoring’ option appears on the Hypervisor/*NIX/Windows Discovery settings screen. Stats are currently displayed on Linux- and Windows-based platform CIs.

Enable resource monitoring option

The functionality of the resource monitoring option depends on the following criteria being met:

- Verify the licensing module is enabled - on/off
    - The Monitoring checkbox (pictured above) will be disabled if the licensing module is disabled
- Verify monitoring is set [checked] at time of discovery
- Verify the monitoring option is checked after the device has been discovered (the job presumably must be run again; once a schedule is set, the next run should bring in data)

Handling of the same IP/machine instance across multiple RCs

If the same IP is discovered by multiple RCs, Device42 currently keeps monitoring it only from the RC that is already monitoring it; the other RCs will not start monitoring it again. Permitting duplicate monitoring could result in unexpected behavior. We will adjust this handling based on user feedback.

Monitoring management – An example scenario

Let’s consider three Devices A, B and C, and two Remote Collectors, RC#1 & #2. Monitoring is currently disabled on all three.

You have two discovery jobs configured:
– Job#1 includes Device A and Device B
– Job#2 includes Device B and Device C


You run Job#1 on RC#1 with monitoring enabled. After discovery, you will have:

  • DeviceA with monitoring on RC#1
  • DeviceB with monitoring on RC#1
  • DeviceC without monitoring

Then you decide to run discovery using Job#2 on RC#2 with monitoring disabled.
After this discovery, you will have:

  • DeviceA with monitoring on RC#1
  • DeviceB with monitoring on RC#1
  • DeviceC without monitoring

Then you change the settings on Job #2 and run it from RC#2 with monitoring enabled.
The end result will be:

  • DeviceA with monitoring on RC#1
  • DeviceB with monitoring on RC#1
  • DeviceC with monitoring on RC#2

As you can see, DeviceB does not switch which RC it’s attached to.

If you want to move a device to another RC, open the device list, select the device, and choose the Disable monitoring… option. After disabling monitoring, re-run the job on the new target RC with monitoring enabled.

Select Unknown Device to View

The difference between the options lies in how historical data for the device is handled. If “Keep Data” is selected, the data is stored for as long as needed, and if the same device is rediscovered, the existing data is reused automatically. The second option simply deletes all existing data from the server; when a previously existing device is rediscovered after that option was selected, its history begins anew.

Now re-run Job#2 on RC#2 with monitoring enabled once again.
After that run, you can see that DeviceB has moved to RC#2:

  • DeviceA with monitoring on RC#1
  • DeviceB with monitoring on RC#2
  • DeviceC with monitoring on RC#2
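
The scenario above can be replayed with a small model. This is only an illustrative sketch: the `run_job` and `disable_monitoring` functions and the dictionary of assignments are hypothetical stand-ins for the behavior described in this section, not Device42 internals or API calls.

```python
# Illustrative model only: names and data structures are assumptions,
# not Device42 internals.

def run_job(assignments, devices, rc, monitoring):
    """Apply one discovery run to the device -> RC monitoring map."""
    if not monitoring:
        return  # discovery without monitoring changes nothing
    for device in devices:
        # A device already monitored by some RC keeps that RC.
        assignments.setdefault(device, rc)


def disable_monitoring(assignments, device):
    """Mimic the 'Disable monitoring' action from the device list."""
    assignments.pop(device, None)


job1 = ["DeviceA", "DeviceB"]
job2 = ["DeviceB", "DeviceC"]
assignments = {}

run_job(assignments, job1, "RC#1", monitoring=True)
run_job(assignments, job2, "RC#2", monitoring=False)
run_job(assignments, job2, "RC#2", monitoring=True)
after_three_runs = dict(assignments)  # DeviceB is still on RC#1 here

# Moving DeviceB: disable monitoring, then re-run the job on the new RC.
disable_monitoring(assignments, "DeviceB")
run_job(assignments, job2, "RC#2", monitoring=True)
print(assignments)
```

The `setdefault` call is what captures the "first RC wins" rule: a second run never overwrites an existing assignment, so the only way to move a device is to remove the assignment first.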

TECHNICAL INFORMATION: How Resource data is stored

Monitoring data is kept on the RC in a TSDB. A dedicated database called sensors is used for this purpose, and it contains the following series:

  • infeeds – stores infeeds stats
  • outlets – stores outlets stats
  • banks – stores banks stats
  • battery – stores battery stats
  • device – stores device sensors – usually load, power factor, etc.
  • env_sensor – stores all types of device_sensors – humidity, temperature, cpu, etc.

You can think about these series as if they were Excel sheets, in which the first column is always a timestamp.

For example, a memory series looks like this:

A memory series in Device42

In general, chart generation doesn’t use every data point, as there tend to be quite a lot of them [for example, a 30 sec. interval for a month = 86,400 data points]. Aggregation is used instead, which is a common way to visualize this type of data. Aggregation takes multiple points and reduces them to a single value using the selected aggregation function. Currently, Device42 supports three functions: MIN, AVG and MAX.

As an example, generating AVG physical values at 5-minute intervals, with a point every minute, from the table in the screenshot above gives:

  • (53.242 + 51.672) / 2 = 52.457
  • (52.688 + 52.676) / 2 = 52.682
  • etc.

The MIN will return:

  • 51.672
  • 52.676
  • etc.
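
The aggregation described above can be sketched in a few lines. This is a minimal illustration, assuming the points are grouped into fixed-size buckets of consecutive values; the `aggregate` function and the bucket size of 2 are assumptions chosen to match the arithmetic shown, not the RC's actual implementation.

```python
from statistics import mean

# Map the metric names used in this document to Python functions.
AGGREGATORS = {"MIN": min, "AVG": mean, "MAX": max}


def aggregate(values, bucket_size, metric):
    """Reduce each run of `bucket_size` consecutive points to one value."""
    func = AGGREGATORS[metric]
    return [
        round(func(values[i:i + bucket_size]), 3)
        for i in range(0, len(values), bucket_size)
    ]


# The four physical-memory values used in the example above.
physical = [53.242, 51.672, 52.688, 52.676]

print(aggregate(physical, 2, "AVG"))
print(aggregate(physical, 2, "MIN"))
print(aggregate(physical, 2, "MAX"))
```

The AVG and MIN calls reproduce the values listed above; MAX is included for completeness.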

Once the monitoring option is checked, monitoring data will be correlated with the discovered device.


What happens if an RC is down?

Charts and reports are built by querying the RC, so if the RC is down, nothing will be shown. After the RC restarts, the charts and reports will contain empty gaps for the period it was offline.

Data Capture Interval

Available intervals:
– SNMP -> 1 second
– Linux -> 5 seconds
– Windows -> 15 seconds

Data Visualization

A “Trends” button appears in the top left corner of the page for any device that has an RC attached. The button is displayed only if monitoring is active and the license allows it.

The device_sensors table, added for power monitoring, is reused to store all of the above as “sensors”, since it already exists. The same Trends button is used to visualize this data.
Trend reports will show in the upper right corner of device pages.
Verify that these trend pages don’t show on devices they aren’t applicable to.


Captured Data Details

CPU:

  • CPU: the arithmetic mean of the CPU-1…N loads, expressed as a percent
  • CPU-1…N: the actual load of each individual CPU, as a percent

Memory

  • Total: sum of physical and swap in megabytes
  • Physical: RAM used in megabytes
  • Swap: swap/page file used in megabytes

Disks

  • Name: name of the HDD
  • WriteLatency: latency of the write operations in ms
  • WriteIORate: speed of the write operations in MB/s
  • WriteIOPS: number of write operations per second
  • WriteTransfer: raw number of bytes written to disk
  • ReadLatency: latency of the read operations in ms
  • ReadIORate: speed of the read operations in MB/s
  • ReadIOPS: number of read operations per second
  • ReadTransfer: raw number of bytes read from disk

Network

  • Name: name of the adapter
  • InSpeed: download speed in MB/s
  • InTransfer: raw number of bytes received by adapter
  • OutSpeed: upload speed in MB/s
  • OutTransfer: raw number of bytes transmitted by adapter

For the most part, Device42 will display aggregated values.
The only exception to this in v1 is for Transfer values. They are written as raw numbers and are constantly growing, so there is little point in displaying the values themselves. Instead, the difference for a given interval is displayed.

So, let’s look at the following example data for ReadTransfer:

00:01 – 100 bytes
00:02 – 123 bytes
00:03 – 234 bytes

If a user requests data between 00:01 and 00:03 with density=3 (see the APIs section for more information about density), Device42 will return:

00:01 – 0 bytes
00:02 – 23 bytes
00:03 – 111 bytes
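
The conversion from raw counters to per-interval differences can be sketched as follows. This is a minimal illustration of the arithmetic only; the `transfer_diffs` function is a hypothetical name, and reporting 0 for the first point (which has no predecessor) is inferred from the example above.

```python
def transfer_diffs(samples):
    """Turn cumulative byte counters into per-interval differences."""
    diffs = []
    previous = None
    for timestamp, total_bytes in samples:
        # The first sample has nothing to compare against, so it reports 0.
        delta = 0 if previous is None else total_bytes - previous
        diffs.append((timestamp, delta))
        previous = total_bytes
    return diffs


# Raw ReadTransfer samples from the example above.
read_transfer = [("00:01", 100), ("00:02", 123), ("00:03", 234)]

for timestamp, delta in transfer_diffs(read_transfer):
    print(f"{timestamp} - {delta} bytes")
```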

If results cannot be retrieved from an RC (e.g. the RC is down), an “Inaccessible Remote Collector” message will be displayed on trend reports.


Reporting

There are two kinds of reports available based on captured data:

  • Peak usage report – sums up the data [detailed below].

  • Data as captured – this will show the RAW data, as captured, every N minutes for interval X

Peak calculations:

  • CPU – A single number that represents the sum of all (CPU power times peak usage percentage)

  • Memory – Total Peak, RAM, Swap and RAM + Swap

  • Network – Peak per card

  • Disk – Peak IO across disks and Peak latency


APIs

Currently, there is no API endpoint to retrieve the data in JSON format, but data can be fetched as a .CSV:

/service/data/v1.0/trends/?id=2714&metric=AVG&timezoneoffset=-180&timeperiod=3&density=110&end_date=09%2F24%2F17+21%3A17%3A24

General parameters

  • type – type of report, currently supports only device. [Optional].

  • id – device ID

  • ids – comma separated list of IDs. Optional.

  • metric – the aggregation function that will be used. Can be AVG, MIN, or MAX.

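  • timezoneoffset – your time zone’s offset from GMT in minutes, with the sign inverted relative to the usual UTC offset. For Moscow (UTC+3) it is -180, and for New York (UTC-4 during daylight saving time) it is 240 (no plus sign).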

  • end_date – the date and time of the final data point, in the format 12/31/17 15:16:17 (US date + 24-hour time).

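  • timeperiod – the length of time you want to observe. Judging by the examples in this section (timeperiod=1 for 30 minutes, timeperiod=6 for 24 hours), the value appears to be the position of the period in the list below rather than a literal number of hours.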

Possible values:

  • 30 minutes
  • 1 hour
  • 3 hours
  • 6 hours
  • 12 hours
  • 24 hours
  • 7 days
  • 31 days
  • 183 days
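
As a rough illustration of how the general parameters fit together, the sketch below rebuilds the example request path shown earlier in this section. The helper names `timezone_offset_minutes` and `trends_csv_path` are hypothetical, and the appliance host name and authentication are deliberately left out because this document does not describe them.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def timezone_offset_minutes(tz):
    """GMT offset in minutes, in this API's inverted sign (UTC+3 -> -180)."""
    return -int(tz.utcoffset(None).total_seconds() // 60)


def trends_csv_path(device_id, metric, tz, end, timeperiod, density):
    """Build the path and query string for a trends CSV request."""
    params = {
        "id": device_id,
        "metric": metric,  # AVG, MIN or MAX
        "timezoneoffset": timezone_offset_minutes(tz),
        "timeperiod": timeperiod,
        "density": density,
        # US date + 24-hour time, e.g. 12/31/17 15:16:17
        "end_date": end.strftime("%m/%d/%y %H:%M:%S"),
    }
    return "/service/data/v1.0/trends/?" + urlencode(params)


moscow = timezone(timedelta(hours=3))
new_york_dst = timezone(timedelta(hours=-4))

path = trends_csv_path(
    2714, "AVG", moscow, datetime(2017, 9, 24, 21, 17, 24), 3, 110
)
print(path)
```

`urlencode` takes care of escaping the slashes, colons and the space in end_date, which is why the value appears as 09%2F24%2F17+21%3A17%3A24 in the request.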

Data points control parameters

For data points control, you should use one of these parameters. Choose one or the other, as trying to use both will cause one to override the other.

  • interval – specify the number of seconds between data points. If you want to get AVG/MIN/MAX data at 5 minute intervals for last 24 hours, set interval=300 and you will receive 288 data points.

  • density – the number of points to return for the whole time period. It is similar to interval, but use it when you want an exact number of points for a given period. With timeperiod=6 (24 hours) and density=100, you will get 100 points at an interval of ~14.4 minutes. With density=1000, you will get 1000 points at an interval of ~1.4 minutes.

There is an important limitation for both control parameters. If the device polling interval is N seconds and N is greater than the requested interval, the RC will reset the interval to N. For example, if the polling interval for a device is 15 seconds and you set density=1000 and timeperiod=1 (30 minutes), you will not get 1000 points. You will get 30 min × 60 sec = 1800 sec, divided by the 15-second polling interval = 120 points.
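
The arithmetic behind interval, density and the polling-interval limit can be checked with a short sketch. The `expected_points` function is a hypothetical helper that only mirrors the rules stated above; how the RC rounds or aligns points internally is not documented here, so treat the results as estimates.

```python
def expected_points(period_seconds, polling_seconds, interval=None, density=None):
    """Estimate the point count and effective interval for a request.

    Pass exactly one of `interval` (seconds) or `density` (point count).
    """
    if (interval is None) == (density is None):
        raise ValueError("pass exactly one of interval or density")
    if density is not None:
        # Density implies an interval: the period split into equal parts.
        interval = period_seconds / density
    # The RC never returns points closer together than the polling interval.
    effective = max(interval, polling_seconds)
    return round(period_seconds / effective), effective


DAY = 24 * 60 * 60
HALF_HOUR = 30 * 60

# A 15-second polling interval is assumed in all three cases.
print(expected_points(DAY, 15, interval=300))         # 5-minute points over 24 hours
print(expected_points(DAY, 15, density=100))          # 100 points over 24 hours
print(expected_points(HALF_HOUR, 15, density=1000))   # limited by polling interval
```

The last call shows the limitation in action: the requested interval of 1.8 seconds is raised to 15 seconds, so only 120 points come back.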

The CSV contains the following type-measure combinations of data:

  • Cpu-load – aggregated cpu load for selected interval in % (for CPU without number it is averaged for all numbered CPUs)

  • Mem-physical – aggregated physical memory used

  • Mem-swap – aggregated swap used

  • Disk-(total,write,read)_iops – aggregated iops for disk

  • Disk-(total,write,read)_iorate – aggregated iorate for disk

  • Disk-(total,write,read)_latency – aggregated latency for disk

  • Disk-(total,write,read)_transfer – raw transfer for disk at the end of the interval

  • Disk-(total,write,read)_transfer_diff – difference between raw transfer at end and start of interval

  • Nic-(in,out)_speed – aggregated speed for interface

  • Nic-(in,out)_transfer – raw transfer for interface at the end of the interval

  • Nic-(in,out)_tranfer_diff – difference between raw transfer at end and start of interval

What if my RC is offline?

If your target Remote Collector is offline, you will not be able to fetch data from it. All fields will either come back empty or will display the - character. Charts, too, will be empty. One exception is the PDU main page, which will display the latest values because its data is cached.

Resource Utilization v1 note — USE STATIC IPs ONLY:

If an RC and the Windows discovery service are both using DHCP IPs, automated IP changes can break the connection between them and therefore affect running jobs. It is STRONGLY recommended that STATIC IPs be used for all Device42 appliances, RCs, etc.

In the case that DHCP does break a connection, restart the job to resume monitoring. As of v1, DHCP IPs are NOT a supported configuration.