The CloudOps dashboard presents quick, easy-to-read performance and configuration information about the onboarded cloud accounts. It offers a bird’s-eye view of operations and highlights important issues that can affect smooth cloud operations.

This dashboard is divided into four sections:

  1. Tenant Based Summary
  2. Information
  3. Insights
  4. Inference

Tenant Based Summary

If an Account (Organisation) has multiple tenants, this section shows one tile per Tenant. Each tile shows consolidated information on cloud activity, monitoring alerts and automation failures in that Tenant. All other details in the sections below are loaded based on the selected Tenant.

Information

This section displays the status summary of cloud operations for the selected Tenant. There are separate tiles for:

  1. Activity Log
  2. Threshold Alerts
  3. Automation Failures
  4. Person Hours Saved

You can click on each tile to view a table below with the relevant information displayed.

Activity Log

This is a list of activities that happened in all cloud accounts mapped to this Tenant. CoreStack captures and stores every activity performed on your cloud accounts, whether through CoreStack or directly through the AWS or Azure portal.

For example, provisioning a VM or deleting a security group rule is recorded as an activity. This helps you monitor the activities performed in your cloud(s). A sudden increase in the activity log count can be a cause for concern. You can also choose to be notified by email or webhook for specific critical activities, such as the provisioning or deletion of certain resources.

This section shows activities for the last 24 hours, and only for the specific activities you selected while onboarding a cloud account. Notifications are also sent to the email addresses or webhooks configured while onboarding the cloud account.

The dashboard lists only the activities that you chose to be notified about. The complete list of activities is available in the Reports section.
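The filtering described above can be pictured with a minimal sketch. The record layout (`type`, `time`) and the activity names are hypothetical and used only for illustration. They are not CoreStack's actual data model or API.

```python
from datetime import datetime, timedelta, timezone

def dashboard_activities(activities, subscribed_types, now=None):
    """Keep activities from the last 24 hours whose type was selected at onboarding.

    `activities` is a list of dicts with hypothetical keys "type" and "time".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=24)
    return [
        a for a in activities
        if a["type"] in subscribed_types and cutoff <= a["time"] <= now
    ]

# Illustrative data only.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
activities = [
    {"type": "vm.provision", "time": now - timedelta(hours=2)},   # recent, selected
    {"type": "vm.provision", "time": now - timedelta(hours=30)},  # older than 24 hours
    {"type": "tag.update", "time": now - timedelta(hours=1)},     # not selected
]
shown = dashboard_activities(activities, {"vm.provision"}, now=now)
```

In this sketch only the first record would appear on the dashboard tile. The other two would still be part of the complete activity list.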

Threshold Alerts

These are alerts generated from monitoring your cloud resources. An alert indicates that a metric has crossed its preset threshold limit, for example a spike in CPU usage or VM downtime. You can change threshold values to match your monitoring priorities by editing the cloud account in the OnBoarding section.

The monitoring can be the respective cloud’s native monitoring (AWS CloudWatch, Azure Monitor) or an integrated monitoring tool such as Zabbix.

Only the resource monitoring alerts configured as part of cloud account onboarding are displayed here. Notifications are also sent to the email addresses or webhooks configured while onboarding the cloud account.

Click on an alert for a more detailed view, analysis and methods of resolution.

Analysis

Metric utilization trends can be viewed over a day, a week or a month. You can also see a machine-learning-based forecast of the utilization for the next 15 days. This helps you plan and make informed decisions.

The right panel is split into three sections: Observation, Prediction and Prescribe.

Observation: This section shows the deviations in the metric. There is a comparison of the average threshold of a given metric versus the recorded deviation. This list shows the top three deviations noted by the system.

Prediction: This section shows the predicted variation in the metric's utilization over the next 24 hours and the next 7 days.

Prescribe: In this section, you can rewrite the threshold condition based on usage. Depending on the average trend, the buffer value can be increased by, say, 20%, thereby changing the threshold limit.
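As a worked illustration of the Prescribe step, the sketch below assumes one possible reading: the threshold limit is the average utilization plus a percentage buffer. This formula and the function name are assumptions for illustration. The product may compute the limit differently.

```python
def prescribed_threshold(average_utilization, buffer_pct):
    """Return a threshold limit as the average trend plus a percentage buffer.

    Assumed formula (not confirmed by the product documentation):
        threshold = average + average * buffer_pct / 100
    """
    if buffer_pct < 0:
        raise ValueError("buffer_pct must be non-negative")
    return average_utilization + average_utilization * buffer_pct / 100

# Average CPU utilization of 60% with a 20% buffer.
new_limit = prescribed_threshold(60, 20)
```

Under this assumption, an average of 60% with a 20% buffer gives a threshold limit of 72%.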

Resolve

Resolve threshold alerts using the Resolve button provided next to each alert.

This opens a pop-up box with further resolution options. A confidence level is assigned to each resolution option. CoreStack’s Machine Learning capabilities learn from the decisions made each time and increase the confidence levels over time.

The resolution actions vary based on the resource type and the monitoring threshold, and the confidence level is calculated from the previous actions performed by users. There are three resolution actions available for Virtual Machines:

  1. Stop the virtual machine
  2. Start the virtual machine
  3. Resize the virtual machine

You can choose any one of these actions and apply it from here to resolve the alert.
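To show how a confidence level could be derived from previous user actions, here is a minimal sketch that scores each action by how often it was chosen in the past. This frequency-based scoring is an assumption made for illustration. The source does not describe CoreStack's actual Machine Learning model, and the action names are hypothetical labels.

```python
from collections import Counter

def confidence_levels(past_actions, available_actions):
    """Score each available action by its share of previously chosen actions."""
    counts = Counter(a for a in past_actions if a in available_actions)
    total = sum(counts.values())
    if total == 0:
        # No history yet: no action is preferred over another.
        return {a: 0.0 for a in available_actions}
    return {a: counts[a] / total for a in available_actions}

# Hypothetical history of actions users chose for similar alerts.
history = ["resize", "resize", "stop", "resize"]
levels = confidence_levels(history, ["stop", "start", "resize"])
```

With this history, "resize" gets the highest score because users chose it three times out of four. Each new decision changes the shares, which matches the idea of confidence levels shifting over time.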

Automation Failures

CoreStack employs automation features such as Templates and Scripts to make operations streamlined and automatic. At times, issues such as missing permissions or incorrect parameters stop these Templates and Scripts from being executed. These failures appear in the Automation Failures list, and the account admin is also notified by email.

Click “View” to see the details of each failure and attempt a resolution.

For example, the status of a failed automation job may be displayed as “Create Failed”. Check the “Output” section in the right panel and scroll down to the last step that failed; the error message there shows the reason that step failed.

Once you have understood the reason and taken the steps to resolve it, re-run the job by navigating to the “Action” column and selecting rerun from the drop-down menu. You can also rerun the job directly from the Template / Script.

Person Hours Saved

CoreStack’s automation capabilities help organizations cut back on the person hours otherwise spent on manual management and governance of cloud environments. This counter displays the number of person hours saved by CoreStack.

Hover over the “I” icon to view the break-up in terms of Level-1, Level-2 and Level-3. The level indicates the complexity of the operation.

The numbers are calculated based on the settings provided for each template. These settings are pre-defined for marketplace templates. For templates that you upload, you can add or modify the settings as required.
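The break-up by level can be pictured with a small sketch. It assumes each template setting carries a level and an hours-saved-per-run value, and that the counter sums those values over runs. The field layout, template names and numbers are hypothetical and do not come from CoreStack.

```python
def person_hours_saved(template_settings, run_counts):
    """Sum hours saved per level, given per-template settings and run counts.

    template_settings: {template_name: (level, hours_saved_per_run)}  (assumed shape)
    run_counts:        {template_name: number_of_runs}                (assumed shape)
    """
    breakup = {"Level-1": 0.0, "Level-2": 0.0, "Level-3": 0.0}
    for name, runs in run_counts.items():
        level, hours_per_run = template_settings[name]
        breakup[level] += hours_per_run * runs
    return breakup, sum(breakup.values())

# Illustrative settings and run counts only.
settings = {
    "create-vm": ("Level-1", 0.5),
    "patch-cluster": ("Level-3", 4.0),
}
runs = {"create-vm": 10, "patch-cluster": 2}
breakup, total = person_hours_saved(settings, runs)
```

In this sketch the total is the figure shown on the tile, and `breakup` corresponds to the per-level values shown on hover.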

Insights

This section displays the state of the cloud environment over the past 30 days. The number of activity logs and threshold alerts determines the noisiness of a cloud account, and this section highlights the accounts that face problems frequently.

Last 30 days trends

This line graph displays the number of activity logs, automation failures and threshold alerts for each day over a 30-day period. This helps you monitor spikes and identify the reasons for them.

Top 5 Noisy Accounts (Across Cloud Providers)

This pie chart shows the top 5 cloud accounts by number of threshold alerts and activity logs in the past 30 days.

Top 5 Noisy Resources (Across Cloud Accounts)

This pie chart shows the top 5 resources, such as virtual machines or clusters, by number of threshold alerts and activity logs.
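A ranking like the two "Top 5 Noisy" charts can be sketched by counting threshold alerts and activity logs per account or per resource. The event layout (`account`, `resource`) is a hypothetical shape chosen for illustration, and counting alerts and activity logs with equal weight is an assumption.

```python
from collections import Counter

def top_noisy(events, key, n=5):
    """Rank values of `key` by how many events (alerts plus activity logs) they have."""
    counts = Counter(event[key] for event in events)
    return counts.most_common(n)

# Illustrative events: each alert or activity log names its account and resource.
events = [
    {"account": "aws-prod", "resource": "vm-1"},
    {"account": "aws-prod", "resource": "vm-1"},
    {"account": "aws-prod", "resource": "vm-2"},
    {"account": "azure-dev", "resource": "cluster-a"},
]
noisy_accounts = top_noisy(events, "account")
noisy_resources = top_noisy(events, "resource")
```

The same helper serves both charts: group by account for the across-providers view, and by resource for the across-accounts view.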

Inference

This section displays the forecast increase or decrease in threshold alerts. Alert predictions are shown for the next 1 day, 2 days and 15 days.