
Rollups & Outages Guide

ThingConnect Pulse automatically processes your monitoring data to provide meaningful analytics and outage tracking. This guide explains how the rollup system works and how outages are detected and reported.

Rollup System Overview

The rollup system automatically aggregates raw monitoring data into time-based summaries, which keeps historical queries fast and makes long-term trend analysis practical.

Why Rollups Matter

Performance Benefits

  • Faster queries for historical data
  • Reduced storage requirements for long-term trends
  • Efficient dashboards and reporting

Analytical Value

  • Statistical summaries reveal patterns
  • Uptime percentages for SLA tracking
  • Response time trends for capacity planning

Automatic Processing

Background Operation

  • Runs every 5 minutes automatically
  • No manual intervention required
  • Processes completed time windows only

Data Integrity

  • Rollup processing never modifies raw data
  • Rollups supplement the original data; they do not replace it
  • Failed rollup jobs are retried automatically

15-Minute Rollups

Every 15 minutes, Pulse calculates aggregated statistics from all raw check data in that window.

Time Window Alignment

Fixed Boundaries

  • 00:00:00 - 00:14:59
  • 00:15:00 - 00:29:59
  • 00:30:00 - 00:44:59
  • 00:45:00 - 00:59:59

Processing Timing

  • Window data processed after completion
  • Typically available 5-10 minutes after window end
  • No partial window calculations
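
The fixed boundaries above amount to flooring a timestamp to the quarter hour. The following is a minimal Python sketch of that alignment, not Pulse's actual implementation; the names `window_bounds` and `is_complete` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple

WINDOW = timedelta(minutes=15)

def window_bounds(ts: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) bounds of the 15-minute window containing ts."""
    # Floor the minute to 0, 15, 30, or 45 and drop seconds/microseconds.
    start = ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)
    return start, start + WINDOW

def is_complete(window_end: datetime, now: datetime) -> bool:
    """A window is eligible for rollup only once it has fully elapsed."""
    return now >= window_end

start, end = window_bounds(datetime(2024, 1, 1, 10, 37, 12))
print(start.time(), end.time())  # 10:30:00 10:45:00
```

Treating the end bound as exclusive means a probe at exactly 10:45:00 belongs to the next window, which matches the 00:14:59 / 00:15:00 boundaries listed above.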

Uptime Percentage Calculation

The most important metric for availability monitoring:

Uptime % = (Successful Probes / Total Probes) × 100

Example Calculation

  • 15-minute window with 30-second probe intervals = 30 probes
  • 28 successful probes + 2 failed probes = 30 total
  • Uptime = (28 / 30) × 100 = 93.3%

Key Characteristics

  • Based on actual probe results, not time duration
  • Failed probes count regardless of failure reason
  • More frequent probing = more accurate percentage
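
As an illustration, the formula can be written as a small Python function. This is a sketch only; `uptime_percent` is a hypothetical name, and returning `None` for a window with no probes is an assumption, since the guide does not say how empty windows are handled.

```python
from typing import Optional

def uptime_percent(successful: int, total: int) -> Optional[float]:
    """Uptime % = (successful probes / total probes) x 100."""
    if total == 0:
        # Assumption: a window without probes has no meaningful uptime value.
        return None
    return successful / total * 100

# The example above: 28 successful probes out of 30.
print(round(uptime_percent(28, 30), 1))  # 93.3
```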

Average Response Time

Calculated from successful probes only:

Avg RTT = Sum of Response Times / Successful Probes

Important Notes

  • Failed probes are excluded from calculation
  • null if no successful probes in window
  • Represents actual performance when service is available

Example Calculation

  • Successful probes: 12ms, 15ms, 8ms, 22ms
  • Average RTT = (12 + 15 + 8 + 22) / 4 = 14.25ms
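
A short Python sketch of the same calculation, including the null case. The input shape (one entry per probe, with `None` marking a failed probe) and the name `average_rtt` are assumptions made for illustration.

```python
from typing import Optional, Sequence

def average_rtt(rtts_ms: Sequence[Optional[float]]) -> Optional[float]:
    """Average response time over successful probes only.

    Each entry is one probe; None marks a failed probe (assumed encoding).
    """
    ok = [r for r in rtts_ms if r is not None]
    if not ok:
        return None  # no successful probes in the window
    return sum(ok) / len(ok)

# Four successful probes and one failed probe, as in the example above.
print(average_rtt([12, 15, None, 8, 22]))  # 14.25
```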

Down Events Counting

Tracks service instability by counting status transitions:

Down Events = Number of UP → DOWN transitions

What It Measures

  • Service interruptions, not total downtime
  • Frequency of failures, not duration
  • Indicates flapping or unstable services

Example Scenarios

  • Steady UP: 0 down events
  • Single outage: 1 down event
  • Flapping service: Multiple down events

Understanding Down Events

A high uptime percentage with many down events indicates frequent brief interruptions. This pattern often points to network instability or service flapping.
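
Counting UP → DOWN transitions can be sketched in Python as follows. The function name, the string status values, and the `previous` parameter (the status carried over from the end of the preceding window) are assumptions for illustration, not Pulse's API.

```python
from typing import Iterable

def count_down_events(statuses: Iterable[str], previous: str = "UP") -> int:
    """Count UP -> DOWN transitions in a chronological status sequence."""
    events = 0
    prev = previous
    for status in statuses:
        if prev == "UP" and status == "DOWN":
            events += 1
        prev = status
    return events

print(count_down_events(["UP", "UP", "UP"]))                        # 0 (steady UP)
print(count_down_events(["UP", "DOWN", "DOWN", "UP"]))              # 1 (single outage)
print(count_down_events(["DOWN", "UP", "DOWN", "UP", "DOWN"]))      # 3 (flapping)
```

Note that a long outage and a one-probe blip both count as a single event, which is why this metric is read alongside uptime percentage rather than instead of it.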

Daily Rollups

Each day, Pulse creates summary statistics covering the entire 24-hour period.

Daily Calculation Process

Data Source

  • Aggregated from 15-minute rollups (not raw data)
  • 96 rollup windows per day (24 hours × 4 windows/hour)
  • More efficient than processing thousands of raw checks

Timing

  • Calculated shortly after midnight
  • Based on local server timezone
  • Previous day's data is processed

Daily Metrics

Daily Uptime Percentage

Daily Uptime = Average of all 15-minute uptime percentages

Daily Average Response Time

Daily Avg RTT = Weighted average of 15-minute averages

Daily Down Events

Daily Down Events = Sum of all 15-minute down events
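
The three daily formulas can be sketched together in Python. The guide does not state which weights the response-time average uses; this sketch assumes each window is weighted by its number of successful probes. The `Rollup15m` record and `daily_rollup` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Rollup15m:
    uptime_pct: float
    avg_rtt_ms: Optional[float]  # None when the window had no successful probes
    successful_probes: int       # assumed weight for the RTT average
    down_events: int

def daily_rollup(windows: List[Rollup15m]) -> Tuple[Optional[float], Optional[float], int]:
    """Return (daily uptime %, daily avg RTT in ms, daily down events)."""
    if not windows:
        return None, None, 0
    # Daily uptime: plain average of the 15-minute uptime percentages.
    uptime = sum(w.uptime_pct for w in windows) / len(windows)
    # Daily RTT: average weighted by successful probes, skipping null windows.
    weighted = [(w.avg_rtt_ms, w.successful_probes)
                for w in windows
                if w.avg_rtt_ms is not None and w.successful_probes > 0]
    weight = sum(n for _, n in weighted)
    avg_rtt = sum(r * n for r, n in weighted) / weight if weight else None
    # Daily down events: sum over all windows.
    down = sum(w.down_events for w in windows)
    return uptime, avg_rtt, down

windows = [Rollup15m(100.0, 10.0, 30, 0), Rollup15m(50.0, 20.0, 10, 2)]
print(daily_rollup(windows))  # (75.0, 12.5, 2)
```

Weighting by probe count keeps a window with only a few successful probes from skewing the daily figure as much as a fully healthy window would.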

Use Cases for Daily Rollups

Management Reporting

  • Monthly availability summaries
  • Service level agreement (SLA) tracking
  • Executive dashboards with daily KPIs

Trend Analysis

  • Long-term performance patterns
  • Seasonal variations in availability
  • Capacity planning based on response times

Comparative Analysis

  • Week-over-week performance comparison
  • Impact assessment of configuration changes
  • Historical baseline establishment

Outage Detection System

Pulse automatically identifies and tracks service outages using a flap damping algorithm, which requires consecutive matching probe results before an endpoint changes state.

Flap Damping Logic

Why Flap Damping?

  • Prevents false alarms from brief network hiccups
  • Focuses on sustained availability issues
  • Reduces alert noise and improves accuracy

2/2 Threshold Algorithm

  • Outage Start: 2 consecutive failed probes
  • Outage End: 2 consecutive successful probes
  • Balances responsiveness with stability

Outage Lifecycle

Outage Detection

  1. First Failure: Probe fails, endpoint status remains UP
  2. Second Failure: Probe fails again, outage is confirmed (its start time is the first failure)
  3. Outage Active: Status changes to DOWN
  4. First Success: Probe succeeds, endpoint status still DOWN
  5. Second Success: Probe succeeds again, outage is closed (its end time is the first success)
  6. Recovery Complete: Status changes back to UP
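
The lifecycle above behaves like a small state machine driven by consecutive probe results. The following Python sketch shows one way to express the 2/2 threshold; the `FlapDamper` class and its field names are hypothetical and do not reflect Pulse's internal code.

```python
from dataclasses import dataclass

THRESHOLD = 2  # consecutive results needed to change state (the "2/2" rule)

@dataclass
class FlapDamper:
    status: str = "UP"
    fail_streak: int = 0
    success_streak: int = 0

    def observe(self, probe_ok: bool) -> str:
        """Feed one probe result and return the damped status."""
        if probe_ok:
            self.success_streak += 1
            self.fail_streak = 0
            if self.status == "DOWN" and self.success_streak >= THRESHOLD:
                self.status = "UP"
        else:
            self.fail_streak += 1
            self.success_streak = 0
            if self.status == "UP" and self.fail_streak >= THRESHOLD:
                self.status = "DOWN"
        return self.status

damper = FlapDamper()
print([damper.observe(ok) for ok in (False, False, True, True)])
# ['UP', 'DOWN', 'DOWN', 'UP']
```

A single failed probe surrounded by successes never reaches the threshold, so the status stays UP and no outage is opened.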

Timestamp Recording

Start Timestamp

  • Time of the first failed probe
  • Captures actual beginning of service interruption
  • Used for outage duration calculation

End Timestamp

  • Time of the first successful probe after failure
  • Indicates when service actually recovered
  • May differ from status change time due to damping

Outage Duration Calculation

Duration = End Timestamp - Start Timestamp

Precision

  • Calculated in seconds for accuracy
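  • Includes the failure confirmation period (first to second failed probe) but not the recovery confirmation period
  • Represents the total time the service was unavailable, to the resolution of the probe interval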

Example Timeline

  • 14:30:15 - First failure (outage starts)
  • 14:30:45 - Second failure (status → DOWN)
  • 14:33:20 - First success (service recovered)
  • 14:33:50 - Second success (status → UP)
  • Outage Duration: 3 minutes 5 seconds
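
The timeline arithmetic can be checked with a few lines of Python. The calendar date is arbitrary, and `outage_duration_seconds` and `format_duration` are hypothetical helper names used only for this sketch.

```python
from datetime import datetime

def outage_duration_seconds(start: datetime, end: datetime) -> int:
    """Duration = end timestamp - start timestamp, in whole seconds."""
    return int((end - start).total_seconds())

def format_duration(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} minutes {secs} seconds"

start = datetime(2024, 1, 1, 14, 30, 15)  # first failed probe
end = datetime(2024, 1, 1, 14, 33, 20)    # first successful probe after failure
seconds = outage_duration_seconds(start, end)
print(seconds, "->", format_duration(seconds))  # 185 -> 3 minutes 5 seconds
```

The status change times (14:30:45 and 14:33:50) do not enter the calculation; only the first failure and the first success do.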

Outage Classification

Sustained Outages

  • Clear start and end times
  • Duration measured in minutes or hours
  • Typically infrastructure or service failures

Flapping Detection

  • Multiple short outages in succession
  • Pattern indicates instability
  • May require different response than single outage

Outage vs Status

An endpoint's current status (UP/DOWN) reflects the flap-damped state, while outage records track the actual service interruption periods for historical analysis.

Analytics and Reporting

Key Performance Indicators (KPIs)

Availability Metrics

  • Monthly uptime percentage
  • Mean time between failures (MTBF)
  • Service level agreement compliance

Performance Metrics

  • Average response time trends
  • 95th percentile response times
  • Performance degradation detection

Reliability Metrics

  • Outage frequency and duration
  • Recovery time objectives (RTO)
  • Planned vs. unplanned downtime

Trend Analysis Techniques

Comparative Analysis

  • Week-over-week availability comparison
  • Before/after performance analysis
  • Seasonal pattern identification

Threshold Monitoring

  • Response time degradation alerts
  • Availability below SLA thresholds
  • Unusual down event frequency

Capacity Planning

  • Response time growth trends
  • Peak usage period identification
  • Infrastructure scaling decisions

Reporting Best Practices

Time Period Selection

  • Use 15-minute rollups for daily/weekly reports
  • Use daily rollups for monthly/quarterly reports
  • Use raw data only for incident analysis

Metric Interpretation

  • High uptime with many down events = instability
  • Increasing response times = capacity concerns
  • Consistent patterns = predictable behavior

Data Export and Integration

  • CSV export for external analysis
  • API access for automated reporting
  • Database queries for custom analytics

Rollup Data Retention

Storage Policies

15-Minute Rollups

  • Unlimited retention - never deleted
  • Minimal storage impact (few KB per endpoint per day)
  • Enables long-term trending and analysis

Daily Rollups

  • Unlimited retention - permanent storage
  • Extremely compact (bytes per endpoint per day)
  • Perfect for multi-year historical analysis

Raw Data

  • 60-day retention - automatically pruned
  • Most detailed but largest storage requirement
  • Available for recent incident analysis

Performance Optimization

Query Efficiency

  • Rollups dramatically improve query speed
  • Historical dashboards load faster
  • Large date range analysis becomes practical

Storage Efficiency

  • 96 rollup records vs thousands of raw checks per day
  • Minimal database growth for long-term storage
  • Cost-effective for extended retention periods
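
To put rough numbers on the comparison, the sketch below counts records per endpoint per day for a few probe intervals. The intervals are examples only (30 seconds is the interval used in the uptime example earlier in this guide), and `raw_checks_per_day` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60
ROLLUPS_PER_DAY = 24 * 4  # four 15-minute windows per hour

def raw_checks_per_day(probe_interval_s: int) -> int:
    """Number of raw check records one endpoint produces per day."""
    return SECONDS_PER_DAY // probe_interval_s

for interval in (10, 30, 60):
    raw = raw_checks_per_day(interval)
    print(f"{interval}s interval: {raw} raw checks vs {ROLLUPS_PER_DAY} rollups "
          f"({raw // ROLLUPS_PER_DAY}x fewer records)")
# 10s interval: 8640 raw checks vs 96 rollups (90x fewer records)
# 30s interval: 2880 raw checks vs 96 rollups (30x fewer records)
# 60s interval: 1440 raw checks vs 96 rollups (15x fewer records)
```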

Monitoring Your Monitoring

System Health Indicators

Rollup Processing

  • Check for failed rollup jobs in logs
  • Verify rollup data is current
  • Monitor database storage growth

Data Quality

  • Consistent probe execution intervals
  • Reasonable response time values
  • Proper outage detection sensitivity

Troubleshooting Rollups

Missing Rollup Data

  • Check background service operation
  • Review rollup job logs for errors
  • Verify database connectivity and permissions

Incorrect Calculations

  • Validate raw data quality
  • Check timezone configuration
  • Review probe configuration consistency
