Rollups & Outages Guide

ThingConnect Pulse automatically processes your monitoring data to provide meaningful analytics and outage tracking. This guide explains how the rollup system works and how outages are detected and reported.

Rollup System Overview

The rollup system automatically aggregates raw monitoring data into time-based summaries, which speeds up historical queries and supports long-term trend analysis.

Why Rollups Matter

Performance Benefits

Faster queries for historical data
Reduced storage requirements for long-term trends
Efficient dashboards and reporting

Analytical Value

Statistical summaries reveal patterns
Uptime percentages for SLA tracking
Response time trends for capacity planning

Automatic Processing

Background Operation

Runs every 5 minutes automatically
No manual intervention required
Processes completed time windows only

Data Integrity

Raw data remains unchanged
Rollups supplement the original data; they do not replace it
Failed rollup jobs are retried automatically

15-Minute Rollups

For each 15-minute window, Pulse calculates aggregated statistics from all raw check data recorded in that window.

Time Window Alignment

Fixed Boundaries

00:00:00 - 00:14:59
00:15:00 - 00:29:59
00:30:00 - 00:44:59
00:45:00 - 00:59:59

Processing Timing

Window data processed after completion
Typically available 5-10 minutes after window end
No partial window calculations
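The fixed boundaries above can be sketched as a small helper that floors a timestamp to its window start. This is an illustrative sketch only: the names `window_start`, `window_bounds`, and `is_complete` are hypothetical and are not part of the Pulse API.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)

def window_bounds(ts: datetime) -> tuple[datetime, datetime]:
    """Return the half-open interval [start, end) of the window containing ts."""
    start = window_start(ts)
    return start, start + WINDOW

def is_complete(start: datetime, now: datetime) -> bool:
    """A window is eligible for rollup only once its end has passed."""
    return now >= start + WINDOW

# A check recorded at 00:47:12 belongs to the 00:45:00 - 00:59:59 window.
start, end = window_bounds(datetime(2024, 1, 1, 0, 47, 12))
```

Treating each window as a half-open interval means a check at exactly 00:15:00 falls in the second window, which matches the boundaries listed above.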

Uptime Percentage Calculation

The most important metric for availability monitoring:

Uptime % = (Successful Probes / Total Probes) × 100

Example Calculation

15-minute window with 30-second probe intervals = 30 probes
28 successful probes + 2 failed probes = 30 total
Uptime = (28 / 30) × 100 = 93.3%

Key Characteristics

Based on actual probe results, not time duration
Failed probes count regardless of failure reason
More frequent probing = more accurate percentage
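The formula and example above can be reproduced in a few lines. This is a minimal sketch, assuming each probe result is represented as a boolean; the function name `uptime_percent` is hypothetical, and returning None for an empty window is an assumption rather than documented Pulse behavior.

```python
from typing import Optional, Sequence

def uptime_percent(results: Sequence[bool]) -> Optional[float]:
    """Uptime % = (successful probes / total probes) x 100."""
    total = len(results)
    if total == 0:
        return None  # assumption: no probes in the window means no value
    return sum(results) / total * 100

# 15-minute window, 30-second interval: 28 successes and 2 failures.
window = [True] * 28 + [False] * 2
print(round(uptime_percent(window), 1))  # 93.3
```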

Average Response Time

Calculated from successful probes only:

Avg RTT = Sum of Response Times / Successful Probes

Important Notes

Failed probes are excluded from calculation
The value is null if the window contains no successful probes
Represents actual performance when service is available

Example Calculation

Successful probes: 12ms, 15ms, 8ms, 22ms
Average RTT = (12 + 15 + 8 + 22) / 4 = 14.25ms
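As a rough sketch of the same calculation, the snippet below averages only successful probes and returns None (the Python counterpart of null) when there are none. The `(success, rtt_ms)` tuple shape and the name `average_rtt_ms` are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

Probe = Tuple[bool, Optional[float]]  # (success, rtt_ms); hypothetical shape

def average_rtt_ms(probes: Sequence[Probe]) -> Optional[float]:
    """Average response time over successful probes only."""
    rtts = [rtt for ok, rtt in probes if ok and rtt is not None]
    if not rtts:
        return None  # no successful probes in the window
    return sum(rtts) / len(rtts)

probes = [(True, 12), (True, 15), (False, None), (True, 8), (True, 22)]
print(average_rtt_ms(probes))  # 14.25
```

The failed probe in the middle of the list does not affect the result, which is why the average describes performance only while the service is reachable.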

Down Events Counting

Tracks service instability by counting status transitions:

Down Events = Number of UP → DOWN transitions

What It Measures

Service interruptions, not total downtime
Frequency of failures, not duration
Indicates flapping or unstable services

Example Scenarios

Steady UP: 0 down events
Single outage: 1 down event
Flapping service: Multiple down events

Understanding Down Events

A high uptime percentage with many down events indicates frequent brief interruptions. This pattern often points to network instability or service flapping.
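Counting UP → DOWN transitions can be sketched as follows. The example assumes the input is an ordered sequence of "UP"/"DOWN" status values; whether Pulse counts transitions on raw probe results or on the damped status is not stated here, and the name `count_down_events` is hypothetical.

```python
from typing import Iterable, Optional

def count_down_events(statuses: Iterable[str], previous: Optional[str] = None) -> int:
    """Count UP -> DOWN transitions in an ordered sequence of statuses.

    `previous` is the last status seen before this sequence (assumption:
    it lets a transition at the window boundary be counted).
    """
    events = 0
    prev = previous
    for status in statuses:
        if prev == "UP" and status == "DOWN":
            events += 1
        prev = status
    return events

print(count_down_events(["UP", "UP", "UP"]))                          # 0
print(count_down_events(["UP", "DOWN", "DOWN", "UP"]))                # 1
print(count_down_events(["UP", "DOWN", "UP", "DOWN", "UP", "DOWN"]))  # 3
```

Note that the single long outage and a one-probe blip both count as one event, which is why this metric measures frequency rather than duration.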

Daily Rollups

Each day, Pulse creates summary statistics covering the entire 24-hour period.

Daily Calculation Process

Data Source

Aggregated from 15-minute rollups (not raw data)
96 rollup windows per day (24 hours × 4 windows/hour)
More efficient than processing thousands of raw checks

Timing

Calculated shortly after midnight
Day boundaries follow the server's local time zone
Processes the previous day's data

Daily Metrics

Daily Uptime Percentage

Daily Uptime = Average of all 15-minute uptime percentages

Daily Average Response Time

Daily Avg RTT = Weighted average of 15-minute averages

Daily Down Events

Daily Down Events = Sum of all 15-minute down events
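The three daily metrics can be sketched together as one aggregation over the day's 15-minute rollups. This is an illustrative sketch, not the Pulse implementation: the record types and field names are hypothetical, and the weight used for the response time average (the number of successful probes per window) is an assumption, since the weighting is not specified above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Rollup15m:               # hypothetical record shape
    uptime_pct: float
    avg_rtt_ms: Optional[float]
    down_events: int
    successful_probes: int     # assumed weight for the RTT average

@dataclass
class DailyRollup:             # hypothetical record shape
    uptime_pct: Optional[float]
    avg_rtt_ms: Optional[float]
    down_events: int

def daily_rollup(windows: Sequence[Rollup15m]) -> DailyRollup:
    if not windows:
        return DailyRollup(None, None, 0)
    # Daily uptime: simple average of the 15-minute uptime percentages.
    uptime = sum(w.uptime_pct for w in windows) / len(windows)
    # Daily RTT: average weighted by successful probes; null windows are skipped.
    weighted = [(w.avg_rtt_ms, w.successful_probes) for w in windows
                if w.avg_rtt_ms is not None and w.successful_probes > 0]
    total_weight = sum(n for _, n in weighted)
    avg_rtt = sum(r * n for r, n in weighted) / total_weight if total_weight else None
    # Daily down events: sum across all windows.
    return DailyRollup(uptime, avg_rtt, sum(w.down_events for w in windows))
```

Weighting by successful probes keeps a window with few successful probes from skewing the daily average as much as a fully healthy window would.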

Use Cases for Daily Rollups

Management Reporting

Monthly availability summaries
Service level agreement (SLA) tracking
Executive dashboards with daily KPIs

Trend Analysis

Long-term performance patterns
Seasonal variations in availability
Capacity planning based on response times

Comparative Analysis

Week-over-week performance comparison
Impact assessment of configuration changes
Historical baseline establishment

Outage Detection System

Pulse automatically identifies and tracks service outages using a flap damping algorithm that requires consecutive probe results before changing an endpoint's status.

Flap Damping Logic

Why Flap Damping?

Prevents false alarms from brief network hiccups
Focuses on sustained availability issues
Reduces alert noise and improves accuracy

2/2 Threshold Algorithm

Outage Start: 2 consecutive failed probes
Outage End: 2 consecutive successful probes
Balances responsiveness with stability

Outage Lifecycle

Outage Detection

First Failure: Probe fails, endpoint status remains UP
Second Failure: Probe fails again, outage begins
Outage Active: Status changes to DOWN
First Success: Probe succeeds, endpoint status still DOWN
Second Success: Probe succeeds again, outage ends
Recovery Complete: Status changes back to UP
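The lifecycle above behaves like a small state machine. The sketch below is one way to model the 2/2 threshold, assuming probe results arrive in order as booleans; the class name `DampedStatus` and its fields are hypothetical and do not reflect Pulse internals.

```python
class DampedStatus:
    """Flap-damped UP/DOWN status using a 2/2 threshold (illustrative sketch)."""

    THRESHOLD = 2  # consecutive contradicting probes needed to flip status

    def __init__(self, status: str = "UP") -> None:
        self.status = status
        self.streak = 0  # consecutive probes that contradict the current status

    def observe(self, success: bool) -> str:
        contradicts = success if self.status == "DOWN" else not success
        if contradicts:
            self.streak += 1
            if self.streak >= self.THRESHOLD:
                self.status = "UP" if success else "DOWN"
                self.streak = 0
        else:
            self.streak = 0  # streak is broken, status is unchanged
        return self.status

endpoint = DampedStatus()
print([endpoint.observe(ok) for ok in (False, False, True, True)])
# ['UP', 'DOWN', 'DOWN', 'UP']
```

A single failed probe followed by a success resets the streak, so an isolated hiccup never changes the status.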

Timestamp Recording

Start Timestamp

Time of the first failed probe
Captures actual beginning of service interruption
Used for outage duration calculation

End Timestamp

Time of the first successful probe after failure
Indicates when service actually recovered
May differ from status change time due to damping

Outage Duration Calculation

Duration = End Timestamp - Start Timestamp

Precision

Calculated in seconds for accuracy
Includes the failure confirmation period, because the outage starts at the first failed probe
Represents total time service was unavailable

Example Timeline

14:30:15 - First failure (outage starts)
14:30:45 - Second failure (status → DOWN)
14:33:20 - First success (service recovered)
14:33:50 - Second success (status → UP)
Outage Duration: 3 minutes 5 seconds
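The timeline above can be reproduced with a short sketch that records the first failed probe as the start and the first successful probe as the end, while still requiring two consecutive results to confirm each transition. The function name `find_outages` and the `(timestamp, success)` tuple shape are assumptions for illustration, and the date is arbitrary.

```python
from datetime import datetime
from typing import List, Optional, Tuple

def find_outages(probes: List[Tuple[datetime, bool]]) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs for confirmed outages in an ordered probe list."""
    outages = []
    down = False
    fail_streak = ok_streak = 0
    first_fail: Optional[datetime] = None
    first_ok: Optional[datetime] = None
    outage_start: Optional[datetime] = None
    for ts, ok in probes:
        if ok:
            fail_streak = 0
            ok_streak += 1
            if ok_streak == 1:
                first_ok = ts            # candidate end timestamp
            if down and ok_streak >= 2:  # recovery confirmed
                outages.append((outage_start, first_ok))
                down = False
        else:
            ok_streak = 0
            fail_streak += 1
            if fail_streak == 1:
                first_fail = ts          # candidate start timestamp
            if not down and fail_streak >= 2:  # outage confirmed
                down = True
                outage_start = first_fail
    return outages

day = (2024, 1, 1)  # arbitrary date for the example
probes = [
    (datetime(*day, 14, 30, 15), False),
    (datetime(*day, 14, 30, 45), False),
    (datetime(*day, 14, 33, 20), True),
    (datetime(*day, 14, 33, 50), True),
]
start, end = find_outages(probes)[0]
print(end - start)  # 0:03:05
```

The duration runs from 14:30:15 to 14:33:20, so it covers the 30 seconds spent confirming the failure but not the 30 seconds spent confirming the recovery.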

Outage Classification

Sustained Outages

Clear start and end times
Duration measured in minutes or hours
Typically infrastructure or service failures

Flapping Detection

Multiple short outages in succession
Pattern indicates instability
May require different response than single outage

Outage vs Status

An endpoint's current status (UP/DOWN) reflects the flap-damped state, while outage records track the actual service interruption periods for historical analysis.

Analytics and Reporting

Key Performance Indicators (KPIs)

Availability Metrics

Monthly uptime percentage
Mean time between failures (MTBF)
Service level agreement compliance

Performance Metrics

Average response time trends
95th percentile response times
Performance degradation detection

Reliability Metrics

Outage frequency and duration
Recovery time objectives (RTO)
Planned vs. unplanned downtime

Trend Analysis Techniques

Comparative Analysis

Week-over-week availability comparison
Before/after performance analysis
Seasonal pattern identification

Threshold Monitoring

Response time degradation alerts
Availability below SLA thresholds
Unusual down event frequency

Capacity Planning

Response time growth trends
Peak usage period identification
Infrastructure scaling decisions

Reporting Best Practices

Time Period Selection

Use 15-minute rollups for daily/weekly reports
Use daily rollups for monthly/quarterly reports
Use raw data only for incident analysis

Metric Interpretation

High uptime with many down events = instability
Increasing response times = capacity concerns
Consistent patterns = predictable behavior

Data Export and Integration

CSV export for external analysis
API access for automated reporting
Database queries for custom analytics

Rollup Data Retention

Storage Policies

15-Minute Rollups

Unlimited retention - never deleted
Minimal storage impact (few KB per endpoint per day)
Enables long-term trending and analysis

Daily Rollups

Unlimited retention - permanent storage
Extremely compact (bytes per endpoint per day)
Perfect for multi-year historical analysis

Raw Data

60-day retention - automatically pruned
Most detailed but largest storage requirement
Available for recent incident analysis
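To check whether a raw check is still within the retention period, the cutoff can be computed from the current time. This is a minimal sketch, assuming timestamps are timezone-aware UTC values and that a check is pruned once it is strictly older than 60 days; the exact boundary behavior and the names below are assumptions, not documented Pulse behavior.

```python
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=60)  # raw data retention from the policy above

def raw_prune_cutoff(now: datetime) -> datetime:
    """Oldest timestamp that is still retained (assumed inclusive)."""
    return now - RAW_RETENTION

def is_prunable(check_ts: datetime, now: datetime) -> bool:
    """True if a raw check falls outside the retention period."""
    return check_ts < raw_prune_cutoff(now)

now = datetime(2024, 3, 31, tzinfo=timezone.utc)
print(raw_prune_cutoff(now).date())  # 2024-01-31
```

Rollups are not affected by this cutoff, so analysis older than 60 days has to rely on the 15-minute or daily summaries.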

Performance Optimization

Query Efficiency

Rollups dramatically improve query speed
Historical dashboards load faster
Large date range analysis becomes practical

Storage Efficiency

96 rollup records vs thousands of raw checks per day
Minimal database growth for long-term storage
Cost-effective for extended retention periods
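The row-count difference is simple arithmetic. The sketch below compares raw checks per day against the 96 rollup records for a given probe interval; the 30-second interval is taken from the earlier uptime example, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
ROLLUP_WINDOWS_PER_DAY = 24 * 4  # 96 fifteen-minute windows

def raw_checks_per_day(interval_s: int) -> int:
    """Number of raw checks one endpoint produces per day."""
    return SECONDS_PER_DAY // interval_s

def reduction_factor(interval_s: int) -> float:
    """How many raw rows each 15-minute rollup row replaces on average."""
    return raw_checks_per_day(interval_s) / ROLLUP_WINDOWS_PER_DAY

print(raw_checks_per_day(30))  # 2880
print(reduction_factor(30))    # 30.0
```

At a 30-second interval, a query over one day reads 96 rollup rows instead of 2,880 raw rows per endpoint, and the gap widens as the probe interval shrinks.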

Monitoring Your Monitoring

System Health Indicators

Rollup Processing

Check for failed rollup jobs in logs
Verify rollup data is current
Monitor database storage growth

Data Quality

Consistent probe execution intervals
Reasonable response time values
Proper outage detection sensitivity

Troubleshooting Rollups

Missing Rollup Data

Check background service operation
Review rollup job logs for errors
Verify database connectivity and permissions

Incorrect Calculations

Validate raw data quality
Check timezone configuration
Review probe configuration consistency

Next Steps

Data Model: Understand the underlying data structure
Live Board & History: Use the web interface for analysis
Troubleshooting: Resolve rollup and outage tracking issues
API Reference: Access rollup data programmatically
