Rollups & Outages Guide
ThingConnect Pulse automatically processes your monitoring data to provide meaningful analytics and outage tracking. This guide explains how the rollup system works and how outages are detected and reported.
Rollup System Overview
The rollup system automatically aggregates raw monitoring data into time-based summaries. These summaries keep historical queries fast and make long-term trend analysis practical.
Why Rollups Matter
Performance Benefits
- Faster queries for historical data
- Reduced storage requirements for long-term trends
- Efficient dashboards and reporting
Analytical Value
- Statistical summaries reveal patterns
- Uptime percentages for SLA tracking
- Response time trends for capacity planning
Automatic Processing
Background Operation
- Runs every 5 minutes automatically
- No manual intervention required
- Processes completed time windows only
Data Integrity
- Raw data remains unchanged
- Rollups supplement the original data; they do not replace it
- Failed rollup jobs are retried automatically
15-Minute Rollups
Every 15 minutes, Pulse calculates aggregated statistics from all raw check data in that window.
Time Window Alignment
Fixed Boundaries
- 00:00:00 - 00:14:59
- 00:15:00 - 00:29:59
- 00:30:00 - 00:44:59
- 00:45:00 - 00:59:59
Processing Timing
- Window data processed after completion
- Typically available 5-10 minutes after window end
- No partial window calculations
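The fixed boundaries above can be sketched as a small helper that maps any timestamp to its 15-minute window. This is an illustrative sketch, not Pulse's actual implementation: the names `window_bounds` and `is_complete` are hypothetical, and the window is modeled as a half-open interval, so 00:14:59 belongs to the first window and 00:15:00 starts the next.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def window_bounds(ts: datetime) -> tuple[datetime, datetime]:
    """Return the [start, end) bounds of the 15-minute window containing ts."""
    # Round the minute down to the nearest multiple of 15 and drop the rest.
    start = ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)
    return start, start + WINDOW

def is_complete(window_start: datetime, now: datetime) -> bool:
    """A window qualifies for rollup only once its end has passed."""
    return now >= window_start + WINDOW
```

A probe recorded at 00:29:59 lands in the window starting at 00:15:00, and that window is not eligible for processing until 00:30:00.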
Uptime Percentage Calculation
The most important metric for availability monitoring:
Uptime % = (Successful Probes / Total Probes) × 100
Example Calculation
- 15-minute window with 30-second probe intervals = 30 probes
- 28 successful probes + 2 failed probes = 30 total
- Uptime = (28 / 30) × 100 = 93.3%
Key Characteristics
- Based on actual probe results, not time duration
- Failed probes count regardless of failure reason
- More frequent probing = more accurate percentage
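As a minimal sketch of the formula, the function below computes the uptime percentage from a list of probe outcomes. The function name is hypothetical, and returning `None` for a window with no probes is an assumption; the guide does not say how Pulse handles empty windows.

```python
from typing import Optional

def uptime_percent(results: list[bool]) -> Optional[float]:
    """Uptime % = (successful probes / total probes) * 100.

    results holds one boolean per probe: True for success, False for failure.
    Returns None for an empty window (assumed behavior, not documented).
    """
    if not results:
        return None
    return 100.0 * sum(results) / len(results)

# The example above: 28 successful probes and 2 failed probes.
window = [True] * 28 + [False] * 2
print(round(uptime_percent(window), 1))  # 93.3
```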
Average Response Time
Calculated from successful probes only:
Avg RTT = Sum of Response Times / Successful Probes
Important Notes
- Failed probes are excluded from calculation
- null if no successful probes in window
- Represents actual performance when service is available
Example Calculation
- Successful probes: 12ms, 15ms, 8ms, 22ms
- Average RTT = (12 + 15 + 8 + 22) / 4 = 14.25ms
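The same calculation can be sketched in code. The function name and the `(succeeded, rtt_ms)` tuple shape are assumptions made for illustration; the function averages successful probes only and returns `None` when the window has no successful probes.

```python
from typing import Optional

def average_rtt_ms(probes: list[tuple[bool, Optional[float]]]) -> Optional[float]:
    """Average response time over successful probes only.

    Each probe is a (succeeded, rtt_ms) pair; failed probes carry no RTT.
    """
    rtts = [rtt for ok, rtt in probes if ok and rtt is not None]
    if not rtts:
        return None  # no successful probes in the window
    return sum(rtts) / len(rtts)

probes = [(True, 12.0), (True, 15.0), (False, None), (True, 8.0), (True, 22.0)]
print(average_rtt_ms(probes))  # 14.25
```

The failed probe does not pull the average toward zero or toward a timeout value, so the result reflects performance while the service was reachable.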
Down Events Counting
Tracks service instability by counting status transitions:
Down Events = Number of UP → DOWN transitions
What It Measures
- Service interruptions, not total downtime
- Frequency of failures, not duration
- Indicates flapping or unstable services
Example Scenarios
- Steady UP: 0 down events
- Single outage: 1 down event
- Flapping service: Multiple down events
A high uptime percentage with many down events indicates frequent brief interruptions. This pattern often points to network instability or service flapping.
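Counting UP → DOWN transitions can be sketched as a pass over consecutive status pairs. The function name and the optional `previous` argument are hypothetical; the argument is one possible way to catch a transition that straddles a window boundary, which this guide does not describe.

```python
from typing import Optional

def count_down_events(statuses: list[str], previous: Optional[str] = None) -> int:
    """Count UP -> DOWN transitions in a chronological list of statuses.

    previous is the last status of the preceding window, if known, so that a
    transition at the window boundary is counted (an assumed convention).
    """
    sequence = ([previous] if previous is not None else []) + statuses
    return sum(
        1
        for before, after in zip(sequence, sequence[1:])
        if before == "UP" and after == "DOWN"
    )

print(count_down_events(["UP", "UP", "UP"]))                         # 0
print(count_down_events(["UP", "DOWN", "DOWN", "UP"]))               # 1
print(count_down_events(["UP", "DOWN", "UP", "DOWN", "UP", "DOWN"]))  # 3
```

Only the UP → DOWN edge is counted, so a long outage and a short one each contribute exactly one event.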
Daily Rollups
Each day, Pulse creates summary statistics covering the entire 24-hour period.
Daily Calculation Process
Data Source
- Aggregated from 15-minute rollups (not raw data)
- 96 rollup windows per day (24 hours × 4 windows/hour)
- More efficient than processing thousands of raw checks
Timing
- Calculated shortly after midnight
- Based on local server timezone
- Previous day's data is processed
Daily Metrics
Daily Uptime Percentage
Daily Uptime = Average of all 15-minute uptime percentages
Daily Average Response Time
Daily Avg RTT = Weighted average of 15-minute averages
Daily Down Events
Daily Down Events = Sum of all 15-minute down events
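The three daily formulas can be sketched together. The `Rollup15m` record and its field names are hypothetical, and the guide does not state which weight the response-time average uses; this sketch assumes each window is weighted by its number of successful probes and that windows without an average are skipped.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rollup15m:
    uptime_pct: float
    avg_rtt_ms: Optional[float]  # None when the window had no successful probes
    successful_probes: int       # assumed weight for the RTT average
    down_events: int

def daily_rollup(windows: list[Rollup15m]) -> dict:
    """Combine one day's 15-minute rollups into daily metrics."""
    if not windows:
        raise ValueError("no 15-minute rollups for this day")
    # Daily uptime: plain average of the 15-minute uptime percentages.
    uptime = sum(w.uptime_pct for w in windows) / len(windows)
    # Daily RTT: average of window averages, weighted by successful probes.
    weighted = [
        (w.avg_rtt_ms, w.successful_probes)
        for w in windows
        if w.avg_rtt_ms is not None and w.successful_probes > 0
    ]
    total_weight = sum(n for _, n in weighted)
    avg_rtt = sum(r * n for r, n in weighted) / total_weight if total_weight else None
    # Daily down events: sum of the per-window counts.
    down_events = sum(w.down_events for w in windows)
    return {"uptime_pct": uptime, "avg_rtt_ms": avg_rtt, "down_events": down_events}
```

With two windows of (100%, 10 ms, 30 successes, 0 events) and (50%, 20 ms, 10 successes, 2 events), this yields 75% uptime, a 12.5 ms average, and 2 down events. Weighting keeps a window with few successful probes from dominating the daily response time.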
Use Cases for Daily Rollups
Management Reporting
- Monthly availability summaries
- Service level agreement (SLA) tracking
- Executive dashboards with daily KPIs
Trend Analysis
- Long-term performance patterns
- Seasonal variations in availability
- Capacity planning based on response times
Comparative Analysis
- Week-over-week performance comparison
- Impact assessment of configuration changes
- Historical baseline establishment
Outage Detection System
Pulse automatically identifies and tracks service outages using a flap damping algorithm, which changes an endpoint's status only after consecutive probe results confirm the change.
Flap Damping Logic
Why Flap Damping?
- Prevents false alarms from brief network hiccups
- Focuses on sustained availability issues
- Reduces alert noise and improves accuracy
2/2 Threshold Algorithm
- Outage Start: 2 consecutive failed probes
- Outage End: 2 consecutive successful probes
- Balances responsiveness with stability
Outage Lifecycle
Outage Detection
- First Failure: Probe fails, endpoint status remains UP
- Second Failure: Probe fails again, outage begins
- Outage Active: Status changes to DOWN
- First Success: Probe succeeds, endpoint status still DOWN
- Second Success: Probe succeeds again, outage ends
- Recovery Complete: Status changes back to UP
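The lifecycle above can be sketched as a small state machine. The class and attribute names are hypothetical, and this is an illustration of the 2/2 rule rather than Pulse's actual code.

```python
class FlapDamper:
    """Tracks UP/DOWN status, changing it only after consecutive results."""

    def __init__(self, fail_threshold: int = 2, success_threshold: int = 2):
        self.fail_threshold = fail_threshold
        self.success_threshold = success_threshold
        self.status = "UP"
        self._fails = 0      # consecutive failed probes
        self._successes = 0  # consecutive successful probes

    def observe(self, success: bool) -> str:
        """Feed one probe result and return the damped status."""
        if success:
            self._successes += 1
            self._fails = 0
            if self.status == "DOWN" and self._successes >= self.success_threshold:
                self.status = "UP"
        else:
            self._fails += 1
            self._successes = 0
            if self.status == "UP" and self._fails >= self.fail_threshold:
                self.status = "DOWN"
        return self.status

damper = FlapDamper()
print([damper.observe(ok) for ok in [False, False, True, True]])
# ['UP', 'DOWN', 'DOWN', 'UP']
```

A single failed probe followed by a success never changes the status, which is how isolated network hiccups are filtered out.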
Timestamp Recording
Start Timestamp
- Time of the first failed probe
- Captures actual beginning of service interruption
- Used for outage duration calculation
End Timestamp
- Time of the first successful probe after failure
- Indicates when service actually recovered
- May differ from status change time due to damping
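To illustrate how the recorded timestamps differ from the status change times, the sketch below derives outage records from a chronological probe stream. The `Outage` record and `track_outages` function are hypothetical. It assumes the end timestamp is the first probe of the successful run that confirms recovery, which is one reading of the rule above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Outage:
    started_at: datetime
    ended_at: Optional[datetime] = None  # None while the outage is still open

def track_outages(probes: list[tuple[datetime, bool]]) -> list[Outage]:
    """Build outage records from (timestamp, succeeded) pairs in time order."""
    outages: list[Outage] = []
    status = "UP"
    fails = successes = 0
    first_fail: Optional[datetime] = None
    first_success: Optional[datetime] = None
    for ts, ok in probes:
        if ok:
            if successes == 0:
                first_success = ts  # start of the current run of successes
            successes += 1
            fails = 0
            if status == "DOWN" and successes >= 2:
                outages[-1].ended_at = first_success
                status = "UP"
        else:
            if fails == 0:
                first_fail = ts  # start of the current run of failures
            fails += 1
            successes = 0
            if status == "UP" and fails >= 2:
                outages.append(Outage(started_at=first_fail))
                status = "DOWN"
    return outages
```

The record is created when the second failure confirms the outage, but it is backdated to the first failure. The same applies on recovery, so the stored interval is not padded by the confirmation delay on the recovery side.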
Outage Duration Calculation
Duration = End Timestamp - Start Timestamp
Precision
- Calculated in seconds for accuracy
- Includes the failure confirmation period (the time between the first and second failed probes)
- Represents total time service was unavailable
Example Timeline
- 14:30:15 - First failure (outage starts)
- 14:30:45 - Second failure (status → DOWN)
- 14:33:20 - First success (service recovered)
- 14:33:50 - Second success (status → UP)
- Outage Duration: 3 minutes 5 seconds
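The arithmetic in this timeline can be checked with a short sketch. The function names are hypothetical and the date is arbitrary; only the times come from the example.

```python
from datetime import datetime

def outage_duration_seconds(start: datetime, end: datetime) -> int:
    """Duration = End Timestamp - Start Timestamp, in whole seconds."""
    return int((end - start).total_seconds())

def format_duration(seconds: int) -> str:
    """Render a duration in seconds as minutes and seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} minutes {secs} seconds"

start = datetime(2024, 1, 1, 14, 30, 15)  # first failed probe
end = datetime(2024, 1, 1, 14, 33, 20)    # first successful probe
seconds = outage_duration_seconds(start, end)
print(seconds, format_duration(seconds))  # 185 3 minutes 5 seconds
```

The status change times (14:30:45 and 14:33:50) play no part in the calculation; only the first failure and the first success do.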
Outage Classification
Sustained Outages
- Clear start and end times
- Duration measured in minutes or hours
- Typically infrastructure or service failures
Flapping Detection
- Multiple short outages in succession
- Pattern indicates instability
- May require different response than single outage
An endpoint's current status (UP/DOWN) reflects the flap-damped state, while outage records track the actual service interruption periods for historical analysis.
Analytics and Reporting
Key Performance Indicators (KPIs)
Availability Metrics
- Monthly uptime percentage
- Mean time between failures (MTBF)
- Service level agreement compliance
Performance Metrics
- Average response time trends
- 95th percentile response times
- Performance degradation detection
Reliability Metrics
- Outage frequency and duration
- Recovery time objectives (RTO)
- Planned vs. unplanned downtime
Trend Analysis Techniques
Comparative Analysis
- Week-over-week availability comparison
- Before/after performance analysis
- Seasonal pattern identification
Threshold Monitoring
- Response time degradation alerts
- Availability below SLA thresholds
- Unusual down event frequency
Capacity Planning
- Response time growth trends
- Peak usage period identification
- Infrastructure scaling decisions
Reporting Best Practices
Time Period Selection
- Use 15-minute rollups for daily/weekly reports
- Use daily rollups for monthly/quarterly reports
- Raw data only for incident analysis
Metric Interpretation
- High uptime with many down events = instability
- Increasing response times = capacity concerns
- Consistent patterns = predictable behavior
Data Export and Integration
- CSV export for external analysis
- API access for automated reporting
- Database queries for custom analytics
Rollup Data Retention
Storage Policies
15-Minute Rollups
- Unlimited retention - never deleted
- Minimal storage impact (few KB per endpoint per day)
- Enables long-term trending and analysis
Daily Rollups
- Unlimited retention - permanent storage
- Extremely compact (bytes per endpoint per day)
- Perfect for multi-year historical analysis
Raw Data
- 60-day retention - automatically pruned
- Most detailed but largest storage requirement
- Available for recent incident analysis
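The 60-day raw data window can be sketched as a cutoff check. The names below are hypothetical, and whether a record exactly 60 days old is kept or pruned is an assumption; this sketch keeps it.

```python
from datetime import datetime, timedelta

RAW_RETENTION = timedelta(days=60)

def prune_cutoff(now: datetime) -> datetime:
    """Raw checks recorded before this instant fall outside the retention window."""
    return now - RAW_RETENTION

def is_prunable(checked_at: datetime, now: datetime) -> bool:
    """True if a raw check is older than the retention window."""
    return checked_at < prune_cutoff(now)
```

Rollups are not subject to this check, which is why incident analysis older than 60 days has to rely on 15-minute or daily rollups.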
Performance Optimization
Query Efficiency
- Rollups dramatically improve query speed
- Historical dashboards load faster
- Large date range analysis becomes practical
Storage Efficiency
- 96 rollup records vs thousands of raw checks per day
- Minimal database growth for long-term storage
- Cost-effective for extended retention periods
Monitoring Your Monitoring
System Health Indicators
Rollup Processing
- Check for failed rollup jobs in logs
- Verify rollup data is current
- Monitor database storage growth
Data Quality
- Consistent probe execution intervals
- Reasonable response time values
- Proper outage detection sensitivity
Troubleshooting Rollups
Missing Rollup Data
- Check background service operation
- Review rollup job logs for errors
- Verify database connectivity and permissions
Incorrect Calculations
- Validate raw data quality
- Check timezone configuration
- Review probe configuration consistency
Next Steps
- Data Model: Understand the underlying data structure
- Live Board & History: Use the web interface for analysis
- Troubleshooting: Resolve rollup and outage tracking issues
- API Reference: Access rollup data programmatically