Our journey began with a dataset that kept growing. Our daily metric count ballooned from 500,000 to 22 million within 15 months, a 44-fold increase, and the total passed 4.7 billion records, while the number of customer servers grew fivefold to a few thousand.
The database currently stands at 300 GB and is growing rapidly, and keeping it fast costs more and more. We had to keep adding server resources because an ever larger number of rows had to be scanned for metric aggregation, for selecting data for analysis and training, and for user charts.
Every hour, we aggregate the metrics from the past hour, calculating minimum, average, and maximum values. As the data grew, a single aggregation run came to take 45 minutes of each hour, which caused problems when generating aggregated data over longer periods.
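To make the hourly job concrete, here is a minimal sketch of that kind of aggregation in Python. It is an illustration only, not our production code: the row layout `(server_id, metric, timestamp, value)` and the function name `aggregate_last_hour` are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_last_hour(rows, now):
    """Return min/avg/max per (server_id, metric) for the last full hour.

    `rows` is an iterable of (server_id, metric, timestamp, value) tuples.
    This layout is hypothetical and chosen only for the example.
    """
    # The window is the most recent complete hour: [start, end).
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)

    # Collect the values that fall inside the window, grouped by key.
    groups = defaultdict(list)
    for server_id, metric, ts, value in rows:
        if start <= ts < end:
            groups[(server_id, metric)].append(value)

    # Reduce each group to the three aggregates.
    return {
        key: {
            "min": min(values),
            "avg": sum(values) / len(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }
```

Calling `aggregate_last_hour(rows, datetime.now())` returns one entry per server and metric. Note that this naive version walks every row it is given, so its cost grows with the number of rows examined, which is the same effect that pushed our aggregation time up as the table grew.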
As the number of metrics displayed on the Releem Dashboard grew, metric selection slowed down and disk load increased because not all of the data could fit in RAM. Slower metric selection, in turn, degraded the user experience of the Releem Dashboard.