This post discusses the details of designing a monitoring system, a question that comes up in many interviews.
How to Collect Metrics — Pull or Push
There are two models for collecting data: push and pull. For a monitoring system, I would always go with the pull model, for the following reasons:
- Scalability Concern. Our infrastructure will keep growing, and we may have hundreds or thousands of services in the coming years. Service usage and the user base will grow too. With the push model, all of these services keep hitting our monitoring service. If a service processes 1M requests per second and pushes metrics to the monitoring service on every request, we will run into scalability problems again and again as we grow. With the pull model, the monitoring service decides how often to collect, so its load depends on the number of services and the collection interval rather than on request traffic. So instead of being called with metrics, I would prefer to actively pull the data from the services.
- Automatic Upness Monitoring. By pulling the data proactively, we learn directly whether a service is alive. For example, if a service is unreachable, the next pull fails and we are aware of it right away.
- Easier Horizontal Monitoring. Suppose we have two independent systems, A and B, and one day we need to monitor some service in system B from system A. We can pull metrics from system B directly, with no need to configure system B to push to system A.
- Easier for Testing. We can simply spin up a testing environment and copy the configuration from production, then you…
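To make the scalability and upness points above concrete, here is a minimal sketch of the pull model in Python. It is an illustration under assumptions, not a production design: the `RequestCounter` class, the "name value" text format, the `scrape_once` function, and the example URLs are all hypothetical names I made up for this post. The idea is that a service only bumps an in-memory counter per request, and the monitoring side fetches the current value on its own schedule, recording `up = 0` whenever a fetch fails.

```python
from typing import Callable, Dict
import urllib.request


class RequestCounter:
    """In-process counter: handling a request only bumps an integer."""

    def __init__(self) -> None:
        self.total = 0

    def handle_request(self) -> None:
        self.total += 1  # no call to the monitoring service here

    def render(self) -> str:
        # Hypothetical plain-text format: one "name value" pair per line.
        return f"requests_total {self.total}\n"


def http_fetch(url: str, timeout: float = 2.0) -> str:
    """Fetch a metrics page over HTTP (one possible transport)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse(text: str) -> Dict[str, float]:
    """Parse 'name value' lines into a dict of samples."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        if line.strip():
            name, value = line.split()
            samples[name] = float(value)
    return samples


def scrape_once(
    targets: Dict[str, str],
    fetch: Callable[[str], str] = http_fetch,
) -> Dict[str, Dict[str, float]]:
    """Pull metrics from every target once; record up=0 when a pull fails."""
    results: Dict[str, Dict[str, float]] = {}
    for name, url in targets.items():
        try:
            samples = parse(fetch(url))
            samples["up"] = 1.0
        except (OSError, ValueError):
            # Unreachable or malformed target: the failed pull is the signal.
            samples = {"up": 0.0}
        results[name] = samples
    return results


if __name__ == "__main__":
    checkout = RequestCounter()
    for _ in range(1000):
        checkout.handle_request()  # 1000 requests, zero pushes

    def fake_fetch(url: str) -> str:
        # Stand-in transport so the demo runs without a network.
        if url == "http://checkout.example/metrics":
            return checkout.render()
        raise ConnectionError("target unreachable")

    print(scrape_once(
        {"checkout": "http://checkout.example/metrics",
         "search": "http://search.example/metrics"},
        fetch=fake_fetch,
    ))
```

In this sketch the cost on the monitoring side is one fetch per target per collection round, no matter how many requests each service handled in between. The `fetch` parameter is injectable, so the same target list can be pointed at a test environment or a fake transport without touching the services themselves.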