1. Monitor your infrastructure. First of all, you need to know what's happening with your website. If you're experienced with Prometheus/Grafana, use them; if you're not, that's fine, because a monitoring service such as Datadog can be set up really quickly. If even that is too hard, use Pingdom or Site24x7, at least to check that your website is still available. You control what you measure, and the most important thing is this: if you don't know what's happening inside your system and exactly where it's happening, you can't fix it.
Several different things can go wrong when you get hit by traffic:
- You're bound by CPU resources
- You're bound by RAM limits
- You're bound by your HDD/storage performance
- You're bound by the bandwidth on your cloud instance/cluster/server
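As a rough illustration of the list above, the sketch below takes one utilization reading per resource and reports which ones are over a threshold. The metric names, the 0.8 threshold, and the sample numbers are assumptions made up for the example; real readings would come from whatever monitoring tool you use.

```python
def find_bottlenecks(usage, threshold=0.8):
    """Return the resources at or above the threshold, worst first.

    `usage` maps a resource name to the fraction of its limit in use.
    """
    hot = [(name, value) for name, value in usage.items() if value >= threshold]
    hot.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in hot]


# Hypothetical snapshot: CPU and RAM look fine, storage and network do not.
sample = {"cpu": 0.45, "ram": 0.62, "disk_iops": 0.97, "bandwidth": 0.83}
bottlenecks = find_bottlenecks(sample)
print(bottlenecks)  # ['disk_iops', 'bandwidth']
```

A dashboard that shows only CPU and RAM would report this sample as healthy, which is why all four resources need a reading.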
2. Prepare to scale at 60-80% of maximum load. Whenever you see that you've reached 80% of your resource limits, you should start scaling. If you wait until you reach 100%, you'll be down, and it will take time to recover (not to mention it will be very stressful). You'll have to act fast because you'll be losing users, and you make more mistakes when you're in a hurry. When you reach 80% of your capacity, scale until utilization drops to 40%, then repeat as necessary.
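The 80%-to-40% rule can be turned into simple arithmetic. The sketch below assumes load is spread evenly across identical instances and that capacity grows linearly with instance count, which is an idealization; the function name and parameters are hypothetical.

```python
def instances_needed(current_instances, util_pct, target_pct=40):
    """Instances required to bring average utilization down to target_pct.

    Percentages are integers so the ceiling division below is exact.
    Never suggests scaling down; that decision is left to the operator.
    """
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be between 1 and 100")
    total_load = current_instances * util_pct   # in "instance-percent"
    needed = -(-total_load // target_pct)       # integer ceiling division
    return max(current_instances, needed)


print(instances_needed(4, 80))   # 8: doubling takes 80% down to 40%
print(instances_needed(5, 85))   # 11
```

In other words, going from 80% to 40% means roughly doubling capacity each time the alert fires.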
3. Keep an eye on HDD performance and bandwidth limits, not only CPU and RAM. The problem is harder to discover when your performance is capped by IOPS (input/output operations per second) or network bandwidth limits.
4. Watch your database performance, especially when you're using a cloud database. RDS, Cloud SQL, MongoDB Atlas and other services are managed by the cloud provider, but they have their own limits, and you should watch them and scale when necessary.
5. When your DB hits its CPU limit, check your indexes; that might really help.
Adding indexes can dramatically reduce CPU load. Say you're using 90% of your DB CPU. You might want to scale the server to 2x CPU to handle 2x load, but if most of your queries are unindexed, adding indexes might reduce your CPU load by 10x, so it's worth investigating.
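One way to see whether a query is using an index is to ask the database for its query plan. The sketch below uses SQLite from Python's standard library purely as a stand-in; the table, column, and index names are made up, and other databases have their own `EXPLAIN` syntax and output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def query_plan(sql, params=()):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[-1] for row in rows)  # last column is the detail text


lookup = "SELECT name FROM users WHERE email = ?"
plan_before = query_plan(lookup, ("user500@example.com",))

conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup, ("user500@example.com",))

# Exact wording varies between SQLite versions: the first plan is a full
# table scan, the second is a search that names idx_users_email.
print("before:", plan_before)
print("after: ", plan_after)
```

A scan reads every row for every lookup, so its CPU cost grows with the table; the indexed search does not, which is where the large savings come from.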
6. Keep an eye on your cloud bills. It's easy to forget about your bills when you're in a rush, so set up budget alerts in your billing system. Bandwidth is especially pricey: unless you can move your content to a CDN or to dedicated hosting services like 100tb.com or Leaseweb, the prices stay high.
7. Avoid state in your app. Though it's possible to scale CPU and RAM resources in the cloud, there is still a limit that you can't overcome. At that point, you’ll want to scale horizontally by adding new instances of the same app—but your app should be ready for it. When you have multiple instances of the same app, your users' requests are distributed across multiple servers, so you can't store the data on a local disk.
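The toy sketch below shows why local state breaks once there are several instances. `AppInstance` and its methods are hypothetical, and the plain dict named `store` merely stands in for an external service (a database, a cache server, object storage) that every instance can reach.

```python
class AppInstance:
    """A toy app server with both local and shared storage."""

    def __init__(self, shared_store):
        self.local_state = {}             # visible to this instance only
        self.shared_store = shared_store  # visible to every instance

    def save_cart_locally(self, user, item):
        self.local_state.setdefault(user, []).append(item)

    def save_cart_shared(self, user, item):
        self.shared_store.setdefault(user, []).append(item)

    def cart_local(self, user):
        return self.local_state.get(user, [])

    def cart_shared(self, user):
        return self.shared_store.get(user, [])


store = {}
a, b = AppInstance(store), AppInstance(store)

# First request lands on instance a.
a.save_cart_locally("alice", "book")
a.save_cart_shared("alice", "book")

# The load balancer sends the next request to instance b.
print(b.cart_local("alice"))   # []
print(b.cart_shared("alice"))  # ['book']
```

The same reasoning applies to uploaded files and sessions kept on a local disk: anything one instance writes must be readable by all the others.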
8. Consider moving to the cloud if you're on dedicated hosting. You can't easily scale when you're using dedicated hosting because adding servers takes time. It could take anywhere from a couple of hours to a couple of days to get new servers available, and usually you pay by the month, not by the hour. You don't want to wait hours or days if you're already down. It's much easier to scale in the cloud.
9. Tune your infrastructure. There are some basic things that are disabled by default that you might want to configure in your OS, network layer, app management, and programming language manager; they might reduce your resource usage dramatically. Google for “your-tech-stack tuning” and follow the basic recommendations.
10. Be ready to start a minimal/cached version. Despite your best efforts, if you get a 100x spike in traffic, you'll be down. It takes time to scale up, so be ready to serve a static cached version. You might use the CloudFront or Cloudflare cache for this, or your CDN cache, nginx cache, or anything else. Just make sure that you're able to do it when you need to.
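The fallback idea can be sketched in a few lines: keep the last good copy of each page and serve it when the live backend fails. This is only an in-process illustration with hypothetical names; in practice the CDN or nginx cache mentioned above would play the role of `cache`.

```python
def serve(path, render, cache):
    """Return (page, source). Falls back to the last good copy on failure."""
    try:
        page = render(path)
    except Exception:
        if path in cache:
            return cache[path], "cached"
        raise  # nothing cached yet, so the error has to surface
    cache[path] = page  # remember the last successful render
    return page, "live"


def healthy(path):
    return f"<h1>{path}</h1>"


def overloaded(path):
    raise TimeoutError("backend overloaded")


cache = {}
first = serve("/home", healthy, cache)
second = serve("/home", overloaded, cache)
print(first)   # ('<h1>/home</h1>', 'live')
print(second)  # ('<h1>/home</h1>', 'cached')
```

Note that the fallback only works for pages that were rendered at least once before the spike, so it pays to warm the cache for your most visited pages ahead of time.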