Performance issue

Incident Report for Teamleader

Postmortem

On Friday May 18th, Teamleader was unavailable for about 30 minutes. We are very sorry this happened, and we know it may have caused you trouble. That’s why we’d like to explain exactly what happened and what we will do to prevent it in the future.

What happened?

A setting in our Amazon Web Services configuration disallows new connections once the number of open connections reaches a certain limit. Once that limit is reached, the server no longer returns any response. When this happened, Amazon was no longer able to perform health checks on our machines. As a result, the machines were perceived as unhealthy and no more traffic could reach them.

As a result, all requests (a click, an action, a new page loading…) overloaded the queue, which made Teamleader unavailable for a time.
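To illustrate the mechanism described above, here is a minimal sketch in Python. It is not our actual configuration: the `Server` class, the `health_check` function, and the cap of 100 connections are hypothetical, chosen only to show why a health check fails once a hard connection cap is reached.

```python
from dataclasses import dataclass


@dataclass
class Server:
    max_connections: int       # hypothetical hard cap on open connections
    open_connections: int = 0

    def try_connect(self) -> bool:
        """Accept a new connection only while below the cap."""
        if self.open_connections >= self.max_connections:
            return False       # the connection is refused, no answer is returned
        self.open_connections += 1
        return True


def health_check(server: Server) -> bool:
    """A health check is just another connection; release it if it succeeds."""
    if not server.try_connect():
        return False
    server.open_connections -= 1
    return True


server = Server(max_connections=100)
healthy_before = health_check(server)   # cap not reached, check passes

# Fill every available slot with long-lived connections.
for _ in range(100):
    server.try_connect()

healthy_after = health_check(server)    # cap reached, check is refused
print(healthy_before, healthy_after)
```

In this sketch the machine itself is still running, but because the health check cannot get a connection, it looks unhealthy from the outside.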

How did this happen?

We were performing a migration at the time of the downtime. The connections to our database remained open for longer than normal. At the same time, one of our endpoints received an unusually high number of requests, which caused us to reach the limit of the setting described above. This was an error on our side: we were careless with this network setting, which is unacceptable.
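The combination of longer-lived connections and more requests can be sketched with Little's law, which says that the average number of connections open at once equals the request rate multiplied by the time each connection stays open. All figures below (the limit, the request rates, and the durations) are illustrative assumptions, not measurements from this incident.

```python
def concurrent_connections(requests_per_second: float, seconds_held: float) -> float:
    """Little's law: average open connections = arrival rate * time held."""
    return requests_per_second * seconds_held


LIMIT = 1000  # hypothetical connection limit

# Hypothetical normal load: short-lived connections, moderate traffic.
normal = concurrent_connections(requests_per_second=200, seconds_held=0.5)

# Hypothetical incident load: connections held longer AND more requests.
incident = concurrent_connections(requests_per_second=600, seconds_held=2.0)

print(normal, normal >= LIMIT)
print(incident, incident >= LIMIT)
```

Neither factor alone would cross the limit in this example: tripling traffic at the normal duration gives 300 connections, and quadrupling the duration at normal traffic gives 400. Together they exceed it.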

How did we solve this problem and how will we prevent it?

Immediately after we noticed something was wrong, our team contacted Amazon Web Services’ support team to fix the issue as fast as possible.

We have switched off the setting that caused the trouble. There is no longer a hard limit on requests, only a soft limit, which copes better when we receive a higher number of requests. On the advice of Amazon’s support, we reset two settings to their defaults. We have also implemented extra monitoring to help us understand this problem better and prevent it before it impacts our customers.
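As a rough sketch of the idea behind a soft limit combined with monitoring, the Python example below classifies a connection count instead of refusing connections. This report does not describe how the soft limit or the monitoring is implemented, so the `classify` function, the limit of 1000, and the 80% warning ratio are all assumptions made for illustration.

```python
def classify(open_connections: int, soft_limit: int, warn_ratio: float = 0.8) -> str:
    """Report a status for the current load; nothing is refused here."""
    if open_connections >= soft_limit:
        return "over_soft_limit"   # still served, but worth an alert
    if open_connections >= warn_ratio * soft_limit:
        return "warning"           # approaching the limit
    return "ok"


# Hypothetical readings against a hypothetical soft limit of 1000.
statuses = [classify(n, soft_limit=1000) for n in (500, 850, 1200)]
print(statuses)
```

The point of the warning band is to raise a signal while there is still room to react, rather than finding out only when requests start failing.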

We thank you for your continued trust in Teamleader. We know you often count on us for your daily business, so we’re very sorry for any trouble this may have caused you.

Kind regards,
Teamleader

Posted May 22, 2018 - 16:40 CEST

Resolved

The performance issue has been completely resolved after some careful monitoring. More details on this issue can be found in our postmortem.
Again, our sincerest apologies for this problem.

Posted May 22, 2018 - 16:39 CEST

Update

We are continuing to monitor for any further issues.

Posted May 18, 2018 - 18:10 CEST

Monitoring

Teamleader is currently stable again. However, we are monitoring the situation very closely. We will keep you updated on any further progress. Thank you for your understanding.

Posted May 18, 2018 - 12:56 CEST

Update

We are continuing to work on a fix for this issue.

Posted May 18, 2018 - 12:44 CEST

Update

We're currently in contact with our server provider. We've identified the network connectivity issues and are now rebooting and adding new servers to get Teamleader up and running as fast as possible.

Posted May 18, 2018 - 12:35 CEST

Identified

We are still noticing some issues for certain accounts. Our team is working hard on fixing the issue.

Posted May 18, 2018 - 12:25 CEST

Monitoring

Teamleader is accessible again. We're closely monitoring the situation and are looking into the issue further.

Posted May 18, 2018 - 12:50 CEST

Identified

The issue has been identified and a fix is being implemented. Our apologies for the inconvenience.

Posted May 18, 2018 - 12:17 CEST

This incident affected: Teamleader Focus Web App, Cloud Platforms (Ticketcloud, Projectcloud, Cloudsign), and API and integration services (API endpoints).