Today we're going to go over a few major issues that occurred in the last few weeks. This topic will be fairly short and just detail the issues and how we resolved them.
In all cases, all account data has remained safe and secure. This only affected the availability of our website.
15 July 2018 at around 8:40 AM (UTC-4)
For roughly two hours, users were greeted by our website's version of a blue screen of death: i.imgur.com/8GhYbWQ.png
This was completely unexpected, and we only became aware of it once we woke up that morning to dozens of notifications about our service being down. After investigating, we found that our database had unexpectedly run out of memory. After researching why this had happened, we concluded that we needed to add more memory to the server.
This issue made our website, multiplayer, the update checker, documentation and Cards Against Lucas unavailable. We're extremely sorry about this.
In response to this issue:
- We introduced a new process on our servers to automatically restart our database software if anything goes wrong. It also publicly logs the event and notifies us when this happens, so we can become aware of it and investigate more quickly.
- We increased the amount of memory our server is able to use.
- We fixed our status page to handle the entire database being down.
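For readers curious what an automatic-restart process like the one described above can look like, here is a minimal sketch in Python. It is illustrative only and is not the code running on the servers. The names `watchdog_tick`, `run_watchdog`, `is_healthy`, `restart` and `notify` are hypothetical, and the 30-second interval is an assumption.

```python
import time
from typing import Callable, Optional


def watchdog_tick(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Run one health check; notify and restart if the check fails.

    Returns True when a restart was triggered.
    """
    if is_healthy():
        return False
    # Log/notify first so the event is recorded even if the restart hangs.
    notify("database health check failed; restarting")
    restart()
    return True


def run_watchdog(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    interval_seconds: float = 30.0,      # assumed polling interval
    max_ticks: Optional[int] = None,     # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the health check repeatedly and return the number of restarts."""
    restarts = 0
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        if watchdog_tick(is_healthy, restart, notify):
            restarts += 1
        ticks += 1
        sleep(interval_seconds)
    return restarts
```

In a real deployment, `is_healthy` might open a database connection, `restart` might call the service manager, and `notify` might write to a public log and send an alert. Passing them in as functions keeps the loop itself easy to test.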
18 July 2018 at around 4:39 PM (UTC-4)
I'd like to explain why our forum was unavailable from 4:39 PM (UTC-4) on 18 July 2018 to 1:05 AM (UTC-4) on 19 July 2018. The downtime was unplanned, but we took the forum down on purpose after a new user discovered a problem. The last 30 or so members to register had trouble confirming their accounts, which was later brought to our attention.
The issue was concerning enough that we felt the need to take the website down temporarily while we worked on it. It also involved a third-party provider, so we had to work with them to solve it.
Affected users can request a new code using the verification page linked at the top of the page.
In response to this issue:
- We worked with our third-party provider to identify and resolve the issue. This took longer than anticipated because of misinformation provided by them. In the end, we were able to bring the site back up.
- We fixed an issue with our website where users weren't able to modify their account settings before they were verified.
- We added in safety checks to our code.
- We now prevent users from requesting another confirmation email if they have already requested one within the last 2 hours.
- We're now discussing how to handle users who have not confirmed their accounts after a certain amount of time.
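The 2-hour limit on confirmation emails mentioned above is a simple per-user cooldown. A minimal sketch of the idea in Python follows. It is an illustration, not the site's actual code: the class `ConfirmationThrottle`, the method `try_send`, and the in-memory dictionary are all hypothetical, and a real site would store the timestamp in its database.

```python
from datetime import datetime, timedelta
from typing import Dict

# Cooldown taken from the post: one confirmation email per 2 hours.
RESEND_COOLDOWN = timedelta(hours=2)


class ConfirmationThrottle:
    """Tracks when each user last requested a confirmation email."""

    def __init__(self, cooldown: timedelta = RESEND_COOLDOWN) -> None:
        self._cooldown = cooldown
        self._last_sent: Dict[str, datetime] = {}

    def try_send(self, user_id: str, now: datetime) -> bool:
        """Return True and record the request if the user may get an email.

        Returns False if the user's previous request was less than the
        cooldown ago; in that case the stored timestamp is left unchanged.
        """
        last = self._last_sent.get(user_id)
        if last is not None and now - last < self._cooldown:
            return False
        self._last_sent[user_id] = now
        return True
```

A caller would only send the email when `try_send` returns True. Because rejected requests do not reset the timer, a user who retries repeatedly is not locked out for longer than the original 2 hours.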
Our current website is built on code we wrote between 5 and 8 years ago, which is why we've had difficulty supporting the thousands of people who use this site every week. We're working on creating a new website; however, supporting both websites simultaneously has proven difficult, which is why it's taking longer than anticipated.
We know we need to do better and try to keep our service stable, but we hope you understand why we've had these downtimes.
Sincerely,
Jake Andreoli