Payment gateway outage: my lesson learned on a ticket sales platform

December 19, 2024

Do you know the biggest thing product managers of ticket sales platforms need to pay attention to when launching an event?

LOAD!

Similar to e-commerce platforms approaching Black Friday, Boxing Day, or any other huge discount day, ticket sales platforms may crash due to a sudden spike in traffic.

However, it is not only about scaling up the servers to accommodate more users on a website. You will also need to consider the payment gateway's ability to handle transaction load and its fraud alert system.

To understand this, let's look at the big lesson I learned 2 years ago at Cadillac Fairview while selling Santa Visits tickets.

Background

Cadillac Fairview had been switching vendors to find a ticket sales platform that could satisfy all its unique requirements for this Santa Visits event. It is a flagship annual event.

All the past vendors had their strengths and weaknesses, yet none seemed able to provide a good experience for both the customers and the admins.

Over the years, it became a Super Bowl-sized project with a lot of eyes on it. The pressure to find a good platform was intense.

Long story short, my director brilliantly found a potential vendor, and we had a successful Proof-of-Concept. We decided to go ahead with this vendor, and we still use it today.

Architecture Context

This new vendor (let's call it "A") is a ticket sales platform that offers code customization. That is not the only thing that sets A apart from other platforms.

A also does not have its own payment gateway. Rather, it uses a 3rd-party payment gateway in an iframe.

This means that from the customer's point of view, they only stay on the ticket sales platform to complete the reservation. However, transactions are processed by the 3rd-party payment gateway. In fact, platform A does not capture any payment details as they are entered in an iframe.

And it was the first year we had this setup.

Traffic volume forecast

It is almost second nature for dev teams of platforms and other digital products to prepare for the traffic load of launch days.

Think about tickets to Taylor Swift: The Eras Tour concerts, or flight discounts. Without any measures to control traffic, the number of people accessing these sites at once is huge, which could bring down the servers. This first metric is called the number of concurrent users.

Along with the number of concurrent users, we also need to forecast the number of page views and events because every hit is a call to a server or some API.

On a side note, companies can employ different methods to regulate the traffic. For example, Ticketmaster implemented a lottery system for the Taylor Swift concerts and only invited a random pool of users who had expressed interest to purchase the tickets. Airlines can use solutions like Queue-it to place users in a queue as they enter the website to purchase discounted flights.
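To make the queueing idea concrete, here is a minimal sketch of a first-come, first-served waiting room. It is only an illustration under my own assumptions: the `WaitingRoom` class and its `max_active` limit are hypothetical names, and this is not how Queue-it or Ticketmaster actually implement their products.

```python
from collections import deque
from typing import Optional


class WaitingRoom:
    """Hypothetical FIFO waiting room: at most `max_active` users shop at once."""

    def __init__(self, max_active: int) -> None:
        self.max_active = max_active
        self.active = set()      # users currently allowed on the site
        self.waiting = deque()   # users held in the queue, oldest first

    def arrive(self, user_id: str) -> bool:
        """Return True if the user is admitted right away, False if queued."""
        if len(self.active) < self.max_active:
            self.active.add(user_id)
            return True
        self.waiting.append(user_id)
        return False

    def leave(self, user_id: str) -> Optional[str]:
        """Free the user's slot and admit the next queued user, if any."""
        self.active.discard(user_id)
        if self.waiting and len(self.active) < self.max_active:
            next_user = self.waiting.popleft()
            self.active.add(next_user)
            return next_user
        return None


# Made-up example: only 2 users may be on the site at the same time.
room = WaitingRoom(max_active=2)
admitted = [room.arrive(user) for user in ["ana", "ben", "cara"]]
next_in = room.leave("ana")  # "ana" checks out, so the oldest queued user enters
```

The point is simply that the site, not the crowd, decides how many people are active at a time, which caps the load on everything downstream.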

I forecasted all these metrics based on previous years' data. However, to be really safe, we also added a buffer by multiplying all the numbers by 5.
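As a rough sketch of that buffering step, the calculation looks something like this. The baseline numbers below are made up for illustration and are not the real Santa Visits figures; `TrafficForecast` and `apply_buffer` are hypothetical names, and only the factor of 5 comes from what we actually did.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficForecast:
    concurrent_users: int
    page_views: int
    events: int


def apply_buffer(baseline: TrafficForecast, factor: int = 5) -> TrafficForecast:
    """Multiply every forecasted metric by the same safety factor."""
    return TrafficForecast(
        concurrent_users=baseline.concurrent_users * factor,
        page_views=baseline.page_views * factor,
        events=baseline.events * factor,
    )


# Hypothetical peak-hour baseline derived from previous years' data.
baseline = TrafficForecast(concurrent_users=2_000, page_views=150_000, events=400_000)

# The numbers the platform should be load-tested against.
planned = apply_buffer(baseline)
```

Notice that nothing in this forecast describes payment transactions, which is exactly the gap that caught us.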

Things were great, at least on platform A.

So, what went wrong?

I made a huge assumption: since the bank and the payment gateway that we partnered with process millions of transactions on a daily basis, payments would be the least of our concerns.

Payment Gateway Outage

Out of all the properties that Cadillac Fairview owned, we had the event set up at 16 commercial properties. And the business decision was to release the tickets for all 16 properties at the same time, to create a big-bang hype effect.

So, we all sat in a war room at 10 AM. The Email Marketing team sent newsletters to the subscribers of the 16 properties, who were surprisingly "hungry" for this type of ticket. We were cautious about the load.

Half an hour went by without an issue as we monitored Google Analytics, the payment gateway portal, ticket sales, and the email inbox through which customers would contact us.

Then, suddenly, transactions started getting declined intermittently, which soon became a total outage.

Root causes

There were 2 major factors:

Lessons Learned

Since then, whenever we launch an event, regardless of its size, I always do 2 things:

  1. Forecast the max number of transactions per second along with the load
  2. Inform both the bank and the payment gateway of the date and time of launch, event duration, number of expected transactions, order values, and max number of transactions per second.
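For the first item, a back-of-the-envelope estimate can look like the sketch below. It assumes you can guess what share of all transactions will land in the busiest window right after launch. All the numbers are hypothetical, and `estimate_max_tps` and `LaunchBriefing` are names I made up to show the kind of information worth sending to the bank and the payment gateway.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LaunchBriefing:
    """Hypothetical summary to share with the bank and the payment gateway."""
    launch_at: datetime
    duration: timedelta
    expected_transactions: int
    average_order_value: float
    max_tps: int


def estimate_max_tps(expected_transactions: int,
                     peak_share_percent: int,
                     peak_window_seconds: int) -> int:
    """Estimate peak transactions per second, rounded up.

    Assumes `peak_share_percent` of all transactions arrive, evenly spread,
    within the busiest `peak_window_seconds` of the launch.
    """
    if not 0 < peak_share_percent <= 100:
        raise ValueError("peak_share_percent must be between 1 and 100")
    if peak_window_seconds <= 0:
        raise ValueError("peak_window_seconds must be positive")
    peak_transactions = expected_transactions * peak_share_percent / 100
    return math.ceil(peak_transactions / peak_window_seconds)


# Made-up example: 25,000 transactions, 40% of them in the first 5 minutes.
max_tps = estimate_max_tps(25_000, peak_share_percent=40, peak_window_seconds=300)

briefing = LaunchBriefing(
    launch_at=datetime(2025, 11, 3, 10, 0),   # hypothetical launch date and time
    duration=timedelta(days=45),              # hypothetical event duration
    expected_transactions=25_000,
    average_order_value=45.0,                 # hypothetical order value
    max_tps=max_tps,
)
```

Real traffic is burstier than an even spread, so I would treat the result as a floor and round it up generously before sharing it.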

I hope you can learn from my mistake and avoid making it yourself.
