This post was written by William Youngs, Software Engineer, Daniel Alkalai, Senior Software Engineer, and Jun-young Kwak, Senior Engineering Manager, from Tinder.
Tinder was introduced on a college campus in 2012 and is the world’s most popular app for meeting new people. It has been downloaded more than 340 million times and is available in 190 countries and 40+ languages. As of Q3 2019, Tinder had nearly 5.7 million subscribers and was the highest grossing non-gaming app globally.
At Tinder, we rely on the low latency of Redis-based caching on Amazon ElastiCache for Redis to service 2 billion daily member actions while hosting more than 30 billion matches.
In our caching approach, when our app receives a request for data, it queries a Redis cache for the data before it falls back to a source-of-truth persistent database store, Amazon DynamoDB (though PostgreSQL, MongoDB, and Cassandra are sometimes used).
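The read path described above is the classic cache-aside pattern. The following is a minimal sketch of it, not Tinder's actual code: the function, key format, and the in-memory stand-ins for the Redis client and the DynamoDB table are all hypothetical and exist only to make the example self-contained.

```python
import json

class FakeCache:
    """In-memory stand-in for a Redis client (get/setex subset only)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def setex(self, key, ttl, value):
        self.store[key] = value  # TTL is ignored in this sketch

class FakeTable:
    """In-memory stand-in for the persistent source of truth (e.g. DynamoDB)."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0  # counts how often we fall back to the database
    def get(self, member_id):
        self.reads += 1
        return self.rows.get(member_id)

def get_member(member_id, cache, db, ttl_seconds=300):
    """Cache-aside read: serve from the cache when possible, otherwise
    fall back to the source of truth and populate the cache on the way back."""
    key = f"member:{member_id}"       # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)     # cache hit: no database round trip
    record = db.get(member_id)        # cache miss: read the source of truth
    if record is not None:
        cache.setex(key, ttl_seconds, json.dumps(record))
    return record

cache = FakeCache()
db = FakeTable({"42": {"name": "Alex"}})
first = get_member("42", cache, db)   # miss: reads the database once
second = get_member("42", cache, db)  # hit: served entirely from the cache
```

With a real deployment, `FakeCache` would be a redis-py client pointed at the cluster endpoint and `FakeTable` a DynamoDB table, but the control flow is the same.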
Before we adopted Amazon ElastiCache for Redis, we self-managed our Redis workloads on Amazon EC2 instances. Our self-implemented sharding solution functioned reasonably well for us early on. However, as Tinder’s popularity and request traffic grew, so did the number of Redis instances. This increased the overhead and the challenges of maintaining them.
Motivation to migrate
We began exploring new solutions first because of the operational burden of maintaining our sharded Redis cluster on a self-managed solution. Maintaining our Redis clusters consumed a significant amount of development time, delaying engineering work our teams could have focused on instead. Rebalancing a cluster, for example, was an immense ordeal: we had to duplicate the entire cluster just to rebalance it.
Second, inefficiencies in our implementation required us to overprovision our infrastructure, increasing cost. Our original sharding algorithm was inefficient and led to systematic issues that often required developer intervention. Additionally, if we needed our cache data to be encrypted, we had to implement the encryption ourselves.
Finally, and most importantly, our manually orchestrated failovers caused app-wide outages. During a failover, we would often lose the connection between one of our core backend services and its caching node, and until an application restart reestablished that connection, our backend systems were often severely degraded. This was by far the most significant motivating factor for our migration: before our migration to ElastiCache, the failover of a Redis cache node was the largest single source of app downtime at Tinder. To improve the state of our caching infrastructure, we needed a more resilient and scalable solution.
Investigation of new solutions
We decided fairly early that we should consider fully managed solutions in order to free up our developers from the tedious tasks of monitoring and managing our caching clusters. We were already operating on AWS services and our code already used Redis-based caching, so deciding to use Amazon ElastiCache for Redis seemed natural.
We performed some benchmark latency testing to confirm ElastiCache as the best option and found that, for our specific use cases, ElastiCache was a more efficient and cost-effective solution. The decision to maintain Redis as our underlying cache type allowed us to swap from self-hosted cache nodes to a managed service nearly as simply as changing a configuration endpoint.
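To illustrate why the cutover was so simple, the sketch below isolates the one value that changed: the Redis endpoint the client connects to. Both hostnames are made up for illustration; the application code around the client stays identical.

```python
# Hypothetical endpoints for illustration only; real hostnames differ.
SELF_HOSTED = "redis-shard-01.internal.example.com:6379"
ELASTICACHE = "tinder-cache.abc123.clustercfg.use1.cache.amazonaws.com:6379"

def redis_config(endpoint):
    """Build the connection settings a Redis client needs from a single
    endpoint string -- the only configuration value swapped at cutover."""
    host, _, port = endpoint.rpartition(":")
    return {"host": host, "port": int(port)}

old = redis_config(SELF_HOSTED)
new = redis_config(ELASTICACHE)
# e.g. client = redis.Redis(**new) -- same client library, new endpoint.
```

Because we kept Redis as the underlying cache engine, no keys, commands, or serialization formats had to change, only the endpoint.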
ElastiCache satisfied our two most important backend requirements: scalability and stability. Previously, when using our self-hosted Redis infrastructure, scaling was time consuming and involved too many tedious steps. Now we initiate a scaling event from the AWS Management Console, and ElastiCache takes care of data replication automatically. AWS also handles maintenance (such as software patches and hardware replacement) during planned maintenance events with limited downtime. In addition, we appreciate the data encryption at rest that ElastiCache supports out of the box.
Finally, we were already familiar with other products in the AWS suite of digital offerings, so we knew we could easily use Amazon CloudWatch to monitor the status of our clusters.
After implementing our migration strategy and addressing our original concerns, reliability issues plummeted and we saw a marked increase in app stability. Scaling our clusters, creating new shards, and adding nodes became as easy as clicking a few buttons in the AWS Management Console. The migration significantly freed up our operations engineers' time and resources and brought dramatic improvements in monitoring and automation. For a technical deep dive on our migration strategy for moving to Amazon ElastiCache, check out Building Resiliency at Scale at Tinder with Amazon ElastiCache.
Our migration to ElastiCache proceeded smoothly and gave us immediate, dramatic gains in scalability and stability. We could not be happier with our decision to adopt ElastiCache into our stack here at Tinder.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.