Building a cloud architecture that can scale to any challenge

By building a cloud architecture that could meet the complex needs of international expansion, Nubank has established a more robust and flexible cloud infrastructure.
October 05, 2023

Founded in Brazil in 2013 with a credit card product, Nubank has since grown to serve 80 million customers globally with a wide variety of banking products. This rapid global expansion required a technology foundation flexible enough to meet a wide range of business requirements. Here’s how we did it.

Starting from scratch

We had the luxury of starting from scratch 10 years ago, which meant taking a cloud-native approach and building hundreds of microservices. Over time, those microservices ended up running on a tightly coupled, monolithic cloud infrastructure.

Since the beginning, a key engineering principle has been canonical approaches, consistently applied. This approach hinges on having fewer ways to solve similar problems, focusing on always evolving the system as a whole, and working towards a more standardized and homogeneous tech stack.

We opted to standardize around a set of core backend technologies: Clojure for microservices, Kafka for asynchronous communication, and Datomic for high-value business data, all sharing the same foundational infrastructure.

This high level of standardization created a symbiotic relationship between the application and infrastructure layers. As a result, deploying standard applications to non-standard infrastructure, or vice versa, was not a viable option without expending a huge amount of effort.

Cloud architecture beyond infrastructure

This symbiotic relationship was put to the test by Nubank’s international expansion, which brought a new set of requirements that went beyond the earlier changes to business logic and customer experience. Below are the challenges we faced, each paired with its proposed solution:

  • Challenge: Financial services are highly regulated by each country’s central bank and other entities.
    Solution: Different computing environments for each country, providing the isolation and flexibility to meet different regulatory requirements, and making sure business and operations are decentralized and independent.

  • Challenge: Shifting to become a global company while avoiding duplication of internal business operations.
    Solution: Shared services and separate computing environments for global services, meaning internal engineering and business operations tools that are not country specific. Isolation between corporate and end customer product offerings.

  • Challenge: Time to launch in a new country, being able to launch a credit card, bank account, or any other financial service in months.
    Solution: Reproducibility and repeatability, with a hassle-free end-to-end infrastructure setup.

Other company-wide strategic goals, such as acquiring new businesses, launching new products, onboarding new customers, and serving different customer segments (like financial services for small, medium, and large businesses), presented similar challenges and infrastructure needs over the years.

One proposed solution to this increased need for flexibility was to allow for application workloads to run in different computing environments. We weren’t necessarily looking for a variety of computing environments, but we did want a selection of autonomous and independent cloud environments. We achieved this by establishing a modular and secure cloud foundation based on landing zones.

The major cloud providers like Google Cloud Platform, Microsoft Azure, and Amazon Web Services (AWS) all have their own definition of a landing zone. The idea is to make it easier for new cloud adopters to get started quickly using a well-defined framework and applying best practices.

On AWS, a landing zone is implemented as a multi-account environment. Each account holds responsibility for different system capabilities, such as logging and security.

Even though we were already a cloud-native business, the landing zone concept inspired our architecture design and helped us apply recommended best practices, allowing us to distribute workloads according to compliance and regulatory requirements, unique scaling needs, or operational efficiency.

Having parts of the system distributed across AWS accounts provides complete isolation of the resources and applications running in each one. On the other hand, any shared resource requires intentional, explicit, and granular additional configuration, which increases the complexity of managing how the distributed workloads interact. This is especially true when defining the cloud network topology, where IP address management and routing settings have to be planned properly across accounts.
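To make the IP address management point concrete, here is a minimal sketch of carving non-overlapping address blocks for each account out of one shared range, so that accounts can later be routed to each other without conflicts. It is an illustration only, not a description of Nubank's tooling: the account names, the `10.0.0.0/8` supernet, and the `/16` block size are all assumptions.

```python
import ipaddress


def allocate_account_networks(supernet, accounts, prefix):
    """Give each account its own block, carved in order from one supernet."""
    blocks = ipaddress.ip_network(supernet).subnets(new_prefix=prefix)
    plan = {}
    for account in accounts:
        try:
            plan[account] = next(blocks)
        except StopIteration:
            raise ValueError(
                f"{supernet} has no /{prefix} block left for {account}"
            ) from None
    return plan


def has_overlap(networks):
    """Return True if any two networks share addresses."""
    nets = list(networks)
    return any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])


# Hypothetical account names, used only for illustration.
plan = allocate_account_networks(
    "10.0.0.0/8",
    ["country-a-production", "country-a-staging", "global-tools"],
    16,
)
for account, network in plan.items():
    print(account, network)
# country-a-production 10.0.0.0/16
# country-a-staging 10.1.0.0/16
# global-tools 10.2.0.0/16
```

Allocating from a single plan like this is what keeps ranges unique across accounts; checking `has_overlap(plan.values())` before provisioning is a cheap guard against ranges that were added by hand.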

To benefit from the isolation an AWS account provides, you need to think carefully about how to distribute workloads across accounts, or you risk creating unwanted boundaries in the system. For example, separating microservices that belong to the same domain would create an unwanted boundary, because they are not meant to be completely isolated from each other. By contrast, we did want complete isolation between countries.

We watched and learned over the years to avoid premature design decisions and defined some criteria for when it’s warranted to use a separate computing environment. The consequence of not having confidence in the chosen criteria is that it’s hard to make changes at the infrastructure level without an extensive migration and the associated expensive downtime, such as when a data migration needs to take place.

Countries

With international expansion, it became obvious that we needed to have complete independence for each country’s operations. Each country would have its own computing environment in one or more AWS accounts, but with the same architecture design and standards applied for each country.

These “copies” of each environment allowed for certain unique and special needs or requirements, but the standardization made it easier to get up and running. We did this by leveraging the same infrastructure as code (IaC) and automation tools, and by always bootstrapping a new country with the same standard infrastructure components. In the beginning this meant that every country had one AWS account, so they all looked the same.

Worldwide

The shift to becoming a global company was reflected in our infrastructure choices. By having purpose-built computing environments to host global platforms, we were able to share systems across all countries.

This includes engineering tools, internal business operations applications, and platforms like our ETL/data processing, continuous integration and delivery (CI/CD), and security controls.

The separation between customer-facing and internal software was driven by a desire to reduce duplication, limit the dependencies between different countries’ infrastructures and, most importantly, minimize the impact of changes on our customer base.

Deployment environments

As we launched new products and the customer base grew, it became increasingly important to have even stricter isolation between deployment environments to reduce the blast radius from things that could go wrong in non-production workloads.

Segregated networks, extensive security policies, access management, and authentication and authorization were already in place to isolate the production and staging environments. With the growth of the software fleet and the size of the engineering team, however, these controls needed to evolve, which meant moving staging and production environments into separate AWS accounts.
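The isolation dimensions described so far (country versus global scope, and staging versus production) can be pictured as a placement rule. The sketch below is a simplified illustration under our own assumptions, not Nubank's implementation: the `Workload` fields, the service names, and the one-account-per-scope-and-stage rule are hypothetical. The point it shows is that scope and stage decide the account, while the business domain does not, so microservices in the same domain stay together.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    service: str
    domain: str
    scope: str  # a country identifier, or "global" for shared services
    stage: str  # "staging" or "production"


def account_for(workload):
    """Scope and stage pick the account; the domain deliberately does not."""
    return f"{workload.scope}-{workload.stage}"


def group_by_account(workloads):
    accounts = defaultdict(list)
    for workload in workloads:
        accounts[account_for(workload)].append(workload.service)
    return dict(accounts)


# Hypothetical workloads, used only for illustration.
workloads = [
    Workload("card-authorizer", "credit-card", "country-a", "production"),
    Workload("card-billing", "credit-card", "country-a", "production"),
    Workload("card-authorizer", "credit-card", "country-a", "staging"),
    Workload("card-authorizer", "credit-card", "country-b", "production"),
    Workload("ci-runner", "engineering-tools", "global", "production"),
]
placement = group_by_account(workloads)
```

Here both credit card services for country A land in the same production account, while the same service running in staging or in country B lands in a different one.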

Bring your own cloud

This encompasses self-hosted external or third party software, where most vendors provide a set of scripts that can be run to deploy their software. It can be scary to run these in the same production environment where all your proprietary software and core business logic are, so we benefited from having a different deployment space where we could apply stricter and use-case specific security policies.

Application workloads

The strong bond between microservices and cloud infrastructure started to become a bottleneck for the growth we were seeing in the number of customers and the variety of products offered. Different parts of the system started to present specific scaling needs. The overall infrastructure was designed to scale to accommodate the highest load, such as high-traffic payment systems, which made it more complex and costly to run lower-load services, like insurance systems.

How a core business logic application’s workload can be distributed depends on how the system is designed. Microservices make it easier, as the system is already split into relatively small and independent pieces. The challenge is that domain boundaries aren’t always clear, so decisions need to be made based on expected system behavior. Grouping microservices within a domain boundary is tricky: boundaries overlap or are unclear, or there are simply too many different ways to draw them.

Reproducibility and repeatability

Having a repeatable and reproducible end-to-end bootstrap process for computing environments was essential, as we would be creating new environments more often.

We first made the bootstrap process repeatable with more automation through IaC tooling and via documentation. We all know that text documentation gets outdated very quickly, but the initial process was derived from literally documenting every step the first time we created a new deployment space. This was crucial, as we used the step-by-step guide like a map to show us what had to be done and what could be improved for the future.

Having a repeatable process using the step-by-step documentation wasn’t enough though. The end-to-end bootstrap took months because the process had too many moving parts, multiple teams involved, and everything became intertwined, with unclear dependencies.

The bootstrap process had to be consistently reproducible, in a reliable and predictable way. Infrastructure provisioning was done through IaC, but some parts were still manual or relied on script-based automation, so we opted for making the IaC codebase more coherent, with explicit dependencies mapped by having well-defined components, underpinned by standards and guidelines.

Infrastructure component layers

It’s very easy to fall into the trap of circular dependencies at the infrastructure level, where you always have to provision everything because the dependencies aren’t clear, or because they don’t allow you to cherry-pick only what you need.

We learned the importance of creating a cohesive dependency graph, in which we have components at the bottom of the stack which don’t depend on the ones at the top.

The ones at the bottom are usually lower-level essentials like network, security, and the absolute must-haves. The ones at the top are usually more like “utilities”, or are related to specific features and capabilities that might change depending on the purpose of the new computing environment.
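A layered dependency graph like this can be expressed directly in code. The sketch below is a minimal illustration using Python's standard `graphlib` module (Python 3.9+), not a description of Nubank's IaC codebase: the component names and their dependencies are hypothetical. It shows the two properties discussed above: a circular dependency is rejected outright, and you can provision only the components a given environment needs, in an order where lower layers come first.

```python
from graphlib import CycleError, TopologicalSorter

# component -> components it depends on (hypothetical names and edges)
DEPENDENCIES = {
    "networking": set(),
    "identity-and-access": set(),
    "security-and-audit": {"identity-and-access"},
    "ci-cd-integration": {"identity-and-access", "networking"},
    "service-runtime": {"networking", "security-and-audit"},
    "data-processing": {"service-runtime"},
}


def required_components(targets, dependencies):
    """Return the targets plus everything they transitively depend on."""
    needed, stack = set(), list(targets)
    while stack:
        component = stack.pop()
        if component not in needed:
            needed.add(component)
            stack.extend(dependencies[component])
    return needed


def provisioning_order(targets, dependencies):
    """Order the needed components so dependencies are provisioned first.

    Raises graphlib.CycleError if the needed components depend on each
    other in a cycle.
    """
    needed = required_components(targets, dependencies)
    subgraph = {component: dependencies[component] for component in needed}
    return list(TopologicalSorter(subgraph).static_order())


# Cherry-pick: only CI/CD integration and what it rests on.
order = provisioning_order(["ci-cd-integration"], DEPENDENCIES)
print(order)
```

Asking for `ci-cd-integration` alone pulls in `networking` and `identity-and-access` but leaves `service-runtime` and `data-processing` out, which is the cherry-picking that unclear dependencies make impossible.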

Cloud infrastructure baselines

To ensure reproducibility, and as a prerequisite for increasing the number of computing environments, we defined standards and guidelines to establish governance around cost, security, resource management, and consistency.

This would be what we have at the bottom of our stack, consisting of:

  • Identity and access management
  • Integration with CI/CD tools
  • Networking
  • Security and audit

The above list is the set of what we call cloud infrastructure baselines, or standardized components that are present in every cloud environment. Combined with improvements on the automation and end-to-end bootstrap process, we managed to reduce the time from months to weeks, as well as reducing the complexity of having production-ready computing environments for new countries, business acquisitions, or migrations.
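One way to picture baselines is as a fixed set that every environment definition must contain, with anything purpose-specific layered on top. The following is a minimal sketch under our own assumptions, not Nubank's tooling: the identifiers are simply the four list items above written as strings, and `data-processing` is a hypothetical extra component.

```python
# The four baselines from the list above, as plain identifiers.
BASELINES = frozenset({
    "identity-and-access-management",
    "ci-cd-integration",
    "networking",
    "security-and-audit",
})


def environment_components(extras=()):
    """Every environment gets all baselines; extras are purpose-specific."""
    return BASELINES | frozenset(extras)


def missing_baselines(components):
    """List the baselines an environment definition lacks, sorted by name."""
    return sorted(BASELINES - set(components))


# A hypothetical environment that adds one component on top of the baselines.
data_environment = environment_components(["data-processing"])
print(missing_baselines(data_environment))  # []
print(missing_baselines(["networking"]))
# ['ci-cd-integration', 'identity-and-access-management', 'security-and-audit']
```

A check like `missing_baselines` can run before provisioning, so an environment that skips a baseline is caught as a definition error rather than discovered later in production.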

Final thoughts

Making room for growth in your cloud environment is as important as making individual components more scalable. For Nubank, this meant deciding what’s repeatable and what needs to be centrally managed as shared services. Then, by being able to rapidly bootstrap separate cloud computing environments for what’s repeatable, we opened up the flexibility to meet a new set of business requirements. Last but not least, by putting together well-defined modular components, standardization and flexibility can coexist within a single environment.

Our approach to international expansion enabled many different versions of the future that weren’t necessarily concerns back then, like mergers and acquisitions, and the scaling needs for our ever-growing customer base.

This has been a five-year journey, but it still feels like the very first steps. We have made slow progress toward a multi-faceted production environment, one that enables the flexibility required to scale to meet any challenge the business brings us.