AI and Kubernetes are pushing cloud costs out of control

How can you fight back against rising cloud costs while still adopting modern technologies like AI and Kubernetes?
March 25, 2024


There are plenty of small things – and one big change – that can help you regain control.

Cloud costs are escalating, adding pressure to engineering leaders also contending with an unpredictable economy, layoffs, and shrinking budgets. These macroeconomic uncertainties are driving many organizations to revisit their capital expenses and consider FinOps, an emerging discipline that seeks to better understand cloud costs and maximize investments within an organization.

The 2024 State of FinOps survey found that, for the first time, reducing waste was the top priority for FinOps practitioners, reflecting the pressure to curtail spending. The challenge for engineering leaders is figuring out where costs originate and how to respond.

“Accurately attributing cloud costs has become a Sisyphean task many companies face,” says Vasil Kaftandzhiev, senior product manager at Grafana Labs. “As both the cloud and your systems evolve, it’s hard to figure out what exactly is eating up all that money, let alone optimize those costs.”

Cloud native costs are rising

Analyst firm Gartner predicts that cloud spending will reach $679 billion in 2024, growing by over 20% from 2023. Cloud system infrastructure services (IaaS) are set to grow by over 26% from 2023 to 2024.

While there is a clear need for organizations to get a handle on their cloud costs, buzzy cloud native technologies – from artificial intelligence (AI) and machine learning (ML), to Kubernetes (K8s) – seem to only be ballooning costs. According to the State of FinOps survey, 45% of large enterprises spending $100 million or more annually in the cloud say their “costs have rapidly increased” due to AI/ML workloads.

A 2023 CNCF microsurvey report also found that Kubernetes has driven up spending for nearly half of those surveyed. According to Webb Brown, CEO at Kubecost, these widespread cost increases reflect the fact that existing tooling hasn’t adequately supported the shift to cloud native platforms like Kubernetes.

“Now, people are designing for multicluster from the start,” says William Morgan, CEO of Buoyant, creators of the popular service mesh Linkerd. Pair this with the goal of high availability and cloud agnosticism, and organizations are suddenly dealing with multiple clusters across multiple clouds, regions, and zones. “There’s always a balance between high availability and cost,” he says.

Reducing cloud native expenses

Engineering teams can use various tactics to reduce and optimize cloud native expenses. However, the most profound step is a cultural shift. According to Mike Fuller, CTO of The FinOps Foundation, engineering teams are typically more concerned with the performance and quality of their services than the cost impact of their decisions.

Establishing a FinOps culture, therefore, is vital to elevate cost as a more important metric in the hearts and minds of developers. “FinOps supports an organization to build policies and governance to enable cloud spending to be allocated to teams,” says Fuller. “This enables optimization recommendations to be routed to engineering teams to support them in identifying opportunities to improve cloud efficiency.”

According to the CNCF survey, overprovisioning, or having workloads using more resources than necessary, was the most common factor leading to overspending, at 70%.

Once teams understand the importance of tracking cost as a metric, how do they optimize usage? According to Kaftandzhiev, one common problem area is a disconnect between billing statements and the metrics teams collect from their Kubernetes cluster. To close this gap, he advises using purpose-built tools to inform your right-sizing efforts. One such add-on is kube-state-metrics, an agent that generates metrics about the state of objects within Kubernetes clusters. He also recommends PromQL, the functional query language from the popular observability suite Prometheus, to analyze time-series data.

“If you aren’t already measuring the cost of your Kubernetes fleet components and using PromQL to see the difference between the capacity you are paying for versus what you are using, that’s a really important place to start.” Additionally, auto-scaling should be in place, as well as quotas, requests, and limits. He recommends using horizontal or vertical auto-scaling to dynamically adjust resources based on demand.
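As a rough illustration of the comparison Kaftandzhiev describes, the sketch below takes per-namespace CPU requests and actual CPU usage and reports how much of the requested capacity sits idle. The two PromQL strings are one plausible way to pull those numbers, assuming kube-state-metrics (v2) and cAdvisor metrics are scraped by Prometheus; the namespaces and values are made up, and the helper `idle_share` is hypothetical rather than part of any tool named in this article.

```python
# PromQL one might run to get the inputs (assumption: kube-state-metrics v2 and
# cAdvisor metrics are available; label names may differ in your setup).
REQUESTED_CPU_QUERY = (
    'sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})'
)
USED_CPU_QUERY = (
    'sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
)


def idle_share(requested_cores, used_cores):
    """Return, per namespace, the fraction of requested CPU that is unused."""
    report = {}
    for namespace, requested in requested_cores.items():
        if requested <= 0:
            continue  # nothing requested, so there is no paid-for capacity to compare
        used = used_cores.get(namespace, 0.0)
        # Clamp at zero: usage can briefly exceed requests when limits allow bursting.
        report[namespace] = max(0.0, 1.0 - used / requested)
    return report


# Hypothetical query results, in CPU cores.
requested = {"checkout": 8.0, "search": 4.0}
used = {"checkout": 2.0, "search": 3.0}

for namespace, share in idle_share(requested, used).items():
    print(f"{namespace}: {share:.0%} of requested CPU is idle")
```

A namespace with a persistently high idle share is a candidate for smaller requests or for autoscaling, though a single five-minute window is rarely enough evidence on its own.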

After compute spending, the next area to target is data and storage. Kaftandzhiev sees storage as a prime area for fine-tuning. “Inefficient management of storage resources, such as not reclaiming unused volumes or not optimizing storage allocation, can result in unnecessary costs too,” he says.

Brown also suggests using automation to right-size workloads and introduce dynamic scaling and rate optimizations. However, he cautions leaders to walk, not run, into deploying automation in production. Before enacting automated optimizations, first look at historical data, recommends Brown. Just as stock brokers analyze historical data to simulate investment strategies, FinOps experts can look at past usage to simulate and validate future cloud usage optimizations.
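One way to picture this kind of backtest: derive a candidate resource request from past usage, then replay the same history against it to see how often the workload would have exceeded it. The sketch below is a minimal, assumption-laden version. The percentile-plus-headroom rule, the function names, and the sample data are all illustrative choices, not a method attributed to Brown or Kubecost.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def backtest_request(history, pct=95, headroom=1.2):
    """Propose a request from past usage and replay the history against it.

    Returns (proposed_request, breach_rate), where breach_rate is the share of
    historical samples that would have exceeded the proposed request.
    """
    proposed = percentile(history, pct) * headroom
    breaches = sum(1 for sample in history if sample > proposed)
    return proposed, breaches / len(history)


# Hypothetical CPU usage samples (cores) for one workload.
usage = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.9, 0.3, 0.28, 0.32]

for pct in (50, 95):
    proposed, breach_rate = backtest_request(usage, pct=pct)
    print(f"p{pct}: request {proposed:.2f} cores, {breach_rate:.0%} of samples exceed it")
```

Comparing a few settings this way, before any automation touches production, shows the trade-off between a tighter request and the risk of throttling.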

Visibility is also an important aspect of informing FinOps objectives. “Monitoring is often a core pillar toward building awareness in a FinOps presence,” says Brown. Visibility can increase accountability and enable things like internal chargeback models. It’s also important to calculate FinOps metrics, such as the cost of idle resources or the percentage of overall infrastructure that is rate-optimized, in order to track overall progress. One advanced metric to consider is normalized cost, or the total spend measurement adjusted for your operating business metrics.
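The metrics mentioned here reduce to simple arithmetic. As a hedged sketch, the functions below compute the cost of idle resources, the rate-optimized share of spend, and a normalized cost per unit of business output. The function names, the choice of "orders" as the business metric, and all figures are hypothetical.

```python
def idle_cost(total_cost, utilization):
    """Spend attributable to unused capacity; utilization is a fraction in [0, 1]."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return total_cost * (1.0 - utilization)


def rate_optimized_share(discounted_cost, total_cost):
    """Fraction of spend covered by discounted rates (commitments, reservations)."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return discounted_cost / total_cost


def normalized_cost(total_cost, business_units):
    """Spend per unit of business output, such as orders or active users."""
    if business_units <= 0:
        raise ValueError("business_units must be positive")
    return total_cost / business_units


# Hypothetical month: $50,000 of spend, 75% average utilization,
# $20,000 billed at discounted rates, 200,000 orders processed.
print(f"Idle cost: ${idle_cost(50_000, 0.75):,.0f}")
print(f"Rate-optimized share: {rate_optimized_share(20_000, 50_000):.0%}")
print(f"Cost per order: ${normalized_cost(50_000, 200_000):.2f}")
```

Tracked month over month, normalized cost separates growth-driven spending (more orders, same cost per order) from genuine inefficiency (same orders, higher cost per order).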

Encouraging FinOps practices from the top-down

Although engineers should be aware of cost optimization techniques, a FinOps practice typically requires buy-in from the top down. According to Fuller, top-level support for FinOps signals to the organization that it’s taking cost seriously. “FinOps practitioners commonly report that better leadership buy-in would assist with FinOps success,” he says.

“Leaders can help by ensuring teams have room on their roadmaps for FinOps tasks and establishing a culture of cloud spend being considered in engineering by making decisions that consider not only performance and quality but also cost,” says Fuller. He also stresses that alignment between finance and engineering groups is essential to realize this.

There are two broad ways to get there: either leadership sets clear policies and goals, or teams are encouraged to take ownership and find efficient ways to manage their own resources. “In practice, a hybrid approach often yields the best results,” adds Kaftandzhiev. “Leadership can set the vision and provide the necessary tools and policies, while teams on the ground drive optimization efforts based on their intimate knowledge of their applications and workloads.”

Embracing open standards to drive FinOps

Supporting open standards will undoubtedly be an essential ingredient in encouraging leaner habits. One such standard is the FinOps Open Cost and Usage Specification (FOCUS), which aims to create a normalized, universal cloud billing data format. Big names are already contributing to FOCUS, including some large cloud users, like Walmart, Meta, and Box, as well as the cloud service providers themselves, like AWS, GCP, and Microsoft. The hope is that FOCUS could significantly enhance visibility into cloud costs and aid interoperability with FinOps tools, such as cloud optimization and management platforms like Apptio Cloudability, VMware’s CloudHealth, and CloudMonitor.

“Having a uniform approach to the data and reporting of cloud spend enables organizations to simplify the implementation of capabilities needed to track, improve, and make decisions upon cloud usage,” says Fuller. “With a common dataset, the FinOps community is better able to share knowledge and learnings.”
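To make the idea of a common dataset concrete, the sketch below maps billing rows from two differently shaped exports onto one shared set of columns and then totals spend per service across providers. The target column names (`ProviderName`, `ServiceName`, `BilledCost`) are modeled on FOCUS and should be checked against the published specification; the provider names, source field names, and rows are invented for illustration and do not match any real billing export.

```python
from collections import defaultdict

# Hypothetical source layouts: which field holds the service name and the cost.
FIELD_MAPS = {
    "cloud_a": {"service": "product_name", "cost": "unblended_cost"},
    "cloud_b": {"service": "service_description", "cost": "cost"},
}


def to_common_row(provider, row):
    """Map one provider-specific billing row onto the shared column set."""
    fields = FIELD_MAPS[provider]
    return {
        "ProviderName": provider,
        "ServiceName": row[fields["service"]],
        "BilledCost": float(row[fields["cost"]]),  # exports often carry costs as strings
    }


def cost_by_service(rows):
    """Total billed cost per service across all providers."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["ServiceName"]] += row["BilledCost"]
    return dict(totals)


# Invented sample rows in each provider's own shape.
raw = [
    ("cloud_a", {"product_name": "Kubernetes", "unblended_cost": "120.50"}),
    ("cloud_b", {"service_description": "Kubernetes", "cost": 79.5}),
    ("cloud_b", {"service_description": "Object Storage", "cost": 30}),
]

normalized = [to_common_row(provider, row) for provider, row in raw]
print(cost_by_service(normalized))
```

Once every provider emits the shared columns directly, the per-provider mapping step disappears, which is the simplification Fuller describes.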

Kaftandzhiev also points to standardizing telemetry data with OpenTelemetry as useful within projects like OpenCost and FOCUS. “These initiatives contribute to the standardization of financial operations within cloud native environments, facilitating more transparent, accountable, and optimized resource utilization,” says Kaftandzhiev. New solutions are converging with the FinOps ecosystem, which experts anticipate will enable a more unified approach to managing and reducing cloud expenses. “Solutions like OpenCost are giving open source visibility in a way that didn’t exist before,” adds Brown.

Meeting sustainability goals

There is a natural intersection between right-sizing computing and meeting climate pledges. As such, optimization tactics, like removing inactive instances, creating more efficient code, and even programming smarter database queries, could all play a role in helping engineering teams optimize the energy consumption of the software they develop and maintain. 

Lastly, although engineering divisions are under pressure to optimize their footprints, leaders shouldn’t expect FinOps objectives to succeed without ongoing support from management. “It’s rare to see success at a big scale without some leadership support,” says Brown. “There’s technical nuance and depth, but there are real organizational and behavioral elements to this.” 

Therefore, he encourages shining a light on cost and embedding these metrics more deeply in the day-to-day culture and operations of engineering teams, as well as in executive decisions.