At Softeta, we’ve been running cloud infrastructure for enterprise clients since we started in 2020. Across 50+ engagements, I noticed that the teams that struggle at scale are almost always fighting an organizational problem with a technical tool.
Enterprise scaling breaks at the coordination layer
When a startup goes down for twenty minutes, they post an apology on X.com. When an enterprise bank goes down for ten, it triggers SLA penalties plus a formal incident report that gets reviewed by compliance officers who’ve never opened a terminal.
At SEB, one of the largest banks in Northern Europe, five payments teams had each built their own QA process when we started in mid-2022: duplicated test suites and UI-heavy testing that didn’t fit their microservice setup.
Getting all of those teams to agree on how testing should work is harder than it looks. We aligned them under one QA strategy with component-level testing and shared end-to-end suites.
In our work at SEB, bug detection at the component level became 70% cheaper than catching those same issues in production. That’s a coordination outcome and it showed up as a technical metric.
The usual cloud architecture articles skip this part and jump straight to service decomposition patterns. Thing is, your architecture can only be as good as the organization running it.
Then there’s the compliance layer. A European financial client might need their transaction data in Frankfurt, their user data processed under GDPR, and audit logs kept for seven years, so your “cloud migration” turns into a hybrid setup because legal and procurement each made their own decisions before engineering was in the room. Believe it or not, this is very common for large European companies.
Cloud costs grow quietly, then somebody in finance notices
Flexera’s 2025 State of the Cloud Report puts wasted cloud spend at 27%. The 2026 follow-up bumped it to 29%, the first increase in five years. That trend suggests the problem isn’t getting solved by default: despite more tooling than ever, businesses are getting worse at it.
The waste pattern often looks like this: dev environments running around the clock on production-sized instances, autoscaling minimums set after a traffic spike six months ago and never revisited. Individual charges look small on their own. Then finance pulls a quarterly report and asks why you’re at 180,000 EUR when the budget was 120,000 EUR.
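To make the quiet growth concrete, here’s the arithmetic for a single always-on dev environment. The instance counts and hourly rate below are illustrative, not from any specific client:

```python
# Illustrative numbers: a dev environment of 4 production-sized
# instances at 0.50 EUR/hour each, left running around the clock.
hourly_rate = 0.50          # EUR per instance-hour (assumed)
instances = 4
hours_always_on = 24 * 30   # a 30-day month
hours_business = 10 * 22    # 10h/day, 22 working days

always_on = instances * hourly_rate * hours_always_on
scheduled = instances * hourly_rate * hours_business

print(f"always-on: {always_on:.0f} EUR/month")
print(f"scheduled: {scheduled:.0f} EUR/month")
print(f"waste:     {always_on - scheduled:.0f} EUR/month")
```

One environment quietly burning around 1,000 EUR a month is invisible on its own. Multiply by every team’s environments and the quarterly surprise stops being mysterious.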

The fix is straightforward: tag every resource with the owning team and build a spend dashboard by service. And yet, you’d be surprised how many companies don’t do this.
When engineers can see that their service costs 8,000 EUR a month and most of it is an oversized RDS instance, they fix it on their own.
In some of our client work, systematic tagging and weekly reviews have cut infrastructure costs by 30% and IT costs by 40%. Those numbers don’t come from a one-time audit, but from making the costs visible to the people who control them.
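A minimal sketch of the per-team rollup behind such a dashboard. In practice the line items come from the provider’s billing export (for example AWS’s Cost and Usage Report); here they’re hardcoded, and the important detail is that untagged spend gets surfaced as its own bucket rather than hidden:

```python
from collections import defaultdict

# Billing line items: (resource_id, tags, monthly_cost_eur).
# In a real pipeline these come from the cloud billing export.
line_items = [
    ("rds-payments-prod", {"team": "payments"}, 8000),
    ("eks-node-group-1",  {"team": "platform"}, 5200),
    ("s3-analytics",      {"team": "data"},      900),
    ("ec2-legacy-report", {},                   1400),  # untagged!
]

spend = defaultdict(float)
for resource, tags, cost in line_items:
    spend[tags.get("team", "UNTAGGED")] += cost

for team, total in sorted(spend.items(), key=lambda kv: -kv[1]):
    print(f"{team:>10}: {total:>8.0f} EUR")
```

The UNTAGGED bucket is the point: it’s a ready-made worklist of resources nobody currently owns.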
Reserved instances for steady state production workloads save real money too. Spot instances work well for batch jobs and CI/CD runners that can handle interruptions.
Splitting services is a business call
Does your payment processor need to share a database connection pool with your PDF generator? If one spikes, do you want both going down?
I’d be shocked if the answer is yes.
Our client PortalPRO connects property managers with vetted contractors. They needed pricing and email parsing handled alongside chatbot queries, and those workloads don’t scale the same way. We split them into separate services on .NET Core with Kafka handling the async layer. They expanded into Spain and Portugal without expanding the team, and their manual request handling dropped by 30%.
We pair relational databases for transactional work with DynamoDB or Elasticsearch on the read side. Your reporting team will lose their cross-service JOINs and they’ll complain about it. A single Postgres instance bottlenecking all your services is a worse outcome. And splitting the databases forces a conversation about data ownership that most organizations have been avoiding.
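A minimal sketch of what replaces the reporting JOIN once the databases are split: denormalize at write time into a read-side document. In-memory dicts stand in for Postgres and the read store here, and all the names are illustrative:

```python
# Write side: each service's own transactional store (dicts standing
# in for Postgres tables). Read side: a denormalized document per
# order, built at write time instead of JOINed at query time.
orders = {}                                              # order service owns this
customers = {101: {"name": "Acme Oy", "country": "FI"}}  # customer service owns this

read_store = {}  # what would land in DynamoDB/Elasticsearch for reporting

def place_order(order_id, customer_id, total):
    orders[order_id] = {"customer_id": customer_id, "total": total}
    # The projection copies the customer fields the reporting side
    # needs, so reports never reach across service boundaries.
    read_store[order_id] = {
        "total": total,
        "customer_name": customers[customer_id]["name"],
        "customer_country": customers[customer_id]["country"],
    }

place_order(1, 101, 250.0)
print(read_store[1])
```

The trade is explicit: writes do a little more work, and the read side is eventually consistent, but no single database sits under every service.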
On one e-commerce project, separating the services eliminated outages that kept occurring whenever marketing pushed a campaign. The campaign spiked the frontend, the frontend choked the backend, and everything went down together. Once the services were independent, the chain reaction stopped.
Of course, not every workload needs this treatment. If you have three services with stable traffic, you’d be adding complexity for no good reason. Ultimately, the split is a business decision and should not be treated as an architectural default.
If your infrastructure lives on one laptop, nothing else matters
One of our mid-sized clients told me a remarkable story. They had a “senior” engineer who ran terraform apply from his machine. He was the only one who knew the production state and the only one who could make changes. Then he took a summer vacation and, predictably, infrastructure changes froze for weeks.
Unfortunately, versions of this story are not that rare.
HashiCorp’s 2024 Cloud Strategy Survey found that 90% of organizations use provisioning tools, but only 8% qualify as highly mature in cloud operations. The gap between “we use Terraform” and “we actually operate well” is where most teams live.
The point of Terraform is that everything goes into version control and someone must review the diff before someone else hits apply. ArgoCD on the Kubernetes side watches the Git repo and flags any configuration drift. That way you no longer have to guess who changed what in production.
For a Baltic retail group with 180+ stores and 2,200 employees, we modernized an ERP running on .NET Framework with no automation pipelines. After moving to Azure DevOps with version control, security vulnerabilities dropped by 90% and deployments got substantially faster.
You either have infrastructure defined in a Git repo the whole team can work with, or you have one person who is the single point of failure for your entire production environment. The second one always costs more than it should.
Kubernetes earns its complexity past a certain threshold
The case for Kubernetes is simple: past a certain scale, nothing else does what it does. The case against it is equally simple: that threshold is higher than most people think.
We use managed services like EKS or GKE because running your own control plane wastes engineering time unless compliance forces it. Even with managed, there’s a lot to get right before your first real deploy.
DORA’s 2025 report found that 90% of organizations now have some kind of internal platform. That tells you adoption is no longer the problem; doing it well is.
Namespace isolation and resource quotas are things we set up from day one. If one team’s runaway pod can starve every other service on the cluster, you’ll find out at the worst possible moment.
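As a sketch, this is the kind of per-namespace quota we mean, shown as the manifest structure (Kubernetes accepts JSON manifests as well as YAML). The names and limits are illustrative; tune them to each team’s actual footprint:

```python
import json

# Illustrative ResourceQuota for one team's namespace: caps the
# namespace's total CPU, memory, and pod count so a runaway workload
# can't starve every other service on the cluster.
quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "team-payments-quota", "namespace": "payments"},
    "spec": {
        "hard": {
            "requests.cpu": "10",
            "requests.memory": "20Gi",
            "limits.cpu": "16",
            "limits.memory": "32Gi",
            "pods": "50",
        }
    },
}
print(json.dumps(quota, indent=2))
```
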
A trap we keep seeing is autoscaling that hasn’t been load-tested. “Scale at 70% CPU” sounds fine, but then peak traffic arrives, and it turns out the new pods take 45 seconds to come up while the load balancer needs another 30.
That 75-second gap crushes the existing pods. For that reason we load-test every autoscaling config before it goes to production.
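The gap is easy to quantify before production instead of during an incident. A back-of-envelope sketch using the scenario above; the pod start-up and load-balancer registration times are the numbers your load test should actually measure:

```python
# During the scale-up gap, the existing pods absorb the whole spike.
pod_startup_s = 45      # measured time for a new pod to become ready
lb_register_s = 30      # time for the LB to start routing to it
gap_s = pod_startup_s + lb_register_s

current_pods = 10       # illustrative fleet size
capacity_per_pod = 100  # req/s each pod handles comfortably
spike_rps = 1500        # peak traffic

# Load each existing pod carries until the new capacity is live:
load_per_pod = spike_rps / current_pods
print(f"scale-up gap: {gap_s}s")
print(f"per-pod load during gap: {load_per_pod:.0f} req/s "
      f"(rated for {capacity_per_pod})")
```

If the gap load exceeds rated capacity, either raise the autoscaling minimum or lower the scaling threshold; the load test tells you which is cheaper.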
Kubernetes is also overkill for a lot of workloads. Got three services with steady traffic? ECS or Cloud Run gets you there at a fraction of the operational cost.
Multi-cloud is almost never something you planned for
Most multi-cloud setups aren’t by design, but rather by inheritance. Company B gets acquired by Company A. One is using AWS, the other Azure. Now you have two clouds with two billing accounts, and two engineering microcultures unwilling to merge. But hey – the integration timeline was supposed to fix this, right?
HashiCorp reports that 91% of enterprises waste money in the cloud. Multi-cloud makes it worse because you’re duplicating identity management and monitoring across providers with different abstractions.
The obvious advice is to stick with one provider whenever possible. Most lock-in comes from managed services like RDS and DynamoDB, and those are worth keeping because they shrink your operational surface. Containerized workloads transfer between clouds more easily than VM-based ones if you ever do need to move.
At QarView, we stayed on Azure for their fleet management SaaS. Four engineers delivered a multi-tenant platform in twelve months, and manual fleet operations dropped by 70%.
Staying on one cloud was one of the reasons that such a small team could ship extremely fast. When you’re not managing two sets of IAM policies and two monitoring stacks, you can put that engineering time into the product.
Terraform helps if you do end up multi-cloud, because you manage resources across providers with one workflow. But I’d still push hard to avoid it if you can.
Security problems compound faster than any other kind of debt
We start every engagement with zero trust networking now. The old “everything inside the VPC is trusted” model stopped working once service counts grew past a handful. We authenticate every service-to-service call, either with mTLS through Istio or with IAM roles.
IAM is where the real damage piles up quietly. Someone can’t deploy, gets AdministratorAccess as a temporary fix, and that permission is still there two years later.
We run quarterly IAM audits going through every role. But to be honest, I’m not sure monthly would be overkill.
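A sketch of the audit’s core question, run here over hardcoded roles. On AWS the same data comes from IAM’s credential report and attached-policy listings; the role names below are made up:

```python
from datetime import date

# (role name, attached policies, last time anyone assumed the role).
# Illustrative data; in practice this comes from the IAM API.
roles = [
    ("deploy-ci",     ["DeployOnly"],          date(2025, 11, 2)),
    ("temp-fix-2023", ["AdministratorAccess"], date(2023, 6, 1)),
    ("data-pipeline", ["S3ReadWrite"],         date(2025, 10, 20)),
]

today = date(2025, 11, 15)
flagged = [
    (name, policies, last_used)
    for name, policies, last_used in roles
    if "AdministratorAccess" in policies      # broad grants
    or (today - last_used).days > 90          # stale roles
]
for name, policies, last_used in flagged:
    print(f"review {name}: {policies}, last used {last_used}")
```

The two-years-later AdministratorAccess grant from the paragraph above is exactly what the first filter catches.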
On our car leasing and fleet management engagement that was ECB-regulated with operations in 30+ countries, we put 150 controls in place addressing more than 50 key risks. Building for compliance from day one is what makes it manageable.
Retrofitting SOC 2 or ISO 27001 onto a system that wasn’t built for it is engineering hell. So start the compliance work on day one if you know an audit is coming. It’s cheaper, and it saves a lot of headaches later.
You need observability before you think you need it

When a customer reports that checkout is slow, you need to trace that request through a dozen services and find the delay in a sequential scan on a 50-million-row table. That’s what observability looks like in practice. Dashboards with green lights just aren’t it.
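A toy version of the “where did the time go” question a trace answers. The spans are hardcoded here; with OpenTelemetry in place they’d come from your tracing backend:

```python
# Spans from one checkout request: (service, operation, duration ms).
# Illustrative data, including the slow sequential scan.
spans = [
    ("gateway",  "POST /checkout",      40),
    ("cart",     "load_cart",           25),
    ("pricing",  "compute_totals",      30),
    ("orders",   "SELECT ... seq scan", 2300),  # the actual culprit
    ("payments", "authorize",           180),
]

slowest = max(spans, key=lambda s: s[2])
total = sum(d for _, _, d in spans)
print(f"total: {total} ms; slowest: {slowest[0]} / {slowest[1]} "
      f"({slowest[2]} ms, {100 * slowest[2] / total:.0f}% of the request)")
```

One query accounting for nearly the whole request is the typical shape of these incidents, and a trace hands you that answer directly.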
Datadog handles our metrics on most projects. For clients who are self-hosting, we use Grafana with Prometheus and OpenTelemetry for the distributed traces.
On one project, retrofitting tracing across 30 services took months. Adding it to each service as you build is an afternoon of work. The difference in effort is hard to argue with once you’ve lived through the retrofit.
It also changes how your team handles incidents. You stop guessing which service is slow and restarting things until the problem goes away. You look at a trace instead.
There’s one additional rule we enforce: if the on-call team hasn’t acted on an alert in the past month, we straight up delete it. Noisy alerts train people to ignore the real ones.
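The rule is mechanical enough to script. A sketch with the alert history hardcoded; in practice you’d pull firing and acknowledgement counts from your alerting tool’s API:

```python
# (alert name, times fired last month, times a human acted on it).
# Illustrative data.
alert_history = [
    ("checkout-p99-latency",  4,  4),
    ("disk-80-percent",      60,  0),   # fires daily, always ignored
    ("payment-error-rate",    2,  2),
    ("cpu-spike-any-pod",    45,  1),
]

keep, delete = [], []
for name, fired, acted in alert_history:
    (keep if acted > 0 else delete).append(name)

print("delete:", delete)   # noise: fired all month, never acted on
print("keep:  ", keep)
```

An alert that fired sixty times with zero action is training the on-call rotation to ignore pages; deleting it makes the remaining ones mean something.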
Start with the boring stuff
Go pull up your cloud console and count how many of your roles have AdministratorAccess. Then check if your dev environments are still running on production-sized instances.
Look at where your last terraform apply came from. If it ran from someone’s laptop, that’s worth fixing before you do anything else.
If any of those answers bother you, that’s where you start. The things that actually matter at scale are seldom architecturally interesting. But they’re also the things that silently compound to the point of breaking you when you skip them.