Last week we at n8n ran into problems getting a new database from Azure. After contacting support, it turns out that we can’t add instances to our k8s cluster either. Azure has told us they'll have more capacity in April 2023(!) — but we’ll have to stop accepting new users in ~35 days if we don't get any more. These problems seem limited to the German region, but setting up in a new region would be complicated for us.
We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.
Is anyone else experiencing these problems?
You're new to Azure I guess.
I'm glad the outage I had yesterday was only the third major one this year, though the one in August made me lose days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant, documented lies and general incompetence.
One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.
Too much capacity is money spent getting no return: up-front capex, ongoing opex, physical space in facilities, etc.
On cloud scales (averaged out over all the customers) the demand tends to follow pretty stable and predictable patterns, and the ones that actually tend to put capacity at risk (large customers) have contracts where they'll give plenty of heads-up to the providers.
What has been very problematic over the past few years has been the supply chains. Intel's issues for a few years in getting CPUs out really hurt the supply chains. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc on everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.
The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even predictable which supply chain is going to be affected. Some of them are running far smoother, and capacity lands far faster than you'd expect, while others are completely messed up; then next month it's all flipped around. They're being paranoid, assuming the worst, and still not getting it right.
This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.
The best thing you can do is be as hardware-agnostic as is technically possible, so you can use whatever is available... which sucks.
It is true I can get an instance most of the time, but not if I need >16GiB GPU memory.
As far as chip shortages, it probably helps that Amazon makes its own chips. Microsoft could do the same rather than running out of capacity and blaming chip shortages.
Microsoft had to know that at some point they were going to run out of capacity. They should've either done something about it or let customers know.
Even when Microsoft was being open about Azure having difficulty getting Intel chips, AWS, GCP etc. were in the same position and just not really talking about it. From my time in AWS, there were some other times when some services with specialised hardware came really, really close to running out of capacity and had to scramble with major internal "fire drills" across services to recoup capacity.
Most people won't run into these issues, the clouds all tend to be good at it, but they still happen.
There are also advantages of the economy of scale and brand recognition. The more customers you have the more the capacity trends smooth out, the easier it is to predict need, even if you're still stuck with uncertainty on the ordering side.
If anything, I’m surprised we can just spin up a few hundred instances out of nowhere and not run into capacity issues.
MS also makes all sorts of crazy deals and commitments, and I wouldn’t be surprised if being co-located with a strategic customer may lead to local shortages of resources.
IDK what chips you are talking about; all x86 (which I assume is most of their compute) is Intel or AMD. If they make their own, it's only for the ARM instances.
https://aws.amazon.com/silicon-innovation/
Yup. And a few of the OEMs have stopped talking about supply chain integrity. Many folks have observed more memory and power supply problems since the pandemic.
Spot instances exist just to try to turn over-provisioning into not a complete loss. You're at least making some money from your mistake.
edit: You should consider "spot instances" in general to be a failure as far as a cloud provider is concerned. It means you've got your guesses wrong. You always want a buffer zone, but not that much of a buffer zone. The biggest single cost for cloud providers is the per-rack OpEx, the cost of powering, cooling etc.
Wasn't the whole point of "the cloud" that these things shouldn't happen?
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.
I don't believe that is even remotely correct.
It isn't the pricing you should be worried about but the staffing, redundancy, and 24/7 operations staff.
I'm dealing with AWS and on-prem. On-prem spent some $5M to build out a whole new setup; it took literally multiple months of racking, planning, designing, setting up, etc.
It's not even entirely in use because we got supply chain issues for 100 Gbit switches, and they won't be coming until at least April of 2023 (after many months of delays upon delays already).
TAMs tend to be a bandaid organizational sign that support-as-normal sucks and isn't sufficient to get the job done (i.e., fix everything that breaks and isn't self-serve).
Otherwise, especially if there’s a broader problem, they play lots of games with SLAs, etc.
Their support was also amazing in the beginning... but after they hooked you up, you're just a ticket in their system. It takes weeks to fix something you could fix in minutes on-prem, or that their black belt would have fixed in a very short amount of time at the beginning of the relationship.
Cloud isn't that magical unicorn!
Another advantage of not having to own the hardware is that it's easier to scale and to get started with new types of services (e.g., data warehouse solutions, serverless compute, new DB types, ...).
I'm not trying to advocate for or against cloud solutions here, but just pointing out that the decision making has more factors apart from "hardware cost".
In the past two or three years, we probably moved more services off the cloud than the other way around. That said, one reason for that is that most new services are built in the cloud, so there are fewer services off the cloud than on it.
Cloud is best when you are starting out, when you don't know what you need, when you need high velocity adding new stuff, or when you have very bursty demand for traffic or CPU. Or if you are just a small, developer-only team.
But if you have applications that are relatively stable, are mostly feature complete, and you don't expect much sudden growth, it's useful to run the numbers on whether cloud is still something you want/need.
> setting up in a new region would be complicated for us.
Sounds to me like you've got a few weeks to get this working. Deprioritize all other work, get everyone working on this little DevOps/Infra project. You should've been multi-region from the outset, if not multi-cloud.
When using the public cloud, we do tend to take it all for granted and don't even think about the fact that physical hardware is required for our clusters and that, yes, they can run out.
Anyways, however hard getting another region set up may be, it seems you've no choice but to prioritize that work now. May also want to look into other cloud providers as well, depending on how practical or how overkill going multi-cloud may or may not be for your needs.
I wish you luck.
(As an aside I also agree that multi cloud from the get go is a YAGNI violation. Just keep in the back of your mind “could we have an alternative to this?” when using your provider’s proprietary features.)
Just having the plan is already expensive enough.
None are ideal.
Multi-cloud is really not a big deal. Main nuisance is billing differences, followed by slight variations in e.g. Terraform config.
Multi-cloud should only be for mission critical infrastructure. Very little infrastructure is mission critical. Most other use cases can be temporarily wallpapered over with an "Under maintenance" page unless there's a good reason otherwise.
Multi-cloud introduces more risk than it prevents. Which is why things like simulated failovers and BCP testing are constantly required.
I'm totally on board with the idea of being scrappy and taking shortcuts in order to get to PMF as soon as possible. However, it seems the proof is in the pudding here. If you can't service customers due to lack of compute resources, you can't get to PMF.
Also, yes, there are certain infrastructure and network topologies that would absolutely be overkill for a young startup. I don't think multi-region is one of those things. I don't have experience with Azure directly, but with every other cloud provider, going multi-region is not something that requires huge amounts of time or resources. You just need to be mindful of it from the outset. And if you decide not to be, then at least be intentional and conscious about the risk and have a plan in place for what happens when you get bit by deciding not to go multi-region.
Sounds like customers are coming in thick and fast.
If this is the dynamic and the company can't spare a few weeks to solve it, something has gone seriously wrong in a very interesting way.
Also, lots of companies assert GDPR compliance via magical thinking. They most often are wholly wrong. Shopify can say whatever they want, but there’s no certification body.
Source: I’m the person who evaluates and builds compliance systems for a range of services you almost definitely use.
I don't want to be snarky, but when large service providers like AWS have their own cross-region downtime because one snowflake of a service in us-east-1 is down, I kind of dismiss the virtue signaling of highly resilient multi-(AZ/region/cloud) ever existing in practice.
If you can somehow have a separate database per region/cloud, sure, I can understand that, but if you have to shard your database across many clouds, I'd dread having to tame such a beast, especially within a startup.
So you're saying it's impossible to improve reliability from 97% to 99% because you can never make it to 100%.
Multi-AZ and multi-region add complexity and cost much more quickly than they add reliability.
Sometimes it is worth it. Sometimes it is not.
Without knowing the details about your services and infrastructure, it's hard for me to know what's involved in going multi-region now. Are you sure it's such a gargantuan effort? I would've thought one person working full-time on this for a week or two would be enough, but again I don't know the details of your setup.
One option would be to pay a consultant who is an expert in Azure/cloud stuff to come in and help. May not be cheap, but could be a lot better and quicker for you and better for the business, especially if none of you are really big experts in Azure.
I've been here before (I think)...had to wear many hats and scramble to make sales, build the tech, act as de facto DevOps person even without a lot of experience doing it, etc. That is the way, but stuff happens.
Happy to chat about specifics if you want to bounce ideas off of me or go through your particular situation. Can't promise I'll have concrete advice, but happy to talk it through.
Very poor position to be in; apparently this happened in Azure UK recently too.
Let the scar you get from this be a learning experience; hopefully you will not fall into the trap of trusting this company again.
In my career I'm at a place where anyone suggesting I do work on Azure gets an instant doubling of my asking day-rate, and I really hope they will be put off and find another victim for the gig.
That said, another learning experience would be to use Terraform or something (tbh, for Azure the only sane thing is Terraform; ARM templates are just garbage). Having terraformed your one region, switching to the other would be much easier, though not trivial.
If you rely on Kubernetes for orchestration and have minimal cloud API dependency, it may be worth evaluating this option.
Also, do you have a TAM associated with your account? Are you just going through regular support channels? Can they deliver different instance types (not sure what the Azure parallel is), can they deliver short term capacity, etc?
I would try to push Microsoft more here. It's not like they've stopped on-boarding new customers into that region right? What happens if you create a new account in that region?
If you don't you're at a big disadvantage.
Similarly, just keep trying to change the size. Often it’ll go through when someone else decommissions something.
The way to get more from most cloud is by becoming a partner, not just a customer. And the way to do that is increase dependency and usage.
This is doubly worthwhile as if this stumble kills the startup (it can happen) this will be excellent experience to take to the next employer :)
Everyone? That's not going to help.
The biggest advice I can give is 1. keep trying and grabbing capacity continuously, then run with more than what you need, and 2. explore migrating to another Azure region that is less constrained. You mention a new region would be complicated, but it is likely much easier than another cloud.
1. https://www.zdnet.com/article/azures-capacity-limitations-ar...
This is also a problem internally for Microsoft. GitHub and LinkedIn still operate in private datacenters due to Azure capacity issues.
... wait, what? How are they defining 'reserved'?
Dedicated capacity exists, but it’s different (compute reservation groups or dedicated hosts).
You can combine CRG/DH with RI for the desired effect, although IMO it’s a bit confusing.
(Azure employee)
The billing thing became more of the point as big AZ failures are so rare.
As ridiculous as it sounds, having an enterprise's applications exist on multi-cloud isn't terrible if the application is mission critical - not only does this get around Azure's constant provisioning issues but protects an organization from the rare provider failure. (Though multi-region AWS has never been a problem in my experience, there is a first time for everything.) Data transfer pricing between clouds is prohibitively expensive, especially when you consider the reason why you may want multi-cloud in the first place (e.g., it's easier to provision 1000+ instances on AWS than Azure for an Apache Spark cluster for a few minutes or hours execution - mostly irrelevant if your data lives in Azure Data Lake Storage).
Building on cloud requires a lot of trade offs, one being a need for very robust cross-region capability and the ability to be flexible with what instance types your infrastructure requires.
I’d use this as a driver to either invest in making your software multi regional or cloud agnostic. Multi regional will be easier. If you’re already on k8s you should have a head start here.
The major cloud services are expensive. This extra cost is supposed to provide for cloud services' high level of flexibility. Running out of capacity should be a rare event and treated as a high priority problem to be fixed asap.
Without the ability to rapidly and arbitrarily scale, they're just overpriced server farms.
I mean, that's what cloud is (an outsourced server farm). Sure, they also offer services on top, but that's mostly because they want to lock you in and can charge more, so it's a win-win for them.
And there is no magic here: someone has to get the chips, build servers and connect them to the network. And while they will often overbuild for capacity, they will never do it to a degree where they can't run out, because that would be way too expensive and not financially viable.
I don't think any cloud will ever be able to guarantee to never run out of resources.
I agree with this, but clearly there's a disconnect between how often people expect these kinds of issues and how often they actually happen. The whole point of the cloud is you pay a premium for the added flexibility. If it turns out that flexibility isn't there when you need it then maintaining your own servers becomes a lot more attractive.
They can't magic chips into existence, but leaving a major region like Germany high & dry for almost half a year frankly sounds like planning went wrong. If it were a matter of chips, I would have thought that on a 3+ month timescale they could steal a few from another region that has a bit of fat.
That's exactly what a cloud is. It's someone else's datacenter with an API.
Ideally you have a script that goes from credentials to the service to a complete working instance.
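For what it's worth, here's a minimal sketch of what "credentials in, working instance out" looks like, assuming an EC2-style API via boto3 (which the provider described below clearly doesn't offer); the AMI ID and instance type are placeholders, not anything from this thread:

    # Minimal "credentials to running instance" sketch, assuming an EC2-style API.
    # The AMI ID and instance type are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")  # credentials come from the environment

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="m5.xlarge",         # placeholder type
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Block until the instance is actually running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print(instance_id, "is running")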
Instead of providing you with a list of the resources they do have, you have to play this weird game where you ask for specific instances in specific regions and then within several hours someone emails back to say yes or no.
If it’s no, you have to guess again where you might get the instance you want and email them again and ask.
I envisage going to an old shop, and asking the shopkeep for a compute instance in a region. He hobbles out the back, and after a long delay comes back and says “nope, don’t have no more of them, anything else you might want?”.
It's surprising this is how it works. Not the auto-scaling cloud computing used to bring to mind.
I just Googled it. Gartner estimated $125 billion.
https://www.fiercetelecom.com/telecom/cloud-and-colocation-d...
You have to do this for every single instance type they have; you can't even experiment with or test other instance types because it's too much trouble to get quota.
21st century man…. it’s coming.
Computers don't fix everything. They just allow you to f*ck up bigger, harder, and faster, usually in the most banal way imaginable.
> Yes it’s weird that you have to ask them for instances which some actual physical person looks at your request, thinks about it and says yes or no to.
And the default quota is low, like 10 CPUs. So you want to start a 2-node k8s cluster with 8 CPUs each? Nope, go request a quota increase.
And sometimes, that is hard. I've had Azure support not able to understand what quota they need to raise / what quota is being requested. I had to at least link them to their own documentation on it… (partly the confusion is that quota support tickets allow selecting the quota as a piece of metadata on the ticket, but only for some quotas, and of course, mine was for one of the ones not listed. Why they don't just list all of them is anyone's guess.)
The documentation is terrible and the Azure portal is so slow and laggy I can’t even believe it. Not to mention how unreliable their stack is.
In particular, GPU availability has been a continuing problem. Unlike interchangeable x64 / arm64 instances with some adjustments based on the new core and ram count... if no GPU instances are available then I simply cannot run the job. AMD's improved support has increasingly provided an alternative in some situations but the problem persists.
I recommend doing the work to make the business somewhat cloud agnostic, or at the very least multi-region capable. I realize this is not an option for some services that have no equivalent on other clouds but you mentioned databases and k8s clusters which are both supported elsewhere.
All cloud providers charge much, much more for GPUs than if you run a local machine.
Cloud GPUs are also a lot slower than state of the art consumer GPUs.
Cloud GPUs: much slower, less available, much more expensive.
However, lots of people only need those accelerators once in a while, so time sharing (aka cloud computing) makes a lot of sense and saves a ton of money overall. For FPGAs and some compute GPU applications, not having to handle support for your accelerators is also nice.
Sure, you could buy all that equipment, but I'd wager it's cheaper, more agile, and higher velocity to have it in the cloud.
Local GPUs are a big up-front cost. But assuming that your workload is stable, in the long run I think local GPUs end up being cheaper per-hour than cloud.
For startups, it doesn't make sense to make the up-front purchase, fine. But if you're optimizing for long-term (amortized) costs, I'd be curious if cloud is cost-effective.
But yes, if a single workstation can meet your GPU training needs, then it'll be cheaper with sufficient usage.
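To make the break-even point concrete, here's a back-of-the-envelope calculation; both prices are illustrative assumptions, not quotes from any provider:

    # Rough break-even between buying a local GPU and renting one in the cloud.
    # Both prices are assumptions for illustration only.
    LOCAL_GPU_COST = 1800.0   # assumed one-off price of a consumer GPU (USD)
    CLOUD_GPU_RATE = 1.50     # assumed on-demand price per GPU-hour (USD)

    break_even_hours = LOCAL_GPU_COST / CLOUD_GPU_RATE
    print(f"Break-even after ~{break_even_hours:.0f} GPU-hours")  # ~1200 hours

    # At a steady 8 hours of training per working day (~250 days/year),
    # that's roughly 2000 GPU-hours a year, i.e. break-even well within a year.

Of course this ignores power, the rest of the workstation, and the residual value of the card, but it shows why steady workloads tip the math toward local hardware.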
GPUs are better run close to your data. If you're training on-prem then your data needs to be on-prem too.
AWS for sure has had resource constraints in different AZs (especially during Black Friday and holiday loads), but I have never had an issue finding resources to spin up, especially if I was willing to be flexible on VM type.
The original poster probably has the ability to spin up other instance types in their region. If there is no compute capacity in the entire region, something went wrong operationally.
I'm not suggesting you should put in a request for every new resource you need, but if you have a specific instance type or a large number needed, it helps. You're not losing the ability to shut them down the next day if you don't need them, you're just telling the Azure team that you expect to spin some up around a certain time. If you're making a significant request of compute capacity, the team has the ability to reserve those instances for your subscriptions so that you're not competing with others for those cores.
Besides what’s already been said, internal capacity differs HUGELY based on VM SKU. If you need GPUs or something it’ll be tough. But a lot of the newer v4/v5 general compute SKUs (D/Da/E/Ea/etc) have plenty of capacity in many regions.
If changing regions sounds like a pain, consider gambling on other VM size availability.
(azure employee)
Are you really sure you shouldn't just buy a bunch of machines (500cores/2TiB go for ~60k€), throw them into a colo and then spend that time on actually doing stuff?
Yikes, this is exactly the kind of thing you need to come to expect when working with MSFT.
The timeframe they gave would match that kind of ask.
I wonder whether you see the same behavior from other cloud providers there (i.e., if you ask them whether new capacity is available, what do they say).
I doubt it. It will be easier - and probably safer - to ask citizens and physical industry (eg, factories) to bear the brunt than to risk having problems in critical IT infrastructure. Ask people and factories to turn the heat 3 degrees down and the effects will be more or less predictable. Asking to shut compute power down at random will have unpredictable consequences.
I suspect AWS and GCP just have more headroom in EU.
List what you do have available so we can choose.
Do not force users to randomly guess and be refused until eventually finding something available.
I can see why they wouldn't want to do this.
It’s infuriating that AWS doesn’t have an API that returns a list of AZs with available inventory for a given instance type.
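Since there's no inventory API, the only workaround I know of is to probe: attempt a launch in each AZ and treat an InsufficientInstanceCapacity error as "no stock here right now". A sketch with boto3 (the AMI and instance type are placeholders, and the probe instance is terminated immediately, though it still costs a tiny amount):

    # Probe per-AZ capacity for a given instance type by attempting launches.
    # AMI and instance type are placeholders.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="eu-central-1")
    azs = [z["ZoneName"] for z in ec2.describe_availability_zones()["AvailabilityZones"]]

    for az in azs:
        try:
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",  # placeholder
                InstanceType="p3.8xlarge",        # placeholder
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": az},
            )
            instance_id = resp["Instances"][0]["InstanceId"]
            ec2.terminate_instances(InstanceIds=[instance_id])  # it was only a probe
            print(az, "capacity available")
        except ClientError as e:
            if e.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                print(az, "no capacity right now")
            else:
                raise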
There’s lots of providers apart from AWS/Azure/GCP.
Or buy a machine and put it in your office.
Self hosting can often be cheaper and more available and probably faster than using a cloud.
I don’t know about price point. Dedicated servers can be cheaper than cloud in many cases, if you have the appropriate know-how, and the cloud business is very profitable for a reason.
I don't always care if you give me an E8_v4 or a D8 instead, just give me something. With all the 100s of variants of VMs that are available, finding an exact match is obviously an unnecessary constraint. Maybe they already simulate this behind the scenes, I don't know, though given the sizes are advertised with HW capabilities I'd imagine they can't really simulate a v4 using a v5 and vice versa.
The only place I've seen compute treated this fluidly is Container Instances, which is a bad choice for many, many other reasons.
It's not the exact metric but you can find which have more availability without knowing the exact number (which is constantly changing anyway)
1. Government/health/defence cloud customers
2. Teams, which was exploding in use and they wanted to capitalise on it
3. Regular cloud customers
You will be threatened by your own unreliability of building something that's dependent on one region or one cloud.
I suspect that the skills for real HA are atrophying because for 99% of the people multi-AZ is enough and most of the AWS stuff supports multi-az automagically.
The problem with multi-region is that it means configuration, and there are probably lots of services that you can't actually configure to be multi-region. Cognito is one off the top of my head. It looks like the various aurora flavors do multi-region, but what about Neptune? SQS? API Gateway? AWS Lambda? MediaLive?
Maybe you can hide all that behind DNS failover, maybe you can't.
Real multi-region basically means going back to old-school HA, and that was hard to do when it was your own data centers. On AWS it'll be even harder.
That isn't to say it's not possible, it's just a tremendous amount of work.
I mean really, if us-east-1 is down, 80% of the internet is screwed... so from an expectations point of view, does HA of your particular service matter if that happens? Even for a financial company, outages happen.
Once you have enough people it might be worth it. For a non mission critical startup? No fucking way.
This has vast benefits for agility and fast development when developers are not always fighting the build system and have a "no fear" attitude about deployment.
If you have that, you can build a system in another region and be able to migrate wholesale to another region with more capacity and not be particularly concerned about the general problem of coordinating the service across multiple regions at the same time.
Everything is for sure until it’s not.
For some clouds that seem to be run on a manual process (IBM, Oracle) that would be expected, since they're sort of clunky. For other places (Rackspace, etc.) it would be uncommon. For a major provider like Azure, well, it's bizarre. I mean, the whole point of cloud is that it's all-you-can-eat.
You would think that this would be something they would advertise/talk about up-front. But who would sign up if that was disclosed?
It is shocking to me that it happened at all. Capacity planning shouldn't be so far behind in a cloud that wants to position itself as being on par with AWS/GCP. (Which Azure absolutely isn't.) To me, having capacity planning be solved is part of what I am paying for in that higher price of the VM.
> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.
Oh my sweet summer child, welcome to Azure. Don't depend on them being proactive about anything; even depending on them to react is a mistake, e.g., they do not reliably post-mortem severe failures. (At least, externally. But as a customer, I want to know what you're doing to prevent $massive_failure from happening again, and time and time again they're just silent on that front.)
It took me six months to get approved to start six instances! With multiple escalations including being forcibly changed to invoice billing - for which they never match the invoices automatically, every payment requires we file a ticket.
Unlike GCP and Azure, all AWS regions are (were) partitioned by design. This "blast radius" is (was) fantastic for resilience, security, and data sovereignty. It is (was) incredibly easy to be compliant in AWS, not to mention the ruggedness benefits.
AWS customers with more money than cloud engineers kept clamoring for cross-region capabilities ("Like GCP has!"), and in the last couple of years AWS has been adding some.
Cloud customers should be careful what they wish for. If you count on it in the data center, and you don't see it in a well-architected cloud service provider, perhaps it's a legacy pattern best left on the datacenter floor. In this case, at some point hard partitioning could become tough to prove to audit and impossible to count on for resilience.
UPDATE TO ADD: See my123's link below, first published 2022-11-16, super helpful even if familiar with their approach.
PDF: https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-faul...
There are AWS region partitions - general, China, US GovCloud (public), US gov secret and US gov top-secret.
Inside a partition, there can be some regions that are opt-in - see https://docs.aws.amazon.com/general/latest/gr/rande-manage.h...
My understanding is that opt-in regions are even more isolated inside a specific partition for partition-global services like IAM and maybe some other stuff.
Could you elaborate on this a little? We use AWS, but are evaluating OCI for certain (very specific) cases, and I'll love to know what questions to ask for comparison purposes.
Here is how partitioned/isolated OCI is by design:
https://www.wiz.io/blog/attachme-oracle-cloud-vulnerability-...
While that's fixed, it speaks volumes to the architecture. Very little has changed since 2018: https://www.brightworkresearch.com/how-to-understand-the-pro...
As noted there, I'd argue OCI is more akin to Softlayer/Bluemix than to GCP, Azure, or AWS, but depending on your certain very specific cases OCI may still be appropriate.
That said: We also had this issue on GCP last month.
We found that all three (AWS, Azure, GCP) are unreliable in their own ways.
The strategy they helped us arrive at was two-pronged:
1. Pre-launch all needed infrastructure. Yes, for all their "cloud scale", it was actually suggested that we preallocate all of our servers the week before, rather than rely on autoscaling.
2. Order capacity reservations for all of those instances (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capa...). This ensures that, if any of those instances go bad, we'd be able to relaunch them without going to the back of the line and finding out that there was no more compute capacity available.
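For anyone who hasn't used them, a capacity reservation is a single API call; here's a sketch with boto3, where the instance type, AZ, and count are placeholders rather than what we actually reserved:

    # Pre-book on-demand capacity so a later launch can't fail for lack of stock.
    # Instance type, AZ, and count are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")

    reservation = ec2.create_capacity_reservation(
        InstanceType="m5.2xlarge",
        InstancePlatform="Linux/UNIX",
        AvailabilityZone="eu-central-1a",
        InstanceCount=20,
        EndDateType="unlimited",  # hold it until explicitly cancelled
    )
    print(reservation["CapacityReservation"]["CapacityReservationId"])

You pay for the reserved capacity whether you use it or not, so cancel it once the launch window is over.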
While we could've just swapped a deployment parameter to deploy to another region, we opted to just use a different SKU of VMs for a short period and switch back to the VMs when they were available again.
We haven't seen issues since.
^^ and by capacity I am talking like 10s or 100s of VMs being available, not 1.
It is the function of the capacity manager to help you plan ahead based on what the data center capacities look like going into the future.
Meet monthly with your capacity manager. Get representation across different technology interests - database, compute, storage, event hubs, etc. Don't ever skip these meetings.
Well that's an unfortunate acronym collision.
GODDAMN.
Sidebar: MSFT is the king of acronym collisions.
I'd be surprised if other cloud providers aren't doing that in some form. I only have experience with Azure (so far).
It’s crazy that this could be valid advice, but it is.
(Speaking from experience - one of our portfolio companies had a similar challenge and we used our network to get to one of the execs of the vendor involved)
Something to try in scenarios like this is to add the “weird and wonderful” VM SKUs that are less popular and may still have capacity remaining.
For example, the HPC series like HBv2 or HBv3. Also try Lsv3 or Lasv3.
Sure, they're a bit more expensive, but you only have to use them until April.
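If you want to script the gamble rather than guess, here's a sketch using the azure-mgmt-compute Python SDK (the region and subscription ID are placeholders; I believe the restrictions field flags SKUs blocked for your subscription or location, but treat that as an assumption to verify):

    # List VM SKUs in a region and whether they're restricted for this subscription.
    # Region and subscription ID are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

    for sku in client.resource_skus.list(filter="location eq 'germanywestcentral'"):
        if sku.resource_type != "virtualMachines":
            continue
        blocked = [r.reason_code for r in (sku.restrictions or [])]
        print(sku.name, "restricted:" if blocked else "available", blocked or "")

Note this only shows restrictions, not live capacity; allocation can still fail at deploy time, which is the whole complaint in this thread.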
I worked for a company that ran mostly on-prem until a year ago, and the last time they ordered machines, availability from Dell was scarce, with huge delays.
Are you stuck only to the German region, and can't go to other European regions?
Had you never heard about (and this is unfortunately not a joke) Microsoft’s music service they once had, shut down after a few short years leaving customers without the ability to listen to the music they had paid to listen to?
The service was called, this was the trademarked name, Microsoft “Plays for Sure.” You cannot make this stuff up.
I also sort of suspect the spot market is less robust there. Lots of Azure is lift and shift on premises workloads, and those aren't using spot. Without people using spot, it's even harder to have spare capacity...
Our quota was silently set to 0 while there were still instances running. This worked fine until auto-scale scaled the instances down to 1 in the night. At the start of the day, auto-scale was not able to scale back up to the initial amount, which led to heavy performance issues and outages. We needed to move the instances, as Azure support did not help us. After many calls with Azure and multiple teams involved, we ultimately did not get the quota approved (even though we already had it and were not asking for "new" quota).
We also decided that we can no longer host in the German Azure region. Even if we could get the quota, not being able to scale for unexpected traffic is a business risk we don't want to bear anymore.
This is huge for us, as our application requires German servers. We are still researching where to host in the future.
They cannot warn you because it's very hard to predict how many new customers will come or if the existing ones will create more instances.
I know about a bank with the same issue: basically, they've hogged all the resources in a specific region and yet they need more. Unfortunately, these things take time; MS cannot set up a new datacenter in a couple of days.
>but setting up in a new region would be complicated for us.
Why? it's easy: https://learn.microsoft.com/en-us/azure/azure-resource-manag...
Latency issues from app to DB?
I've never done k8s on Azure, but my understanding is that Azure is pretty good about coordinating between your own datacenter running Windows and Azure. Maybe you can spin up some Windows boxes in a cheap datacenter to make it work?
I sure hope so, as a German company
This seems like your fundamental problem. If you design an architecture that is limited to a single region of a single cloud provider, you are very likely to encounter issues at some point.
Luckily you have a full month to solve this problem before it will prevent you from accepting new users. My suggestion is to start making your app multi-regional or multi-provider ASAP.
If you aren't a big spender you may not have a TAM who can get this info for you. Welcome to Azure.
Terraform has a different provider for each cloud provider, and the code is no more transferable than a Python script for your infrastructure would be.
The reason for Terraform, and it's a good one, is that your Terraform-related tooling doesn't have to change (e.g., if you route all your infra change approvals through Terraform Cloud), and you can coordinate multi-service changes, e.g., update Auth0 infra to do X, then AWS to do Y.
This seems like a much larger issue than they're making it seem. The promise of the cloud was unlimited scalability. I never thought of cloud resources as finite.
If you need more reliability, I see only one way out: Go multi-region or even multi-cloud.
(You could depend on another startup with no revenue).
Maybe you can spin up some parts of the infrastructure that are not latency-sensitive in a nearby region?
(Also... If into k8s, python, GPUs, graphs, viz, MLOps, working with sec/fraud/supplychain/gov/etc customers on cool deploys, and looking for a remote job, we are hiring for someone to take ownership here!)
https://www.giantswarm.io/
(I work at Giant Swarm.)
Now, they could go the GDPR/Cookies route and prompt absolutely every user on pageload, but doing so would annihilate the purpose of the law into monotonous smithereens, just as it did with Cookies. Good on them for defaulting to the "more secure" mode, but yes this is a potential consequence.
Happy to hear from any German amigos present if I've got something wrong. (But watch out... you might be putting HN at risk - their servers aren't (likely) in Germany!)
[1]: https://incountry.com/blog/which-german-data-privacy-laws-yo...
https://learn.microsoft.com/en-us/previous-versions/azure/ge...
Daily reminder that cloud services are vastly less reliable than traditional hosting; it’s just that they manipulate the terminology to deflect that, replacing reliability with availability, aka “making impression of working”.
They seem to ignore, then repent, and finally apologise. :(
I think you should switch to a new compute provider. GCP?
When we were running our own compute back in '09 and resources ran out or were unreliable, we could shout at the server maintainer and/or install better hardware ourselves. Not the case anymore. :(
-Vip
Which brings me to another important point. If we run out of computers, meaning supply can't keep up with demand, then who are the winners? The people who own the computers: cloud providers and self-hosters. Because of the high demand, cloud providers can raise their prices, and that's directly converted to profit since expenses remain the same, i.e. price gouging. Good job, all you cloud loyalists who use the cloud for everything.