Last week we at n8n ran into problems getting a new database from Azure. After contacting support, it turns out that we can’t add instances to our k8s cluster either. Azure has told us they'll have more capacity in April 2023(!) — but we’ll have to stop accepting new users in ~35 days if we don't get any more. These problems seem limited to the German region, but setting up in a new region would be complicated for us.
We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.
Is anyone else experiencing these problems?
You're new to Azure I guess.
I'm glad the outage I had yesterday was only the third major one this year, though the one in August made me lose days of traffic, months of back and forth with their support, and a good chunk of my sanity and patience in the face of blatant, documented lies and general incompetence.
One consumer-grade fiber link is enough to serve my company's traffic, and with two months of what we pay MS for their barely working cloud I could buy enough hardware to host our product for a year or two of sustained growth.
Too much capacity is money spent getting no return: up-front capex, ongoing opex, physical space in facilities, etc.
On cloud scales (averaged out over all the customers) the demand tends to follow pretty stable and predictable patterns, and the ones that actually tend to put capacity at risk (large customers) have contracts where they'll give plenty of heads-up to the providers.
What has been very problematic over the past few years has been the supply chains. Intel's issues for a few years in getting CPUs out really hurt the supply chains. All of the major providers struggled through it, and the market is still somewhat unpredictable. The supply chain woes that have been wreaking havoc on everything from the car industry to the domestic white goods industry are having similar impacts on the server industry.
The level of unreliability in the supply chain is making it very difficult for the capacity management folks to do their job. It's not even predictable which supply chain is going to be affected. Some of them are running far smoother, and capacity lands far faster than you'd expect, while others are completely messed up; then next month it's all flipped around. They're being paranoid, assuming the worst, and still not getting it right.
This is an area where buying physical hardware directly doesn't provide any particular advantages. Their supply chains are just as messed up.
The best thing you can do is be as hardware-agnostic as is technically possible, so you can use whatever is available... which sucks.
It is true I can get an instance most of the time, but not if I need >16GiB GPU memory.
As far as chip shortages, it probably helps that Amazon makes its own chips. Microsoft could do the same rather than running out of capacity and blaming chip shortages.
Microsoft had to know that at some point they were going to run out of capacity. They should've either done something about it or let customers know.
Even when Microsoft was being open about Azure having difficulty getting Intel chips, AWS, GCP etc. were in the same position and just not really talking about it. From my time in AWS, there were some other times when some services with specialised hardware came really, really close to running out of capacity and had to scramble with major internal "fire drills" across services to recoup capacity.
Most people won't run into these issues, the clouds all tend to be good at it, but they still happen.
There are also advantages of the economy of scale and brand recognition. The more customers you have the more the capacity trends smooth out, the easier it is to predict need, even if you're still stuck with uncertainty on the ordering side.
If anything, I’m surprised we can just spin up a few hundred instances out of nowhere and not run into capacity issues.
MS also makes all sorts of crazy deals and commitments, and I wouldn’t be surprised if being co-located with a strategic customer may lead to local shortages of resources.
IDK what chips you are talking about; all x86 (which I assume is most of their compute) is Intel or AMD. If they make their own, it's only for the ARM instances.
https://aws.amazon.com/silicon-innovation/
Yup. And a few of the OEMs have stopped talking about supply chain integrity. Many folks have observed more memory and power supply problems since the pandemic.
Spot instances exist just to try to turn over-provisioning into not a complete loss. You're at least making some money from your mistake.
edit: You should consider "spot instances" in general to be a failure as far as a cloud provider is concerned. It means you've got your guesses wrong. You always want a buffer zone, but not that much of a buffer zone. The biggest single cost for cloud providers is the per-rack OpEx, the cost of powering, cooling etc.
Wasn't the whole point of "the cloud" that these things shouldn't happen?
https://azure.microsoft.com/en-us/blog/summary-of-windows-az...
I've worked with both Azure and AWS professionally and both have had their fair share of "too many outages" or capacity issues. At this point, you basically must go multi-region to ensure capacity and even better if you can go multi-cloud.
I don't believe that is even remotely correct.
It isn't the pricing you should be worried about but the staffing, redundancy, and 24/7 operations staff.
I'm dealing with AWS and on-prem. On-prem spent some $5M to build out a whole new setup; it took literally multiple months of racking, planning, designing, setting up, etc.
It's not even entirely in use because we got supply chain issues for 100 Gbit switches, and they won't be coming until at least April of 2023 (after many months of delays upon delays already).
TAMs tend to be a bandaid organizational sign that support-as-normal sucks and isn't sufficient to get the job done (i.e., fix everything that breaks and isn't self-serve).
Otherwise, especially if there’s a broader problem, they play lots of games with SLAs, etc.
Their support was also amazing in the beginning... but after they hooked you up, you're just a ticket in their system. It takes weeks to fix something you could fix in minutes on-prem, or that their black belt would have fixed in a very short amount of time at the beginning of the relationship.
Cloud isn't that magical unicorn!
Another advantage of not having to own the hardware is that it's easier to scale and to get started with new types of services (e.g., data warehouse solutions, serverless compute, new DB types, ...).
I'm not trying to advocate for or against cloud solutions here, but just pointing out that the decision making has more factors apart from "hardware cost".
In the past two or three years, we probably moved more services off the cloud than the other way around. That said, one reason for that is that most new services are built in the cloud, so there are fewer services off the cloud than on it.
Cloud is best when you are starting out, when you don't know what you need, when you need high velocity adding new stuff, or when you have very bursty demand for traffic or CPU. Or if you are just a small, developer-only team.
But if you have applications that are relatively stable, are mostly feature complete, and you don't expect much sudden growth, it's useful to run the numbers on whether cloud is still something you want/need.
> setting up in a new region would be complicated for us.
Sounds to me like you've got a few weeks to get this working. Deprioritize all other work, get everyone working on this little DevOps/Infra project. You should've been multi-region from the outset, if not multi-cloud.
When using the public cloud, we do tend to take it all for granted and don't even think about the fact that physical hardware is required for our clusters and that, yes, they can run out.
Anyways, however hard getting another region set up may be, it seems you've no choice but to prioritize that work now. May also want to look into other cloud providers as well, depending on how practical or how overkill going multi-cloud may or may not be for your needs.
I wish you luck.
(As an aside I also agree that multi cloud from the get go is a YAGNI violation. Just keep in the back of your mind “could we have an alternative to this?” when using your provider’s proprietary features.)
Just having the plan is already expensive enough.
None are ideal.
Multi-cloud is really not a big deal. Main nuisance is billing differences, followed by slight variations in e.g. Terraform config.
Multi-cloud should only be for mission critical infrastructure. Very little infrastructure is mission critical. Most other use cases can be temporarily wallpapered over with an "Under maintenance" page unless there's a good reason otherwise.
Multi-cloud introduces more risk than it prevents. Which is why things like simulated failovers and BCP testing are constantly required.
I'm totally on board with the idea of being scrappy and taking shortcuts in order to get to PMF as soon as possible. However, it seems the proof is in the pudding here. If you can't service customers due to lack of compute resources, you can't get to PMF.
Also, yes, there are certain infrastructure and network topologies that would absolutely be overkill for a young startup. I don't think multi-region is one of those things. I don't have experience with Azure directly, but with every other cloud provider, going multi-region is not something that requires huge amounts of time or resources. You just need to be mindful of it from the outset. And if you decide not to be, then at least be intentional and conscious about the risk and have a plan in place for what happens when you get bit by deciding not to go multi-region.
Sounds like customers are coming in thick and fast.
If this is the dynamic and the company can't spare a few weeks to solve it, something has gone seriously wrong in a very interesting way.
Also, lots of companies assert GDPR compliance via magical thinking. They most often are wholly wrong. Shopify can say whatever they want, but there’s no certification body.
Source: I’m the person who evaluates and builds compliance systems for a range of services you almost definitely use.
I don't want to be snarky, but when large service providers like AWS have their own cross-region downtime because one snowflake of a service in us-east-1 is down, I kind of dismiss the virtue signaling of highly resilient multi-(AZ/region/cloud) ever existing in practice.
If you can somehow have a separate database per region/cloud, sure, I can understand that, but if you have to shard your database across many clouds, I'd dread having to tame such a beast, especially within a startup.
So you're saying it's impossible to improve reliability from 97% to 99% because you can never make it to 100%.
Multi-AZ and multi-region add complexity and cost much more quickly than they add reliability.
Sometimes it is worth it. Sometimes it is not.
Without knowing the details about your services and infrastructure, it's hard for me to know what's involved in going multi-region now. Are you sure it's such a gargantuan effort? I would've thought one person working full-time on this for a week or two would be enough, but again I don't know the details of your setup.
One option would be to pay a consultant who is an expert in Azure/cloud stuff to come in and help. May not be cheap, but could be a lot better and quicker for you and better for the business, especially if none of you are really big experts in Azure.
I've been here before (I think)...had to wear many hats and scramble to make sales, build the tech, act as de facto DevOps person even without a lot of experience doing it, etc. That is the way, but stuff happens.
Happy to chat about specifics if you want to bounce ideas off of me or go through your particular situation. Can't promise I'll have concrete advice, but happy to talk it through.
Very poor position to be in; apparently this happened in Azure UK recently too.
Let the scar you get from this be a learning experience; hopefully you will not fall into the trap of trusting this company again.
In my career I'm at a place where anyone suggesting I do work on Azure gets an instant doubling of my asking day-rate, and I really hope they will be put off and find another victim for the gig.
That said, another learning experience would be to use Terraform or something (tbh, for Azure the only sane thing is Terraform; ARM templates are just garbage). Having terraformed your one region, switching to the other would be much easier, though not trivial.
If you rely on Kubernetes for orchestration and have minimal cloud API dependency, it may be worth evaluating this option.
Also, do you have a TAM associated with your account? Are you just going through regular support channels? Can they deliver different instance types (not sure what the Azure parallel is), can they deliver short term capacity, etc?
I would try to push Microsoft more here. It's not like they've stopped on-boarding new customers into that region right? What happens if you create a new account in that region?
If you don't you're at a big disadvantage.
Similarly, just keep trying to change the size. Often it’ll go through when someone else decommissions something.
The way to get more from most cloud is by becoming a partner, not just a customer. And the way to do that is increase dependency and usage.
This is doubly worthwhile as if this stumble kills the startup (it can happen) this will be excellent experience to take to the next employer :)
Everyone? That's not going to help.
The biggest advice I can give is 1. keep trying and grabbing capacity continuously, then run with more than what you need, and 2. explore migrating to another Azure region that is less constrained. You mention a new region would be complicated, but it is likely much easier than another cloud.
1. https://www.zdnet.com/article/azures-capacity-limitations-ar...
This is also a problem internally for Microsoft. GitHub and LinkedIn still operate in private datacenters due to Azure capacity issues.
... wait, what? How are they defining 'reserved'?
Dedicated capacity exists, but it’s different (compute reservation groups or dedicated hosts).
You can combine CRG/DH with RI for the desired effect, although IMO it’s a bit confusing.
(Azure employee)
The billing thing became more of the point as big AZ failures are so rare.
As ridiculous as it sounds, having an enterprise's applications exist on multi-cloud isn't terrible if the application is mission critical - not only does this get around Azure's constant provisioning issues but protects an organization from the rare provider failure. (Though multi-region AWS has never been a problem in my experience, there is a first time for everything.) Data transfer pricing between clouds is prohibitively expensive, especially when you consider the reason why you may want multi-cloud in the first place (e.g., it's easier to provision 1000+ instances on AWS than Azure for an Apache Spark cluster for a few minutes or hours execution - mostly irrelevant if your data lives in Azure Data Lake Storage).
Building on cloud requires a lot of trade offs, one being a need for very robust cross-region capability and the ability to be flexible with what instance types your infrastructure requires.
I’d use this as a driver to either invest in making your software multi regional or cloud agnostic. Multi regional will be easier. If you’re already on k8s you should have a head start here.
The major cloud services are expensive. This extra cost is supposed to provide for cloud services' high level of flexibility. Running out of capacity should be a rare event and treated as a high priority problem to be fixed asap.
Without the ability to rapidly and arbitrarily scale, they're just overpriced server farms.
I mean, that's what cloud is (an outsourced server farm). Sure, they also offer services on top, but that's mostly because they want to lock you in and can charge more, so it's a win-win for them.
And there is no magic here: someone has to get the chips, build servers and connect them to the network. And while they will often overbuild for capacity, they will never do it to a degree where they can't run out, because that would be way too expensive and not financially viable.
I don't think any cloud will ever be able to guarantee to never run out of resources.
I agree with this, but clearly there's a disconnect between how often people expect these kinds of issues and how often they actually happen. The whole point of the cloud is you pay a premium for the added flexibility. If it turns out that flexibility isn't there when you need it then maintaining your own servers becomes a lot more attractive.
They can't magic chips into existence, but leaving a major region like Germany high & dry for almost half a year frankly sounds like planning went wrong. If it were a matter of chips, I would have thought that on a 3+ month timescale they could steal a few from another region that has a bit of fat.
That's exactly what a cloud is. It's someone else's datacenter with an API.
Ideally you have a script that goes from credentials to the service to a complete working instance.
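For what it's worth, here's a minimal sketch of what "credentials in, working instance out" looks like, assuming an EC2-style API via boto3 (which the provider described below clearly doesn't offer); the AMI ID and instance type are placeholders, not anything from this thread:

    # Minimal "credentials to running instance" sketch, assuming an EC2-style API.
    # The AMI ID and instance type are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")  # credentials come from the environment

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="m5.xlarge",         # placeholder type
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Block until the instance is actually running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print(instance_id, "is running")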
Instead of providing you with a list of the resources they do have, you have to play this weird game where you ask for specific instances in specific regions and then within several hours someone emails back to say yes or no.
If it’s no, you have to guess again where you might get the instance you want and email them again and ask.
I envisage going to an old shop, and asking the shopkeep for a compute instance in a region. He hobbles out the back, and after a long delay comes back and says “nope, don’t have no more of them, anything else you might want?”.
It's surprising this is how it works. Not the auto-scaling cloud computing used to bring to mind.
I just Googled it. Gartner estimated $125 billion.
https://www.fiercetelecom.com/telecom/cloud-and-colocation-d...
You have to do this for every single instance type they have; you can't even experiment with or test other instance types because it's too much trouble to get quota.
21st century man…. it’s coming.
Computers don't fix everything. They just allow you to f*ck up bigger, harder, and faster, usually in the most banal way imaginable.
> Yes it’s weird that you have to ask them for instances which some actual physical person looks at your request, thinks about it and says yes or no to.
And the default quota is low, like 10 CPUs. So you want to start a 2-node k8s cluster with 8 CPUs each? Nope, go request a quota increase.
And sometimes, that is hard. I've had Azure support not able to understand what quota they need to raise / what quota is being requested. I had to at least link them to their own documentation on it… (partly the confusion is that quota support tickets allow selecting the quota as a piece of metadata on the ticket, but only for some quotas, and of course, mine was for one of the ones not listed. Why they don't just list all of them is anyone's guess.)
The documentation is terrible and the Azure portal is so slow and laggy I can’t even believe it. Not to mention how unreliable their stack is.
In particular, GPU availability has been a continuing problem. Unlike interchangeable x64 / arm64 instances with some adjustments based on the new core and ram count... if no GPU instances are available then I simply cannot run the job. AMD's improved support has increasingly provided an alternative in some situations but the problem persists.
I recommend doing the work to make the business somewhat cloud agnostic, or at the very least multi-region capable. I realize this is not an option for some services that have no equivalent on other clouds but you mentioned databases and k8s clusters which are both supported elsewhere.
All cloud providers charge much, much more for GPUs than if you run a local machine.
Cloud GPUs are also a lot slower than state of the art consumer GPUs.
Cloud GPUs: much slower, less available, much more expensive.
However, lots of people only need those accelerators once in a while, so time sharing (aka cloud computing) makes a lot of sense and saves a ton of money overall. For FPGAs and some compute GPU applications, not having to handle support for your accelerators is also nice.
Sure, you could buy all that equipment, but I'd wager it's cheaper, more agile, and higher velocity to have it in the cloud.
Local GPUs are a big up-front cost. But assuming that your workload is stable, in the long run I think local GPUs end up being cheaper per-hour than cloud.
For startups, it doesn't make sense to make the up-front purchase, fine. But if you're optimizing for long-term (amortized) costs, I'd be curious if cloud is cost-effective.
But yes, if a single workstation can meet your GPU training needs, then it'll be cheaper with sufficient usage.
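To make the break-even point concrete, here's a back-of-the-envelope calculation; both prices are illustrative assumptions, not quotes from any provider:

    # Rough break-even between buying a local GPU and renting one in the cloud.
    # Both prices are assumptions for illustration only.
    LOCAL_GPU_COST = 1800.0   # assumed one-off price of a consumer GPU (USD)
    CLOUD_GPU_RATE = 1.50     # assumed on-demand price per GPU-hour (USD)

    break_even_hours = LOCAL_GPU_COST / CLOUD_GPU_RATE
    print(f"Break-even after ~{break_even_hours:.0f} GPU-hours")  # ~1200 hours

    # At a steady 8 hours of training per working day (~250 days/year),
    # that's roughly 2000 GPU-hours a year, i.e. break-even well within a year.

Of course this ignores power, the rest of the workstation, and the residual value of the card, but it shows why steady workloads tip the math toward local hardware.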
GPUs are better run close to your data. If you're training on-prem then your data needs to be on-prem too.
AWS for sure has had resource constraints in different AZs (especially during Black Friday and holiday loads), but I have never had an issue finding resources to spin up, especially if I was willing to be flexible on VM type.
The original poster probably has the ability to spin up other instance types in their region. If there is no compute capacity in the entire region, something went wrong operationally.
I'm not suggesting you should put in a request for every new resource you need, but if you have a specific instance type or a large number needed, it helps. You're not losing the ability to shut them down the next day if you don't need them, you're just telling the Azure team that you expect to spin some up around a certain time. If you're making a significant request of compute capacity, the team has the ability to reserve those instances for your subscriptions so that you're not competing with others for those cores.
Besides what’s already been said, internal capacity differs HUGELY based on VM SKU. If you need GPUs or something it’ll be tough. But a lot of the newer v4/v5 general compute SKUs (D/Da/E/Ea/etc) have plenty of capacity in many regions.
If changing regions sounds like a pain, consider gambling on other VM size availability.
(azure employee)
Are you really sure you shouldn't just buy a bunch of machines (500cores/2TiB go for ~60k€), throw them into a colo and then spend that time on actually doing stuff?
Yikes, this is exactly the kind of thing you need to come to expect when working with MSFT.
The timeframe they gave would match that kind of ask.
I wonder whether you see the same behavior from other cloud providers there (i.e., if you ask them whether new capacity is available, what do they say).
I doubt it. It will be easier - and probably safer - to ask citizens and physical industry (eg, factories) to bear the brunt than to risk having problems in critical IT infrastructure. Ask people and factories to turn the heat 3 degrees down and the effects will be more or less predictable. Asking to shut compute power down at random will have unpredictable consequences.
I suspect AWS and GCP just have more headroom in EU.
List what you do have available so we can choose.
Do not force users to randomly guess and be refused until eventually finding something available.
I can see why they wouldn't want to do this.
It’s infuriating that AWS doesn’t have an API that returns a list of AZs with available inventory for a given instance type.
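Since there's no inventory API, the only workaround I know of is to probe: attempt a launch in each AZ and treat an InsufficientInstanceCapacity error as "no stock here right now". A sketch with boto3 (the AMI and instance type are placeholders, and the probe instance is terminated immediately, though it still costs a tiny amount):

    # Probe per-AZ capacity for a given instance type by attempting launches.
    # AMI and instance type are placeholders.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="eu-central-1")
    azs = [z["ZoneName"] for z in ec2.describe_availability_zones()["AvailabilityZones"]]

    for az in azs:
        try:
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",  # placeholder
                InstanceType="p3.8xlarge",        # placeholder
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": az},
            )
            instance_id = resp["Instances"][0]["InstanceId"]
            ec2.terminate_instances(InstanceIds=[instance_id])  # it was only a probe
            print(az, "capacity available")
        except ClientError as e:
            if e.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                print(az, "no capacity right now")
            else:
                raise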
There’s lots of providers apart from AWS/Azure/GCP.
Or buy a machine and put it in your office.
Self hosting can often be cheaper and more available and probably faster than using a cloud.
I don’t know about price point. Dedicated servers can be cheaper than cloud in many cases, if you have the appropriate know-how, and the cloud business is very profitable for a reason.
I don't always care if you give me an E8_v4 or a D8 instead, just give me something. With all the 100s of variants of VMs that are available, finding an exact match is obviously an unnecessary constraint. Maybe they already simulate this behind the scenes, I don't know, though given the sizes are advertised with HW capabilities I'd imagine they can't really simulate a v4 using a v5 and vice versa.
The only place I've seen compute treated this fluidly is Container Instances, which is a bad choice for many, many other reasons.
It's not the exact metric but you can find which have more availability without knowing the exact number (which is constantly changing anyway)
1. Government/health/defence cloud customers
2. Teams, which was exploding in use and they wanted to capitalise on it
3. Regular cloud customers
You will be threatened by your own unreliability of building something that's dependent on one region or one cloud.
I suspect that the skills for real HA are atrophying because for 99% of the people multi-AZ is enough and most of the AWS stuff supports multi-az automagically.
The problem with multi-region is that it means configuration, and there are probably lots of services that you can't actually configure to be multi-region. Cognito is one off the top of my head. It looks like the various aurora flavors do multi-region, but what about Neptune? SQS? API Gateway? AWS Lambda? MediaLive?
Maybe you can hide all that behind DNS failover, maybe you can't.
Real multi-region basically means going back to old-school HA, and that was hard to do when it was your own data centers. On AWS it'll be even harder.
That isn't to say it's not possible, it's just a tremendous amount of work.
I mean really, if us-east-1 is down, 80% of the internet is screwed... so from an expectations point of view, does HA of your particular service matter if that happens? Even for a financial company, outages happen.
Once you have enough people it might be worth it. For a non mission critical startup? No fucking way.
This has vast benefits for agility and fast development when developers are not always fighting the build system and have a "no fear" attitude about deployment.
If you have that, you can build a system in another region and be able to migrate wholesale to another region with more capacity and not be particularly concerned about the general problem of coordinating the service across multiple regions at the same time.
Everything is for sure until it’s not.
For some clouds that seem to be run on a manual process (IBM, Oracle) that would be expected, since they're sort of clunky. For other places (Rackspace, etc.) it would be uncommon. For a major provider like Azure, well, it's bizarre. I mean, the whole point of cloud is that it's all-you-can-eat.
You would think that this would be something they would advertise/talk about up-front. But who would sign up if that was disclosed?
It is shocking to me that it happened at all. Capacity planning shouldn't be so far behind in a cloud that wants to position itself as being on par with AWS/GCP. (Which Azure absolutely isn't.) To me, having capacity planning be solved is part of what I am paying for in that higher price of the VM.
> We never thought our startup would be threatened by the unreliability of a company like Microsoft, or that they wouldn’t proactively inform us about this.
Oh my sweet summer child, welcome to Azure. Don't depend on them being proactive about anything; even depending on them to react is a mistake, e.g., they do not reliably post-mortem severe failures. (At least, externally. But as a customer, I want to know what you're doing to prevent $massive_failure from happening again, and time and time again they're just silent on that front.)
It took me six months to get approved to start six instances! With multiple escalations including being forcibly changed to invoice billing - for which they never match the invoices automatically, every payment requires we file a ticket.
Unlike GCP and Azure, all AWS regions are (were) partitioned by design. This "blast radius" is (was) fantastic for resilience, security, and data sovereignty. It is (was) incredibly easy to be compliant in AWS, not to mention the ruggedness benefits.
AWS customers with more money than cloud engineers kept clamoring for cross-region capabilities ("Like GCP has!"), and in the last couple of years AWS has been adding some.
Cloud customers should be careful what they wish for. If you count on it in the data center, and you don't see it in a well-architected cloud service provider, perhaps it's a legacy pattern best left on the datacenter floor. In this case, at some point hard partitioning could become tough to prove to audit and impossible to count on for resilience.
UPDATE TO ADD: See my123's link below, first published 2022-11-16, super helpful even if familiar with their approach.
PDF: https://docs.aws.amazon.com/pdfs/whitepapers/latest/aws-faul...
There are AWS region partitions - general, China, US GovCloud (public), US gov secret and US gov top-secret.
Inside a partition, there can be some regions that are opt-in - see https://docs.aws.amazon.com/general/latest/gr/rande-manage.h...
My understanding is that opt-in regions are even more isolated inside a specific partition for partition-global services like IAM and maybe some other stuff.
Could you elaborate on this a little? We use AWS, but are evaluating OCI for certain (very specific) cases, and I'll love to know what questions to ask for comparison purposes.
Here is how partitioned/isolated OCI is by design:
https://www.wiz.io/blog/attachme-oracle-cloud-vulnerability-...
While that's fixed, it speaks volumes to the architecture. Very little has changed since 2018: https://www.brightworkresearch.com/how-to-understand-the-pro...
As noted there, I'd argue OCI is more akin to Softlayer/Bluemix than to GCP, Azure, or AWS, but depending on your certain very specific cases OCI may still be appropriate.
That said: We also had this issue on GCP last month.
We found that all three (AWS, Azure, GCP) are unreliable in their own ways.
The strategy they helped us arrive at was two-pronged:
1. Pre-launch all needed infrastructure. Yes, for all their "cloud scale", it was actually suggested that we preallocate all of our servers the week before, rather than rely on autoscaling.
2. Order capacity reservations for all of those instances (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capa...). This ensures that, if any of those instances go bad, we'd be able to relaunch them without going to the back of the line and finding out that there was no more compute capacity available.
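For anyone who hasn't used them, a capacity reservation is a single API call; here's a sketch with boto3, where the instance type, AZ, and count are placeholders rather than what we actually reserved:

    # Pre-book on-demand capacity so a later launch can't fail for lack of stock.
    # Instance type, AZ, and count are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")

    reservation = ec2.create_capacity_reservation(
        InstanceType="m5.2xlarge",
        InstancePlatform="Linux/UNIX",
        AvailabilityZone="eu-central-1a",
        InstanceCount=20,
        EndDateType="unlimited",  # hold it until explicitly cancelled
    )
    print(reservation["CapacityReservation"]["CapacityReservationId"])

You pay for the reserved capacity whether you use it or not, so cancel it once the launch window is over.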
While we could've just swapped a deployment parameter to deploy to another region, we opted to just use a different SKU of VMs for a short period and switch back to the VMs when they were available again.
We haven't seen issues since.
^^ and by capacity I am talking like 10s or 100s of VMs being available, not 1.
It is the function of the capacity manager to help you plan ahead based on what the data center capacities look like going into the future.
Meet monthly with your capacity manager. Get representation across different technology interests - database, compute, storage, event hubs, etc. Don't ever skip these meetings.
Well that's an unfortunate acronym collision.
GODDAMN.
Sidebar: MSFT is the king of acronym collisions.
I'd be surprised if other cloud providers aren't doing that in some form. I only have experience with Azure (so far).
It’s crazy that this could be valid advice, but it is.
(Speaking from experience - one of our portfolio companies had a similar challenge and we used our network to get to one of the execs of the vendor involved)
Something to try in scenarios like this is to add the “weird and wonderful” VM SKUs that are less popular and may still have capacity remaining.
For example, the HPC series like HBv2 or HBv3. Also try Lsv3 or Lasv3.
Sure, they're a bit more expensive, but you only have to use them until April.
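If you want to script the gamble rather than guess, here's a sketch using the azure-mgmt-compute Python SDK (the region and subscription ID are placeholders; I believe the restrictions field flags SKUs blocked for your subscription or location, but treat that as an assumption to verify):

    # List VM SKUs in a region and whether they're restricted for this subscription.
    # Region and subscription ID are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

    for sku in client.resource_skus.list(filter="location eq 'germanywestcentral'"):
        if sku.resource_type != "virtualMachines":
            continue
        blocked = [r.reason_code for r in (sku.restrictions or [])]
        print(sku.name, "restricted:" if blocked else "available", blocked or "")

Note this only shows restrictions, not live capacity; allocation can still fail at deploy time, which is the whole complaint in this thread.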
I worked for a company that ran mostly on-prem until a year ago, and the last time they ordered machines, availability from Dell was scarce, with huge delays.
Are you stuck only to the German region, and can't go to other European regions?
Had you never heard about (and this is unfortunately not a joke) Microsoft’s music service they once had, shut down after a few short years leaving customers without the ability to listen to the music they had paid to listen to?
The service was called, this was the trademarked name, Microsoft “Plays for Sure.” You cannot make this stuff up.
I also sort of suspect the spot market is less robust there. Lots of Azure is lift and shift on premises workloads, and those aren't using spot. Without people using spot, it's even harder to have spare capacity...
Our quota was silently set to 0 while there were still instances running. This worked fine until auto-scale scaled the instances down to 1 in the night. At the start of the day, auto-scale was not able to scale back up to the initial amount, which led to heavy performance issues and outages. We needed to move the instances, as Azure support did not help us. After many calls with Azure and multiple teams involved, we ultimately did not get the quota approved (even though we already had it and were not asking for "new" quota).
We also decided that we can no longer host in the German Azure region. Even if we could get the quota, not being able to scale for unexpected traffic is a business risk we don't want to bear anymore.
This is huge for us, as our application requires German servers. We are still researching where to host in the future.
They cannot warn you because it's very hard to predict how many new customers will come or if the existing ones will create more instances.
I know about a bank with the same issue: basically, they've hogged all the resources in a specific region and yet they need more. Unfortunately, these things take time; MS cannot set up a new datacenter in a couple of days.
>but setting up in a new region would be complicated for us.
Why? it's easy: https://learn.microsoft.com/en-us/azure/azure-resource-manag...
Latency issues from app to DB?
I've never done k8s on Azure, but my understanding is that Azure is pretty good about coordinating between your own datacenter running Windows and Azure. Maybe you can spin up some Windows boxes in a cheap datacenter to make it work?
I sure hope so, as a German company
This seems like your fundamental problem. If you design an architecture that is limited to a single region of a single cloud provider, you are very likely to encounter issues at some point.
Luckily you have a full month to solve this problem before it will prevent you from accepting new users. My suggestion is to start making your app multi-regional or multi-provider ASAP.
If you aren't a big spender you may not have a TAM who can get this info for you. Welcome to Azure.
Terraform has a different provider for each cloud provider, and the code is no more transferable than a Python script for your infrastructure would be.
The reason for Terraform, and it's a good one, is that your Terraform-related tooling doesn't have to change (e.g., if you route all your infra change approvals through Terraform Cloud), and you can coordinate multi-service changes, e.g., update Auth0 infra to do X, then AWS to do Y.
This seems like a much larger issue than they're making it seem. The promise of the cloud was unlimited scalability. I never thought of cloud resources as finite.
If you need more reliability, I see only one way out: Go multi-region or even multi-cloud.
(You could depend on another startup with no revenue).
Maybe you can spin up some parts of the infrastructure that are not latency-sensitive in a nearby region?
(Also... If into k8s, python, GPUs, graphs, viz, MLOps, working with sec/fraud/supplychain/gov/etc customers on cool deploys, and looking for a remote job, we are hiring for someone to take ownership here!)
https://www.giantswarm.io/
(I work at Giant Swarm.)
Now, they could go the GDPR/Cookies route and prompt absolutely every user on pageload, but doing so would annihilate the purpose of the law into monotonous smithereens, just as it did with Cookies. Good on them for defaulting to the "more secure" mode, but yes this is a potential consequence.
Happy to hear from any German amigos present if I've got something wrong. (But watch out... you might be putting HN at risk - their servers aren't (likely) in Germany!)
[1]: https://incountry.com/blog/which-german-data-privacy-laws-yo...
https://learn.microsoft.com/en-us/previous-versions/azure/ge...
Daily reminder that cloud services are vastly less reliable than traditional hosting; it’s just that they manipulate the terminology to deflect that, replacing reliability with availability, aka “making impression of working”.
They seem to ignore, then repent, and finally apologise. :(
I think you should switch to a new compute provider. GCP?
When we were running our own compute back in '09 and resources ran out or were unreliable, we could shout at the server maintainer and/or install better hardware ourselves. Not the case anymore. :(
-Vip
Which brings me to another important point. If we run out of computers, meaning supply can't keep up with demand, then who are the winners? The people who own the computers: cloud providers and self-hosters. Because of the high demand, cloud providers can raise their prices, and that's directly converted to profit since expenses remain the same, i.e. price gouging. Good job, all you cloud loyalists who use the cloud for everything.