
Seems the money shot starts at page 131:

> The ultimate cause of the peninsular electrical zero on April 28th was a phenomenon of overvoltages in the form of a "chain reaction" in which high voltages cause generation disconnections, which in turn cause new increases in voltage and thus new disconnections, and so on.

> 1. The system showed insufficient dynamic voltage control capabilities to maintain stable voltage

> 2. A series of rhythmic oscillations significantly affected the system, modifying its configuration and making voltage stabilization more difficult.

If I understand it correctly (and, as is typical, just like in software), it was a positive feedback loop: since there wasn't enough voltage control, some other station had to take over but got overloaded instead, also turning off, and so on to the next station.

Late addition: It was very helpful for me to read through the "ANNEX X. BRIEF BASICS OF THE ELECTRIC SYSTEM" (page 168) before trying to read the report itself, as it explains a lot of things that the rest of the report (rightly) assumes you already know.


amluto
I’m a bit mystified as to how the grid controls voltage at all. The non-renewable plants follow the rules here:

https://www.boe.es/buscar/doc.php?id=BOE-A-2000-5204

7.1(b) seems to be saying that generators connected at 220kV adjust their reactive power generation/absorption in real time according to the voltage they observe, based on a lookup table provided by the grid operator.

This seems sort of sensible according to my limited understanding of the theory of AC grids. You can write some differential equations and pretend everything is continuous (as opposed to being a LUT with 11 steps or so), and you can determine that the grid is stable.
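
As a toy sketch, the LUT part might look like this (breakpoints and Mvar values invented, not taken from the BOE document):

    # Toy sketch of LUT-style voltage/reactive-power control. All
    # breakpoints and Mvar values are invented for illustration; the real
    # tables come from the grid operator.
    import bisect

    # (observed voltage in per-unit, reactive power setpoint in Mvar;
    #  negative = absorb)
    Q_TABLE = [
        (0.95,  150.0),   # low voltage -> inject reactive power
        (0.975,  75.0),
        (1.00,    0.0),   # nominal -> neutral
        (1.025, -75.0),
        (1.05, -150.0),   # high voltage -> absorb reactive power
    ]

    def q_setpoint(v_pu: float) -> float:
        """Step through the table the way a coarse LUT would,
        rather than responding continuously."""
        voltages = [v for v, _ in Q_TABLE]
        i = max(0, min(bisect.bisect_right(voltages, v_pu) - 1,
                       len(Q_TABLE) - 1))
        return Q_TABLE[i][1]

    print(q_setpoint(1.04))  # -> -75.0 Mvar (absorb)

The stability question is then whether the discrete steps behave close enough to the continuous idealization.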

However, check out this shorter report from red eléctrica:

https://d1n1o4zeyfu21r.cloudfront.net/WEB_Incident_%2028A_Sp...

Apparently these 220kV plants are connected to the 400kV grid via transformers in substations that are not owned by the generator operators. And those transformers have “tap changers” that attempt to keep the 220kV secondary side at the correct voltage within some fairly large voltage range on the 400kV side. Won’t this defeat the voltage control that the 220kV generators are supposed to provide? If the grid voltage is high, then absorption of reactive power is needed [0], and the generators are supposed to determine that they need to absorb reactive power (which they can do), but if the tap changer changes its setting, then the generator will not react correctly to the voltage on the 400kV side.

In other words, one would like the generator to absorb reactive power according to P_reactive(primary voltage • 220/400), but the actual behavior is P_reactive(primary voltage • 220/400 • tap changer position), where the tap changer position is presumably something like 400/primary voltage, and I don’t understand how the result is supposed to function in any useful way. Adding insult to injury, the Red Eléctrica report authors seem to be suggesting that a bunch of tap changer operators didn’t configure their tap changers well enough to even keep secondary voltages in range.
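
To make that concrete, a back-of-envelope comparison (linear droop, invented numbers) of what the generator does with a fixed-ratio transformer versus an idealized tap changer that pins the secondary at nominal:

    # Assume the generator's Q response is a linear droop on the voltage
    # it sees (its 220 kV side). Compare a fixed-ratio transformer with
    # an ideal tap changer holding the secondary at nominal.

    def q_droop(v_secondary_pu: float, gain_mvar_per_pu: float = 3000.0) -> float:
        # Positive = inject, negative = absorb reactive power.
        return -gain_mvar_per_pu * (v_secondary_pu - 1.0)

    for v_primary_pu in (0.95, 1.00, 1.05, 1.10):
        v_fixed = v_primary_pu   # fixed ratio: secondary tracks primary
        v_oltc = 1.0             # ideal OLTC: secondary pinned at nominal
        print(f"primary {v_primary_pu:.2f} pu: "
              f"fixed ratio -> Q = {q_droop(v_fixed):+7.1f} Mvar, "
              f"ideal OLTC -> Q = {q_droop(v_oltc):+7.1f} Mvar")

    # With the ideal OLTC the generator absorbs nothing even at 1.10 pu
    # on the 400 kV side -- the "defeated" control loop I suspect above.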

Does anyone with more familiarity with these systems know how they’re supposed to work?

[0] I can never remember the sign convention for reactive power.

jakewins
I don't claim to know the details of reactive power management, but the primary mechanism for grid stability in the EU is the "cascade" of services the TSOs procure (toy sketch after the list):

- Fast Frequency Response (FFR), sub-second power adjustment following frequency table

- Frequency Containment Reserve (FCR), ~second power adjustment following frequency table

- Automatic Frequency Restoration Reserve (aFRR), ~second energy production following TSO setpoint signal

- Manual Frequency Restoration Reserve (mFRR), ~minute energy production following TSO activation signals
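
Roughly, the difference between the local-droop layers and the setpoint-following layers looks like this (the ±200 mHz full-activation band is the usual Continental Europe FCR convention; reserve size and names here are invented):

    # Droop-style layers (FFR/FCR) react locally to measured frequency;
    # aFRR just tracks the TSO's near-real-time setpoint.

    def fcr_power_mw(freq_hz: float, reserve_mw: float = 10.0) -> float:
        """Local droop: proportional to frequency deviation, saturating
        at the full sold reserve at a 0.2 Hz deviation."""
        deviation = 50.0 - freq_hz                 # >0 when under-frequency
        activation = max(-1.0, min(1.0, deviation / 0.2))
        return reserve_mw * activation             # MW to add (or shed)

    def afrr_power_mw(tso_setpoint_mw: float) -> float:
        """aFRR: no local decision -- track the TSO's signal."""
        return tso_setpoint_mw

    print(fcr_power_mw(49.9))  # 0.1 Hz low -> +5.0 MW from a 10 MW reserve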

My understanding is the primary failure in Spain was that 9 separate synchronous plants that had sold aFRR(?) to the TSO then failed to deliver, so when the TSO algorithms tried to adjust the oscillations, nothing happened. Everything else was kinda "as designed".

pjc50
> 9 separate synchronous plants that had sold aFRR(?) to the TSO then failed to deliver, so when the TSO algorithms tried to adjust the oscillations, nothing happened.

Oof. This sounds like a classic case of "it's only needed in emergencies, so it's only in emergencies that we find out it doesn't work".

jakewins
I don't know about the Spanish market, but at least in the markets I'm involved in, aFRR is an "always on" product: the TSO controls your plant with a setpoint that updates in near-real-time throughout the period you've sold to them. It's not clear to me that the product that wasn't delivered was actually aFRR, though; maybe it was something else less frequently called upon.
amluto
It looks to me like a major factor was that the grid failed to control voltage, not frequency. Frequency control should be unaffected by transformers.
scrlk
The automatic voltage regulator (AVR) of the generator and the on-load tap changer (OLTC) operate on different timescales (AVR = quick; OLTC = slow). OLTCs are typically set up with a voltage deadband and a time delay to prevent 'hunting' (i.e., repeatedly tapping up and down).
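
Schematically, the OLTC side looks something like this (band width, delay, and step-direction convention all illustrative, not any particular unit's settings):

    # Schematic OLTC logic: a deadband plus a time delay, so the tap only
    # moves after the voltage error has sat outside the band for a while.
    class TapChanger:
        def __init__(self, deadband_pu=0.015, delay_steps=30):
            self.deadband = deadband_pu   # no action inside +/- this band
            self.delay = delay_steps      # e.g. 30 x 1-second samples
            self.position = 0             # +1 tap = lower the secondary here
            self._timer = 0

        def update(self, v_error_pu: float) -> int:
            """Feed one secondary-voltage-error sample; return tap position."""
            if abs(v_error_pu) <= self.deadband:
                self._timer = 0           # back in band: reset, no hunting
                return self.position
            self._timer += 1
            if self._timer >= self.delay:
                self.position += 1 if v_error_pu > 0 else -1
                self._timer = 0           # wait a full delay before stepping again
            return self.position

The deadband keeps it from reacting to the AVR's fast corrections; the delay keeps it from chasing transients.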
amluto
At the end of the day, reactive power in and reactive power out need to add up to zero (keeping in mind that many components can add or remove reactive power and that reactive energy is not conserved the way that real energy is) or, equivalently, that the grid voltage needs to be in range. It’s not sufficient to merely keep the higher frequency components of the voltage in range.

So if the grid wants to be at 400kV, and achieving 400kV under particular generation conditions requires 1500 Mvar of reactive power absorption by the grid (I made up that number), and the grid operator is relying on 220kV conventional generators to collectively have 1000 Mvar of absorption available under said conditions, then something needs to communicate that need to those generators so that they actually absorb those 1000 Mvar. And if the OLTCs fool the control algorithm into causing those generators to absorb only 400 Mvar, then there’s a mismatch, and that mismatch doesn’t go away just because the OLTCs are supposed to be slow.

If, as the writeup seems to suggest, the grid design also requires the OLTCs to operate quickly under large voltage fluctuations because the secondary side cannot tolerate the same fractional voltage swing that the primary side is specified to tolerate, then I would not want to be the person signing off on the grid being stable. (Writing the simulator could be fun, though!) Maybe the idea is that, if the primary voltage is stable at 10% above nominal, then the OLTCs are intended to be stable at a position that holds the secondary at 5% above nominal, and that in turn is intended to result in the correct amount of reactive power absorption?
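
Since the simulator could indeed be fun, here's a deliberately crude one-bus version of the loop I'm worried about. Everything is invented: primary voltage rises linearly with the un-absorbed reactive surplus, the generator runs a linear Q(V) droop on its secondary-side voltage, and an idealized OLTC re-pins the secondary at nominal every 30 steps:

    # Crude one-bus sketch of the suspected loop. All numbers invented.
    surplus_mvar = 600.0    # reactive surplus the grid needs absorbed
    sensitivity = 0.0001    # pu of primary voltage per un-absorbed Mvar
    droop = 3000.0          # Mvar absorbed per pu of secondary over-voltage
    tap = 1.0               # OLTC ratio correction (1.0 = neutral)
    v_primary = 1.0

    for t in range(120):
        v_secondary = v_primary * tap
        q_absorbed = droop * (v_secondary - 1.0)
        v_primary = 1.0 + sensitivity * (surplus_mvar - q_absorbed)
        if t % 15 == 0:
            print(f"t={t:3d}  v_primary={v_primary:.4f} pu  "
                  f"Q_absorbed={q_absorbed:+7.1f} Mvar")
        if t % 30 == 29:    # slow OLTC: re-pin the secondary at nominal
            tap = 1.0 / v_primary

Each time the OLTC re-centres the secondary, the absorption collapses and the primary voltage creeps back up toward its uncorrected level, which is exactly the masking effect I'm describing.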

If I were designing this thing from scratch, I would want an actual communication channel by which facilities that can adjust their reactive power can be commanded to do so independently of the voltage at the point at which they’re connected. And I would want a carefully considered decentralized algorithm to use these controls which, as a first pass, would take input from the primary side at the relevant substations. And then I would want to extend a similar protocol to most or all of the little solar generators at customer sites (not to mention the larger solar facilities that don’t dynamically control reactive power at all in Spain) because they, collectively, can quickly supply or absorb large amounts of reactive power on demand. (Large facilities would use fiber. Small facilities would use digital signals over the power lines or, maybe, grudgingly, the Internet. We really don’t want a situation where the grid cannot start up without customer sites having Internet access.)
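
Concretely, I'm imagining messages more like this (entirely hypothetical shapes, not any real SCADA/IEC protocol), with a local-droop fallback so the grid never depends on the link staying up:

    # Hypothetical command message for the out-of-band channel described
    # above. The key design point is the expiry: if the channel goes
    # stale, the facility reverts to local droop control.
    from dataclasses import dataclass

    @dataclass
    class ReactiveSetpoint:
        facility_id: str
        q_mvar: float           # signed: negative = absorb (no
                                # sign-convention guessing at 3 a.m.)
        valid_for_s: float      # command expires after this many seconds
        issued_at_unix: float

    def q_to_apply(cmd: ReactiveSetpoint, now_unix: float,
                   local_droop_mvar: float) -> float:
        """Follow the commanded setpoint while fresh; otherwise fall back
        to the locally computed voltage-droop response."""
        if now_unix - cmd.issued_at_unix <= cmd.valid_for_s:
            return cmd.q_mvar
        return local_droop_mvar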

Or I would dream of a grid that’s primarily DC with AC islands where the DC portions don’t care about reactive power or frequency at all and merely need to control voltage and power flow.

scrlk
What you're proposing regarding communication/control channels is already present: the system operator has SCADA links to transmission-connected generators + an energy management system to help them control their system.

See Part C here (e.g., SOs having the ability to control generator setpoints): https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:...

amluto
I admit that link is rather long and I didn't read the whole thing. I see a lot of real-time exchange of data, but the only voltage/reactive power control part I spotted was Article 22 1(c), which doesn't clearly say anything about real-time control.

And the original long PDF, page 21, mentions the use of Operating Procedure 7.4 Dynamic Voltage Control, and it very vaguely mentions programming of the RRTT (which seems to include the 7.4 schedule) the day before the failure and the day of the failure, but I didn't see anything about the operator programming the RRTT during the failure to control voltage.

It seems to me (and this is not any sort of control theory analysis) that, if the grid voltage is too high (in the specified range, but high enough that tap changers must operate to avoid disconnecting generators) and additional reactive power absorption is needed, then the grid ought to react by operating the tap changers (because it's necessary) and by somehow instructing the generators to absorb additional reactive power despite the operation of the tap changers. And I see plenty of discussion about the tap changers in the big PDF, as well as plenty of discussion of data acquired via SCADA links, but I don't see anything about adjusting the reactive power schedules to compensate for the operation of the tap changers, or about the use of any sort of real-time SCADA control to adjust reactive power.

leymed
I think your interpretation is correct. Voltage control is done at the top level of the grid, meaning the control covers the bigger generation stations and major substations. Even for a small generator with rotating machinery, you won't have strict voltage control beyond its own AVR. The problem I see here is that we embed smaller individual generators at the lower level, where they pump the generated power into the grid at medium voltage. When the majority of your generation is at this level, you won't have strict control over voltage, and I assume not even over frequency. I'm still digesting the report, but what I'm after is whether they really neglected this, and whether voltage control is even possible with 50% of generation coming from renewables through the medium-voltage (i.e., lower) level.
madaxe_again
I’ve just read the whole shebang.

While the overall reason for the mass failure you cite is correct - a cascading failure - the interesting bit here is the oscillations that led to it.

It looks very much like this was driven by algorithmic volatility trading of electricity spots - overproduction, price goes negative, buys are placed, price rises, production ramps in response to the rising price, sells are placed, price falls, and production falls with it. The period of the oscillations seen in the grid before the blackout suggests a relatively slow cycle, and what they describe in the report sounds very much like an interaction between price-driven supply and real-world supply.

It does speak to there being inadequate storage available on the grid to smooth demand and therefore pricing, but it also suggests that in certain conditions a harmonic can be set up between the market and price-driven production with catastrophic consequences.
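
As a toy model of that harmonic (gains invented; the point is just that production chasing price too aggressively overshoots on every cycle, so the oscillation grows instead of damping out):

    # Toy price/production feedback: price moves against the surplus,
    # producers chase the price. With an over-eager combined gain the
    # correction overshoots every step and the oscillation grows.
    demand = 100.0
    production = 110.0      # start 10 MW over
    price_gain = 0.8        # EUR per MW of surplus
    prod_gain = 3.0         # MW of ramp per EUR of price deviation

    for t in range(10):
        surplus = production - demand
        price = 50.0 - price_gain * surplus       # surplus pushes price down
        production += prod_gain * (price - 50.0)  # producers chase the price
        print(f"t={t:2d}  surplus={surplus:+8.2f} MW  price={price:7.2f} EUR")

    # Combined loop gain 0.8 * 3.0 = 2.4 > 2, so each correction
    # overshoots: the surplus flips sign and grows 40% per step.

With a smaller combined gain the same loop damps out, which is roughly what adequate storage would buy you by absorbing the surplus before the price moves.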

diggan OP
> It looks very much like this was driven by algorithmic volatility trading of electricity spots

Yes, plus fewer "reactive power stations" were available than expected (it seems some unexpectedly went offline that day, and there weren't enough safeguards/communication channels to catch this), plus a switch of the French interconnection between import and export at the same time, leading to the overvoltage issue, which then spiraled.

As far as I read the report, there were multiple causes, not a single one like "algorithmic volatility trading of electricity spots", but a combination of issues where, taken one by one, things would have been fine. But all together? Shit broke.
