Modern standby on FreeBSD (S0ix) ⚡

Reading time: 12 min
Date published: 1/11/2024

Background (S3 v. S0ix)

One of the main things still missing in FreeBSD for it to be usable on modern laptops is the ability to go to sleep. In the past, this was done using something called ACPI S3, but vendors have slowly been phasing this out in favour of something else called S0ix. FreeBSD does not support S0ix as of yet, leaving it without sleep support on these devices.

S3 is one of the global sleep states that ACPI defines (other examples include S0 when in regular operation and S5 when the computer is fully turned off). When you tell your machine to go into the S3 sleep state, the acpi_EnterSleepState function is called, which eventually tells your ACPI firmware to put your machine to sleep.

With S0ix, the system instead stays in the S0 global state, and the firmware only enters a low-power state when the CPUs are idle and some device power constraints are met, which the OS is responsible for ensuring. The x in S0ix denotes the specific low-power idle state the system is in, the deepest of which, and our eventual goal, is S0i3.

A fair warning: this article delves into the sombre depths and tedium of ACPI, so it's probably not the most exciting read. But here's a picture of Beastie snoozing to keep you company:

Beastie sleeping

I gave a presentation on this topic at FOSDEM 2025, which you can view here.

Does my laptop use S3 or S0ix? And what is s2idle?

On FreeBSD, you can query the sleep states your machine supports by reading the hw.acpi.supported_sleep_state sysctl (hw.acpi.suspend_state gives you the sleep state used for suspend). If you don't see S3 in the list, your machine probably only supports S0ix.

To be sure that your machine indeed does support S0ix, you need to check the FADT flags, specifically AcpiGbl_FADT.Flags & ACPI_FADT_LOW_POWER_S0.
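As a rough sketch of that check, here is the bit test on a raw FADT flags word. The constant mirrors the value ACPICA uses for ACPI_FADT_LOW_POWER_S0 (bit 21 of the FADT flags); the fadt_supports_s0ix name is made up for illustration, and in the kernel you would simply test AcpiGbl_FADT.Flags directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors ACPICA's ACPI_FADT_LOW_POWER_S0 (FADT flags, bit 21). */
#define FADT_LOW_POWER_S0	(UINT32_C(1) << 21)

/* Hypothetical helper: does the FADT advertise low-power S0 idle? */
bool
fadt_supports_s0ix(uint32_t fadt_flags)
{
	return ((fadt_flags & FADT_LOW_POWER_S0) != 0);
}
```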

Note that as of D48734, all ACPI machines will advertise s2idle as supported, which, although related to S0ix, does not imply that a given machine supports S0ix.

s2idle or "suspend-to-idle" is a "fake" sleep state which basically just means that you do all the usual setup to sleep your machine, except that you're just idling the CPU rather than actually entering a sleep state. Theoretically, this works on any machine, but it doesn't save all that much power on its own.

Instead, if everything has been set up correctly, the firmware will enter one of the S0ix states and will hopefully end up in S0i3 at some point when the OS is in s2idle.

What has already been done?

Ben Widawsky from Intel started work on this in 2018 with two patches, D17675 for suspend-to-idle support and D17676 for emulating S3 with S0ix. This work was never finished, however.

Debugging: LPIT v. _LPI, residency counters, and the AMD SMU 🐛

The LPIT (Low Power Idle Table, defined in this Intel spec) describes the low-power idle states that the CPU supports. These table entries also contain residency counters, which just tell you how long a CPU has spent in a particular low-power state, which is obviously useful for debugging.

It would seem as though LPIT has gone out of favour since ACPI 6.0. It says as much for ARM. It does seem like newer Intel devices still have the LPIT table but no _LPI objects (e.g. the Dell XPS 15 9570), whereas AMD laptops only have _LPI objects, which means both will have to be supported.

The ACPI spec now makes no mention of LPIT, but it does look like LPI is its replacement. Unfortunately, they made residency counters for each of these states optional, and it so happens that they are missing on my AMD Framework laptop.

Luckily, AMD chips have an SMU (System Management Unit, which you'll also see referred to as "MP1") core on-die which we can ask for residency information. This is a small LatticeMico32 microprocessor that runs the power management firmware (PMFW), which is also what actually decides whether or not we enter S0i3 and whether power goes to the CPU.

Initial support for this is added with an amdsmu driver in D48683, and residency counters are exposed as sysctls in D48714. We'll revisit the SMU later when we talk about the sleep process. Rudolf Marek has an interesting CCC talk about "Matroshka processors" as he calls them.

Dieshot of Matroshka processor on an AMD CPU. Credit to @Locuza_ on Twitter.

One last thing I'd like to touch on regarding debugging on AMD is the amd_s2idle.py script on Linux, which is very helpful in debugging the myriad reasons why a laptop may not be entering the deep sleep S0i3 state. I'd like to write something similar for FreeBSD at some point once S0i3 is actually working.

SPMC (System Power Management Controller) or PEP (Power Engine Plugin)

The SPMC or PEP - as far as I'm aware, these can be used interchangeably - is the primary device used for interacting with the firmware for S0ix. It uses ACPI ID PNP0D80 ("Windows-compatible System Power Management Controller"). For this, I have written a new acpi_spmc driver for FreeBSD in D48387.

It is useful for two main things: getting the device power constraints, and notifying the firmware of display and sleep transitions.

This is done through DSMs (Device Specific Methods).

DSMs (Device Specific Methods)

In ACPI-speak, a DSM (_DSM object) is a sort of special multiplexed method for executing, well, device-specific methods. When you evaluate a _DSM object, you pass it a vendor-specific UUID as its first argument, a revision as its second, a function index as its third, and, finally, an optional package (== a vector in ACPI-speak) of arguments as its fourth. On FreeBSD, acpi_EvaluateDSMTyped is used to do this for you.

It seems like the original Intel spec linked above is not actually used in practice (UUID c4eb40a0-6cd2-11e2-bcfd-0800200c9a66), at least not on modern Intel or AMD platforms. Instead, there's Microsoft's DSM UUID 11e00d56-ce64-47ce-837b-1f898f9aa461, whose functions are thankfully quite similar to the original DSM's, except with a couple of extra "Modern Standby" functions and with some others missing:

Index | Description                        | Notes
------|------------------------------------|------------------------
0     | Enumerate functions                |
1     | Get device constraints             | Only in the Intel spec.
2     | Get crash dump device              | Only in the Intel spec.
3     | Display off notification           |
4     | Display on notification            |
5     | Entry notification                 |
6     | Exit notification                  |
7     | "Modern Standby" entry notification |
8     | "Modern Standby" exit notification  |

AMD seems to have their own DSM UUID e3f32452-febc-43ce-9039-932122d37721 along with Microsoft's one, for which I haven't really been able to find any documentation outside of the Linux implementation. This is what they look like:

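Index | Description              | Notes
------|--------------------------|------------------------------------------------------------------------
0     | Enumerate functions      |
1     | Get device constraints   |
2     | Entry notification       | On Framework laptops, this slowly fades the power button LED in and out.
3     | Exit notification        |
4     | Display off notification |
5     | Display on notification  |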
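Function index 0 ("enumerate functions") is how you find out which of the other indices a given UUID actually implements. By _DSM convention it returns a buffer that is read as a bitmask, where bit n being set means function n is supported. A minimal sketch of decoding that buffer follows; dsm_function_supported is a hypothetical helper name, and mask stands in for whatever bytes the _DSM evaluation returned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: check bit `index` in the bitmask returned by
 * _DSM function 0. Bits beyond the end of the buffer count as unsupported.
 */
bool
dsm_function_supported(const uint8_t *mask, size_t len, unsigned int index)
{
	size_t byte = index / 8;

	if (mask == NULL || byte >= len)
		return (false);
	return (((mask[byte] >> (index % 8)) & 1) != 0);
}
```

For example, a one-byte mask of 0x3f would advertise functions 0 through 5, which is the full set in the AMD table above.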

A simplified pseudo-code example of calling e.g. the "get device constraints" function on AMD looks like this:

Arg0 = "e3f32452-febc-43ce-9039-932122d37721" // AMD's SPMC DSM UUID.
Arg1 = 0 // Revision zero.
Arg2 = 1 // "Get device constraints" function ID.
Arg3 = Package() // No arguments needed.
call_dsm(spmc_device, Arg0, Arg1, Arg2, Arg3)

On AMD platforms, we must use the AMD UUID for getting device constraints, which makes sense as Microsoft's DSMs don't have this. For some reason, though, the device constraints package returned by the AMD UUID follows a different format for which I couldn't find a spec anywhere 🙃

It seems like we need to use both the Microsoft and AMD UUIDs for the notifications (including the "Modern Standby" ones), though. We'll talk more about this later.

I don't know what exactly the situation is like on modern Intel platforms.

Going to sleep 💤

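Okay, so what does the process for going to sleep actually look like? Broadly, we go through the following steps: putting devices to sleep, checking for device power constraint violations, sending the display off and sleep entry notifications, masking out interrupts and GPEs unrelated to waking the system, and finally idling the CPUs.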

Putting devices to sleep

The first step is to put all the devices attached to the system to sleep themselves. These devices are things like USB peripherals, the GPU, any NVMe drives, &c. At minimum, to enter an LPI state, we must satisfy the device constraints gotten from the SPMC. In practice though, if we're going to sleep, we might as well try to save as much power as possible and attempt to put all devices to sleep.

An ACPI device has five-ish power states, known as D-states: D0 (fully on), D1, D2, D3hot (off but still powered), and D3cold (off and with power completely removed). The distinction between D3hot and D3cold seems to be a relatively new one, and it's unclear which one "D3" refers to in the ACPI spec. See this PR I opened on the ACPICA GitHub repo discussing this, and the (WIP) D48384 revision for adding D3cold support to FreeBSD.

Switching between these states is done through the acpi_pwr_switch_consumer function on FreeBSD (a "power consumer" is just a device).

To set a device's D-state, one must first get the power resources required for that D-state through the _PRx (where x is the target D-state) objects (ACPI 7.3.8 - 7.3.11) and ensure they are all turned on. Conversely, the power resources for all higher-power states (i.e. lower-numbered x) must be turned off. Finally, the _PSx object is evaluated to actually set the device to the desired D-state.

A device only supports D3cold if it lists explicit power resources for D3 through a _PR3 object, in which case, keeping those power resources on transitions the device to D3hot and turning them off transitions it to D3cold.
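The power resource bookkeeping described above can be sketched with bitmasks. This is a simplified model and not FreeBSD's actual acpi_pwr code: each power resource is one bit, pr[x] is the set listed in _PRx, and every name here (power_dev, pr_plan, plan_dstate_switch) is made up for illustration. It also assumes power resources are not shared between devices, whereas a real implementation has to reference-count them.

```c
#include <stdint.h>

enum dstate { D0, D1, D2, D3HOT, D3COLD };

/* Hypothetical model: pr[x] is the set of power resources listed in _PRx. */
struct power_dev {
	uint32_t pr[4];		/* Indexed by D0..D3HOT. */
};

/* Which power resources to turn on and off for a D-state switch. */
struct pr_plan {
	uint32_t turn_on;
	uint32_t turn_off;
};

/*
 * `enabled` is the set of this device's power resources that are on now.
 * For D3cold nothing is wanted, so the _PR3 resources end up turned off.
 */
struct pr_plan
plan_dstate_switch(const struct power_dev *dev, uint32_t enabled,
    enum dstate target)
{
	uint32_t wanted = (target == D3COLD) ? 0 : dev->pr[target];
	struct pr_plan plan;

	plan.turn_on = wanted & ~enabled;	/* Needed but currently off. */
	plan.turn_off = enabled & ~wanted;	/* On but no longer needed. */
	return (plan);
}
```

The sketch only computes the two sets; evaluating the matching _PSx object is left out.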

There was an issue with turning these power resources off in FreeBSD, which I fixed in D48385.

Checking for device power constraint violations 🚓

Before we intend to go to sleep, it is useful to check that we're not violating any of the device power constraints gotten from the SPMC.

For this, we need a way to get a device's current D-state. I added an acpi_pwr_get_consumer function for doing this in D48386.

ACPI defines multiple ways of getting the D-state of a device. The first and simplest is through the _PSC (power state current, ACPI 7.3.6) control method, which simply spits out the device's D-state when evaluated. _PSC isn't implemented for all devices, however:

This control method is not required if the device state can be inferred by the Power Resource settings. This would be the case when the device does not require a _PS0, _PS1, _PS2, or _PS3 control method.

The "Power Resource settings" the spec mentions are our friends the _PRx objects. From these, we can infer the D-state of a device as follows:

Then, it's just a simple matter of making sure the device's D-state is greater than or equal to the one in the corresponding device power constraint package (a higher D-state number means a lower-power state).
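A minimal sketch of one way to do this inference and the constraint check, under a few assumptions: power resources are modelled as bits in a mask, the device has at least one _PRx object (otherwise _PSC is needed), and the device is taken to be in the shallowest state whose listed power resources are all on, falling back to D3cold when none match. The names pr_lists, infer_dstate and constraint_satisfied are hypothetical and do not correspond to the code in D48386.

```c
#include <stdbool.h>
#include <stdint.h>

enum dstate { D0, D1, D2, D3HOT, D3COLD };

/* Hypothetical model of a device's _PR0.._PR3 objects. */
struct pr_lists {
	bool	 has_pr[4];	/* Does the device have a _PRx object? */
	uint32_t pr[4];		/* Power resources listed in _PRx. */
};

/* `enabled` is the set of power resources that are currently on. */
enum dstate
infer_dstate(const struct pr_lists *dev, uint32_t enabled)
{
	for (int x = D0; x <= D3HOT; x++) {
		/* All resources required for Dx are on: we are in Dx. */
		if (dev->has_pr[x] && (dev->pr[x] & ~enabled) == 0)
			return ((enum dstate)x);
	}
	return (D3COLD);
}

/* A constraint holds if the device is at least as deep as required. */
bool
constraint_satisfied(enum dstate current, enum dstate min_required)
{
	return (current >= min_required);
}
```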

Sending display off and sleep entry notifications 🖥

There isn't much to talk about here. We just need to call the display off and sleep entry DSM functions on the SPMC. This is done in D48735.

Interrupts and GPEs 📣

This is a little tricky. In the next step we'll stop the CPUs in such a way that they can only be woken up by an interrupt. Lots of things could interrupt the CPU, so we'd like to mask out all interrupts which are not related to actually waking the system before going to sleep.

ACPI interrupts are done through system control interrupts or SCIs. The interrupt number for SCIs is gotten from AcpiGbl_FADT.SciInterrupt, and is usually interrupt number 9. So we first mask out all the interrupts except for the SCI:

register_t rflags = intr_disable(); // Save previous IF, run x86 cli.
intr_suspend(); // Stop interrupts from all PICs.
intr_enable_src(AcpiGbl_FADT.SciInterrupt); // Enable SCIs (interrupt 9).

// Sleep...

intr_resume(false); // Resume interrupts on all PICs.
intr_restore(rflags); // Restore IF.

When an SCI is triggered, the OS is supposed to read a special register to figure out what GPE number (general purpose event) caused this interrupt. This blog post explains this in further detail.
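To make the register lookup a little more concrete: a GPE block exposes status and enable registers with one bit per GPE, and a GPE is pending when its bit is set in both. The sketch below only models that bit arithmetic on plain byte arrays standing in for the register contents; pending_gpes is a made-up name, and in the kernel ACPICA does the real register access and dispatch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: collect the numbers of all GPEs that are both
 * raised (status) and enabled. `base` is the first GPE number in the
 * block. Returns how many GPE numbers were written to `out`.
 */
size_t
pending_gpes(const uint8_t *status, const uint8_t *enable, size_t nregs,
    unsigned int base, unsigned int *out, size_t max)
{
	size_t n = 0;

	for (size_t reg = 0; reg < nregs; reg++) {
		uint8_t active = status[reg] & enable[reg];

		for (unsigned int bit = 0; bit < 8; bit++) {
			if ((active & (1u << bit)) != 0 && n < max)
				out[n++] = base + (unsigned int)reg * 8 + bit;
		}
	}
	return (n);
}
```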

Not all of these GPEs should cause an interrupt though. For example, my Framework's embedded controller sends me a GPE once a second to update me on the battery status. Obviously, we don't want this to wake the system up from sleep.

So ACPI has a mechanism for masking out GPEs coming from specific devices, namely through the _DSW (or _PSW for older devices, see ACPI 7.3.1) method.

The issue is that lots of laptops will put important wake devices under the same GPE number as noisy devices such as the battery mentioned previously. Here is some simplified ASL code showing that the lid and battery status change GPEs are under the same GPE number:

Device (EC0) { // The embedded controller.
	Name (_GPE, 0x0B) // GPE number.
	Device (LID0) { /* ... */ }
	Method (_Q01, 0, NotSerialized) { // GPE for lid device.
		P80H = 0x01
		Notify (LID0, 0x80) // Status change.
	}
	Method (_Q3C, 0, NotSerialized) { // VERY noisy GPE for battery (1 GPE/s).
		P80H = 0x3C
		Notify (BAT1, 0x80) // Status change.
	}
}
Device (BAT1) { /* ... */ }

This means that, if we mask out the battery GPE, we also mask out the lid GPE, which is no good. This is mitigated somewhat by entering suspend on my machine, but the battery will still emit a GPE, just a bit slower, at once a minute. Hopefully once we get to S0i3 the firmware will know to shut up with the useless GPEs, and 1 minute is more than enough time to enter S0i3.

We might still not be 100% safe from spurious wakeups, so the solution that Linux uses and the solution that I'll implement in FreeBSD soon is to have an "s2idle loop". When the CPU is woken up from idle, the OS will check what the last wakeup event was, and if it doesn't agree that it should have been woken up, it will immediately idle the CPUs again.

Checking the last wakeup event means that we can expose it as a sysctl for free, which is great for a user wanting to debug for what reason their laptop woke up in the middle of the night.
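The shape of such an s2idle loop can be sketched in a few lines. Everything here is a stand-in: the wake_event values, the should_wake policy and the scripted wake_script "hardware" are all hypothetical, and idle_until_event replaces what would really be idling the CPUs and then reading the last wakeup event.

```c
#include <stdbool.h>
#include <stddef.h>

enum wake_event { WAKE_BATTERY_STATUS, WAKE_LID, WAKE_POWER_BUTTON };

/* Scripted stand-in for the hardware: one wakeup event per idle period. */
struct wake_script {
	const enum wake_event *events;
	size_t count;
	size_t next;		/* Number of idle periods entered so far. */
};

/* Hypothetical policy: only user-visible events should end the sleep. */
bool
should_wake(enum wake_event ev)
{
	return (ev == WAKE_LID || ev == WAKE_POWER_BUTTON);
}

/* Stand-in for "idle the CPUs, then read the last wakeup event". */
enum wake_event
idle_until_event(struct wake_script *s)
{
	return (s->events[s->next++]);
}

/*
 * Keep idling until a genuine wakeup event arrives. Returns false only
 * because the script can run out; real hardware would just keep idling.
 */
bool
s2idle_loop(struct wake_script *s, enum wake_event *woken_by)
{
	while (s->next < s->count) {
		enum wake_event ev = idle_until_event(s);

		if (should_wake(ev)) {
			*woken_by = ev;	/* Could be exposed as a sysctl. */
			return (true);
		}
		/* Spurious wakeup: go straight back to idle. */
	}
	return (false);
}
```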

Idling the CPU (MWAIT)

The final step the OS has to take is to idle the CPUs.

To go to sleep, we need to set them to their maximum C-state (CPU power state) and, if we did the previous setup right, the firmware will hopefully take care of the rest.

The MWAIT instruction can do this for us. It's an x86 instruction that's usually used in conjunction with MONITOR to enter an "implementation-dependent optimized state" and wait until a specific memory range is written to.

If CPUID.05H:ECX[bit 0] is set, it can also be used for power management. Specifically, eax can be set to contain hints to MWAIT, including the C-state that the processor should enter, and ecx can be set to contain extensions.

For our purposes, we can set the lowest bit of ecx to 1 to allow for interrupts to break out of MWAIT (i.e. wake the CPU up), even when they are masked. Thanks to this, we can forgo the need to set up a memory range to monitor.

Bits 7 to 4 of eax are used to specify the target C-state to enter (we can ignore the lowest 4 bits which are for "sub C-states"):

mov eax, 0x30 ; C-state C4 (MWAIT_C4).
mov ecx, 1    ; Break on interrupt, like hlt (MWAIT_INTRBREAK).
mwait

FreeBSD's cpu_idle() function will use MWAIT when it's available.
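For reference, the hint value passed in eax can be computed rather than hard-coded. In Intel's encoding, bits 7:4 hold the target C-state minus one (so 0 means C1, and 0xf means C0), and bits 3:0 hold the sub C-state. The sketch below is just that arithmetic; mwait_hint is a made-up helper name, and which hint values a given CPU actually accepts is implementation-specific.

```c
#include <stdint.h>

/* ecx bit 0: interrupts break out of MWAIT (FreeBSD: MWAIT_INTRBREAK). */
#define MWAIT_ECX_INTRBREAK	UINT32_C(0x1)

/*
 * Hypothetical helper: build the eax hint for a C-state and sub C-state.
 * Bits 7:4 are the C-state minus one, bits 3:0 are the sub C-state.
 */
uint32_t
mwait_hint(unsigned int cstate, unsigned int substate)
{
	return ((((cstate - 1) & 0xf) << 4) | (substate & 0xf));
}
```

With this, mwait_hint(4, 0) gives the 0x30 used in the assembly above.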

Vendor-specific complications: AMD

On AMD, there are a few extra things we need to do for the PMFW running on the SMU to actually enter S0i3. As mentioned earlier, these conditions can be checked with the amd_s2idle.py script on Linux:

If any of these conditions are unmet, PMFW will refuse to transition to S0i3, and you will get negligible power savings.

I have built up a minimal kernel config starting from make tinyconfig with just enough enabled to actually enter S0i3 for debugging. A special thanks to Mario Limonciello (superm1) from AMD for helping me figure this all out.

What about hibernation (S4)?

Hibernation actually has little to do with S0ix. Instead of suspending to RAM (i.e. keeping RAM powered while the rest of the system is powered off), hibernation swaps all pages in RAM to disk and then completely powers off the system. When you want to exit out of hibernation, the bootloader reads back the image from disk to memory to restore the system to its previous state.

S4 saves more power than S0ix (actually, in S4 the system consumes no power at all), but the downside is that it of course takes way longer to enter and exit.

FreeBSD actually has support for S4BIOS, a transitional way of doing hibernation, intended to ease the adoption of S4, in which the BIOS does most of the heavy lifting. This doesn't exist on modern laptops, but you can check if yours has it through hw.acpi.s4bios.

There is also hybrid suspend, in which the system enters an S3 or S0ix state but still writes the hibernation image to disk anyway. This way, you get the advantages of fast wake times but you don't risk corrupting your filesystem if the battery reaches a critical level and your system suddenly loses power.

What's next? 🔮

Here's a grab-bag of things that still need to be done or would be nice to have: