Sleep study autoscoring is starting to go beyond simply calculating the apnea-hypopnea index.

By Lisa Spear

In many cases, sleep study autoscoring simply speeds up an otherwise manual process without adding new information. But emerging research has evaluated new applications that could identify more detailed sleep architecture, define sleep apnea phenotypes, or even change the way sleep disorders are diagnosed. 

At Stanford University School of Medicine, Oliver Sum-Ping, MD, leads an autoscoring project for the sleep division. As an example of the novel insights autoscoring may yield, he cites work by his colleague Emmanuel Mignot, MD, PhD, on how an algorithm could detect narcolepsy with cataplexy in a single overnight sleep study.1

“Traditionally, we use an overnight sleep study and an MSLT [multiple sleep latency test] as a separate study the day after, which is pretty time-intensive and resource-intensive—patients spend nearly 24 hours in the lab. But with this system, with at least type 1 narcolepsy, it works fairly well for detecting narcolepsy in a single night,” Sum-Ping says. “Right now, it is just a research study, but I am hopeful that it can be applied more in clinical practice. I think there are different layers to why it is not used yet, but a lot of it is just diagnostic standards that require the MSLT. But it is a good example of the kind of promise that autoscoring has.”

By necessity, sleep scoring has been simplified, completely bypassing valuable data that could tell us more about human health. When scoring a sleep study, techs are looking at a superficial EEG, says Rafael Pelayo, MD, a sleep specialist at Stanford’s Sleep Medicine Center.

“Sometimes we lose sight of how the classic scoring that we do really arises only from a very superficial look at what is happening in the brain, since we are looking at superficial EEG, and that is only a single layer of the neocortical ribbon that is being represented in that EEG and there are six layers to it. So those other five layers we completely ignore in our description of sleep,” says Pelayo, a clinical professor of psychiatry and behavioral sciences in sleep medicine.

Machine learning within autoscoring platforms could tap into some of the vast data that is currently ignored or underutilized. 

Sum-Ping says that, in the future, machine learning applications could identify rapid eye movement (REM) sleep behavior disorder. Autoscoring tools could also lead to the development of phenotyping for the diagnosis of obstructive sleep apnea and speed up the diagnostic process for those who experience narcolepsy.

A paper published in Sleep explains how to extract the physiological subtypes, or endotypes, of sleep apnea from conventional polysomnography. Scientists are now using this method in research to predict which patients are likely to tolerate CPAP or other sleep apnea treatments. “We are working on using this information to determine which patients may not benefit from CPAP treatment and who will benefit,” says paper coauthor Jón Skírnir Ágústsson, PhD, vice president of artificial intelligence and data research at Nox Research.2

Sleep study autoscoring could also bypass human bias and lead to other novel insights about sleep that would typically be too difficult to detect with just the human eye. Some of the conventions of how sleep techs score studies are due to the practical limitations of manual scoring. 

“The way sleep stages are defined is very crude. It is obvious that we do not sleep in five discrete stages: wake, non-REM 1, non-REM 2, non-REM 3, and REM. We also do not sleep in 30-second intervals. So I think there is a lot of opportunity with autoscoring in better determining sleep depth as a continuous process and in determining the dynamics of sleep in a better way,” says Ágústsson.

Currently, manual scorers look at a night of PSG in 30-second epochs. But with other, perhaps more efficient scoring technologies that incorporate machine learning, we could look at smaller increments and get better resolution of a patient’s sleep, says Sum-Ping.
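To put the resolution point in numbers, here is a minimal Python sketch that counts how many scoring windows one recording yields at the conventional 30-second epoch length versus a shorter window. The 8-hour recording length, the 5-second window, and the function name are illustrative assumptions, not values taken from any scoring system mentioned in this article.

```python
def count_windows(recording_seconds: int, window_seconds: int) -> int:
    """Return the number of complete, non-overlapping scoring windows."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return recording_seconds // window_seconds


# Hypothetical 8-hour overnight recording (assumption for illustration).
night_seconds = 8 * 60 * 60

print(count_windows(night_seconds, 30))  # conventional 30-second epochs -> 960
print(count_windows(night_seconds, 5))   # hypothetical 5-second windows -> 5760
```

Under these assumptions, the shorter window gives six times as many scored intervals for the same night.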

There are also aspects of manual scoring that are tedious and can be inaccurate, Pelayo says, including the scoring of microarousals and sleep fragmentation. These events can be quantified more accurately with automatic scoring, he says.

Additionally, autoscoring leads to a higher level of sleep study scorer agreement, says Andrea Ramberg, CCSH, RPSGT, clinical director at autoscoring software marketer EnsoData and president of the Board of Registered Polysomnographic Technologists.

For instance, according to information provided by the company, EnsoData’s EnsoSleep software beats the national average agreement rates. “In a field where 85% agreement represents the gold standard,3 EnsoSleep consistently outperforms the national average, boasting an 86.6% overall agreement, exceeding the published inter-scorer reliability results,” says Ramberg.
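For readers unfamiliar with agreement figures like these, the Python sketch below shows one simple way such a number can be computed: the percentage of epochs to which two scorers assign the same sleep stage. This is an assumption for illustration only. The article does not describe how EnsoData or the cited reliability program calculates agreement, and the function name and stage labels here are made up.

```python
from typing import Sequence


def percent_agreement(scorer_a: Sequence[str], scorer_b: Sequence[str]) -> float:
    """Percentage of epochs to which two scorers assigned the same stage."""
    if len(scorer_a) != len(scorer_b):
        raise ValueError("both scorers must label the same number of epochs")
    if not scorer_a:
        raise ValueError("at least one epoch is required")
    matches = sum(a == b for a, b in zip(scorer_a, scorer_b))
    return 100.0 * matches / len(scorer_a)


# Made-up stage labels for ten consecutive epochs (not real study data).
tech = ["W", "N1", "N2", "N2", "N3", "N3", "N2", "REM", "REM", "W"]
auto = ["W", "N1", "N2", "N2", "N2", "N3", "N2", "REM", "REM", "N1"]

print(percent_agreement(tech, auto))  # 8 of 10 epochs match -> 80.0
```

Raw percent agreement does not correct for matches that would occur by chance, so it is only a starting point for comparing scorers.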

“A key characteristic of EnsoSleep is consistency. Much like a technologist reviewing sleep study raw data, our autoscoring model works directly from the waveform signals. However, our models have been trained on data from hundreds of thousands of past scoring examples,” says Sam Rusk, EnsoData co-founder and president. “While the same RPSGT may produce different results for the same sleep study, the same is not true for EnsoSleep. EnsoSleep will always produce the same results when presented with the same test. In the subjective field of sleep scoring, that consistency is unique.”

Still, even though autoscoring shows much promise, its algorithms can misjudge certain artifacts, such as sweat artifact, respiratory artifact, and eye blinks in the EEG (often mistaken for NREM3/deep sleep), says Heather Tomson, RPSGT, who works in customer success at sleep diagnostics company Cerebra.

Also, if there are issues with the respiratory belt channels, autoscoring can incorrectly label the type of apnea, whereas a technologist will be able to look at other parameters to help them accurately assign event types, Tomson says. If the patient is mouth breathing and a nasal pressure transducer is used, the algorithm can mistake this for apnea. If the quality of the recording is poor, it may be difficult for autoscoring to properly evaluate the test.

Stanford’s Pelayo says, “Just like humans make errors, automatic scoring will make errors.” He says there will always be a role that sleep technologists will play in deciphering sleep studies. Autoscoring will just be another tool that is integrated into the lab. 

“We do not have to worry about sleep technologists losing their work. Their work will transform, and this will be just another tool that they will get to know,” he says.

And while many labs have not yet embraced sleep study autoscoring, Pelayo says, “It will inevitably be part of what we do.”

Lisa Spear is associate editor of Sleep Review. 


1. Stephansen JB, Olesen AN, Olsen M, et al. Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nat Commun. 2018 Dec 6;9(1):5229.

2. Finnsson E, Ólafsdóttir GH, Loftsdóttir DL, et al. A scalable method of determining physiological endotypes of sleep apnea from a polysomnographic sleep study. Sleep. 2021 Jan 21;44(1):zsaa168.

3. Rosenberg RS, Van Hout S. The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring. J Clin Sleep Med. 2013 Jan 15;9(1):81-7.