Google Gemini Outshines Human Health Coaches

Google Gemini might be just six months old, but it’s already showing impressive skills in security, coding, debugging, and more (though it still has its limitations). Now, it’s even beating humans when it comes to sleep and fitness advice.

Researchers at Google have introduced the Personal Health Large Language Model (PH-LLM), a version of Gemini that’s been fine-tuned to handle and interpret time-series personal health data from wearables like smartwatches and heart rate monitors. In their experiments, this model answered questions and made predictions better than seasoned health and fitness experts.

According to details Google recently published, PH-LLM is fine-tuned from Gemini and can read a user’s wearable data to provide personalized insights and recommendations. It also outperformed professional sleep and fitness experts on certification exams.

“Our work uses generative AI to expand model utility from predicting health states to providing contextual and potentially prescriptive outputs based on complex health behaviors,” the researchers explained.

Wearable technology can help people monitor and improve their health. These devices offer a wealth of data through passive and continuous tracking from various inputs like exercise and diet logs, mood journals, and sometimes even social media activity. However, this data, covering sleep, physical activity, cardiometabolic health, and stress, is rarely used in clinical settings due to its complexity and the effort required to interpret it.

While LLMs have performed well in medical question-answering, analyzing electronic health records, diagnosing from medical images, and conducting psychiatric evaluations, they’ve often struggled to interpret data from wearables. Google’s researchers made a significant breakthrough with PH-LLM, training it to make recommendations, answer professional-level questions, and predict self-reported sleep disruptions and impairments.

In their tests, PH-LLM achieved 79% in sleep exams and 88% in fitness exams, surpassing the average scores of human experts. Five professional athletic trainers (with an average of 13.8 years of experience) and five sleep medicine experts (with an average of 25 years of experience) achieved average scores of 71% in fitness and 76% in sleep.

For example, when asked to analyze a 50-year-old man’s sleep data, PH-LLM indicated he was having trouble falling asleep and emphasized the importance of deep sleep for physical recovery. It recommended keeping the bedroom cool and dark, avoiding naps, and maintaining a consistent sleep schedule.

For a fitness question about the type of muscular contraction that occurs during the lowering phase of a bench press, PH-LLM correctly identified “eccentric” among four multiple-choice options.

When asked to predict from wearable data whether a user would report trouble falling asleep, the model correctly indicated that the user had likely experienced this difficulty multiple times over the past month.

The researchers pointed out that while further development and safety evaluations are necessary, these results demonstrate both the broad knowledge and capability of Gemini models.

To achieve these results, the researchers drew on expert domain knowledge to create and curate three datasets that test personalized insights and recommendations based on physical activity, sleep patterns, and physiological responses. In collaboration with domain experts, they developed 857 case studies: 507 for sleep and 350 for fitness. The case studies combined wearable sensor data with demographic information and analyzed metrics such as sleep scores, heart rates, REM sleep percentages, steps, and fat-burning minutes.

“Our study shows that PH-LLM can integrate passively-acquired data from wearable devices into personalized insights and recommendations to improve sleep hygiene and fitness outcomes,” the researchers noted.

However, they acknowledged that PH-LLM is only a starting point and has flaws. The model’s responses were sometimes inconsistent, and they differed noticeably from one case study to another. The model was sometimes overly conservative, was sensitive to over-training, and did not always identify under-sleeping as a potential issue. Moreover, the case studies were not fully representative of the population, often focusing on relatively active individuals.

The researchers stressed the need for further work to ensure that LLMs are reliable, safe, and fair in personal health applications, including reducing inaccuracies, considering unique health circumstances, and ensuring the training data reflects diverse populations.

Still, they concluded, “The results from this study are an important step towards LLMs that deliver personalized information and recommendations, helping individuals achieve their health goals.”