mstdn.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
A general-purpose Mastodon server with a 500 character limit. All languages are welcome.

Administered by:

Server stats:

14K
active users

#reasoning

8 posts8 participants0 posts today

Superhuman performance of an LLM on the reasoning tasks of a physician

arxiv.org/abs/2412.10849

arXiv logo
arXiv.orgSuperhuman performance of a large language model on the reasoning tasks of a physicianA seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly-selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compared LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments--both vignettes and emergency room second opinions--the LLM displayed superhuman diagnostic and reasoning abilities, as well as continued improvement from prior generations of AI clinical decision support. Our study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted, and motivating the urgent need for prospective trials.

The #OpenAI paper by Baker et al, "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation" comes to a troubling conclusion: #LLM s with #reasoning or #ChainOfThought (#CoT) capabilities might learn to obfuscate their own CoT from human users if they are being penalized for displaying "wrong" (i.e. reward hacking or misalignment) reasoning.

As a result, OpenAI strongly advises against applying reward pressure "directly" onto the CoT of a model.

🤔 While that is certainly the right thing to do, how long will #AI take to figure out that *indirect CoT pressure* is being applied anyway and that it could circumvent these restrictions by obfuscating its own CoT? Maybe something like this will happen by accident or within an "evolutionary" self-improvement loop. Perhaps a sufficiently advanced model will realize that its own #neuralese serves as #steganography to hide its intents from humans anyway and keep its CoT in non-English?

source: cdn.openai.com/pdf/34f2ada6-87

"There are very few things which we know, which are not capable of being reduced to a mathematical reasoning, and when they cannot, it's a sign our knowledge of them is very small and confused; and where a mathematical reasoning can be had, it's as great folly to make use of any other, as to grope for a thing in the dark when you have a candle standing by you." – John Arbuthnot (1667- 1735)
#quote #mathematics #maths #math #reasoning