**Hacker News** @h4ckernews@mastodon.social · Apr 7

Benchmarking LLM social skills with an elimination game

https://github.com/lechmazur/elimination_game

A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.

#HackerNews #Benchmarking #LLM

**PUPUWEB Blog** @pupuweb@mastodon.social · Apr 5

Meta plans to launch #Llama4 later this month after multiple delays, citing underperformance in reasoning & math benchmarks. #AI #MachineLearning #TechNews #LlamaAI #ArtificialIntelligence #Benchmarking #AIResearch

**Hacker News** @h4ckernews@mastodon.social · Apr 3

Benchi – A benchmarking tool written in Go

https://github.com/ConduitIO/benchi

Benchmark any tool from the CLI. Contribute to ConduitIO/benchi development by creating an account on GitHub.

#HackerNews #Benchi #Go

Replied in thread

**Xavier B.** @xibe@boitam.eu · Apr 1

@mariejulien In my opinion you haven't found your PMF (Pouët/Market Fit) yet.

Screenshot of a boost notification reading "badiane and 136 others boosted your post". The first line of the post is visible: "Like every April 1st, Le Gorafi (...)".

#knowYourAudience #benchmarking #notInKansasAnymore

**Joseph Simons** @josephsimons@mstdn.ca · Mar 27

"In its #Municipal #Benchmarking 2024 Study, the #CanadianHomeBuildersAssociation has ranked #Edmonton as the most builder-friendly city in #Canada for the second straight year. Edmonton ranked sixth for planning features, fourth for approval time, second for high-rise fees, and sixth for low-rise government fees."

https://www.chba.ca/assets/pdf/CHBA+Municipal+Benchmarking+Study-3rd+Edition-2024/?utm_source=Taproot+Edmonton&utm_campaign=11a28a527e-TAPROOTYEG_PULSE_2025_03_27&utm_medium=email&utm_term=0_ef1adf0932-11a28a527e-438152299&mc_cid=11a28a527e&mc_eid=2af62197a9

**LavX News** @lavxnews@mastodon.cloud · Mar 21

Unveiling the Truth: Document AI Benchmarking and Performance Insights

In a landscape saturated with claims of accuracy, a recent benchmark study sheds light on the realities of document AI performance. By evaluating different AI pipelines using the CUAD dataset, the fin...

https://news.lavx.hu/article/unveiling-the-truth-document-ai-benchmarking-and-performance-insights

#news #tech #Benchmarking

**C++Now** @cppnow@mastodon.social · Mar 20 *

C++Now 2025 SESSION ANNOUNCEMENT: Explore microbenchmark With beman.inplace_vector by River Wu

https://schedule.cppnow.org/session/2025/explore-microbenchmark-with-beman-inplace_vector/

#benchmarking #cplusplus #cpp

**B166IR** @b166ir@k2pk.com · Mar 12 *

https://youtu.be/J4qwuCXyAcU

This video compares Ollama and LM Studio (GGUF) and shows that their performance is quite similar; LM Studio’s tok/sec output is used for consistent benchmarking.

What’s even more impressive? The Mac Studio M3 Ultra pulls under 200W during inference with the Q4 671B R1 model. That’s quite amazing for such performance!

#LLMs #AI #MachineLearning

**Habr** @habr@zhub.link · Mar 3

[Translation] Evaluating large language models in 2025: five methods

Large language models (LLMs) have been advancing rapidly and have the potential to fundamentally transform AI. Accurate evaluation of LLMs is crucial because:
• Companies need to choose generative AI models to adopt. There are many base LLMs, and each has various modifications.
• After a model is chosen, it will be fine-tuned. If the model's performance is not measured accurately enough, users cannot assess the effectiveness of their efforts.

It is therefore necessary to determine:
• The optimal methods for evaluating models
• The appropriate type of data for training and testing models

Because evaluating LLM systems is a multidimensional task, it is important to develop a comprehensive methodology for measuring their performance. This article examines the main problems with existing evaluation methods and proposes solutions to address them.

https://habr.com/ru/articles/887290/

#llm #ai #benchmarking

**Andrew Jones (hpcnotes)** @hpcnotes@mast.hpc.social · Mar 1

UK based #HPC benchmarking role at Microsoft

Requires real experience with hands-on HPC #benchmarking: porting, compiling, tuning, performance analysis, etc. of scientific codes on HPC systems

https://buff.ly/fKfQz6j

**Habr** @habr@zhub.link · Feb 27

[Translation] Benchmarking AI agents: evaluating performance on real-world tasks

AI agents already handle real-world tasks, from customer service to complex data analytics. But how can you be sure they are actually effective? The answer lies in comprehensive evaluation of AI agents. For an AI system to be reliable and consistent, it is important to understand the types of AI agents and to know how to evaluate them properly. This is done with advanced techniques and proven AI agent evaluation frameworks. In this article we look at the key metrics, best practices, and main challenges that companies face when evaluating AI agents in enterprise environments.

https://habr.com/ru/articles/886198/

#ai_agent #benchmarking #ии_агенты

**Wizards Anonymous** @crft@mastodon.social · Feb 25

Curious which #OpenSource options #Wizards prefer to utilize for #Benchmarking #Disk / #SSD. :)

**HGPU group** @hgpu@mast.hpc.social · Feb 24

Evaluating the Performance of the DeepSeek Model in Confidential Computing Environment

#Security #DeepSeek #LLM #Cloud #Performance #Benchmarking

https://hgpu.org/?p=29782

The increasing adoption of Large Language Models (LLMs) in cloud environments raises critical security concerns, particularly regarding model confidentiality and data privacy. Confidential computin…

**Hacker News** @h4ckernews@mastodon.social · Feb 23

Benchmarking VLMs vs. Traditional OCR — https://getomni.ai/ocr-benchmark
#HackerNews #Benchmarking #VLMs #TraditionalOCR #AItechnology #MachineLearning #OCRbenchmark

getomni.ai: OCR Benchmark - Omni AI

Comprehensive benchmark of OCR accuracy across traditional OCR providers and multimodal Language Models

**LavX News** @lavxnews@mastodon.cloud · Feb 19

Benchmarking Made Easy: A Deep Dive into Go and Python Performance Testing

Benchmarking is crucial for software performance, and both Go and Python offer powerful tools for developers. This article explores how to effectively implement benchmarking in both languages, highlig...

https://news.lavx.hu/article/benchmarking-made-easy-a-deep-dive-into-go-and-python-performance-testing

#news #tech #Benchmarking
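
The teaser above is cut off, so the article's own examples are not visible here. As a minimal sketch of what a Python micro-benchmark can look like, assuming the standard-library `timeit` module is the tool in question: the two functions being compared (`join_concat`, `join_builtin`) and the helper `best_time` are hypothetical names chosen for illustration, not taken from the article.

```python
import timeit


def join_concat(n):
    """Build a string by repeated concatenation (illustrative workload)."""
    s = ""
    for i in range(n):
        s += str(i)
    return s


def join_builtin(n):
    """Build the same string with str.join (illustrative workload)."""
    return "".join(str(i) for i in range(n))


def best_time(func, n, repeat=5, number=200):
    """Return the best per-call time in seconds over several repeats.

    Taking the minimum of the repeats reduces noise from other processes.
    """
    timer = timeit.Timer(lambda: func(n))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for func in (join_concat, join_builtin):
        per_call_us = best_time(func, 1000) * 1e6
        print(f"{func.__name__}: {per_call_us:.1f} us per call")
```

The absolute numbers depend on the machine and the Python version, so only the relative difference between the two functions is worth reading.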

**Andrew Jones (hpcnotes)** @hpcnotes@mast.hpc.social · Feb 17

Olga Pearce from LLNL giving a talk on #benchmarking for #HPC at #MW25NZ

Proposing a specification for running HPC benchmarks - benchpark - to help automation, reuse, reproducibility, tracking, etc.

**Jeff Fortin T.** @nekohayo@mastodon.social · Feb 10

The rabbit-hole investigation of Nautilus' very slow folder loading performance on a cold disk cache continued this weekend.
Latest findings here: https://gitlab.gnome.org/GNOME/nautilus/-/issues/3374#note_2345406

A Sysprof flamegraph showing what happens when you repeatedly reload a 2000-items folder on a warm disk cache

A heatmap table showing that thumbnail checking attributes for files have a huge cost on folder load performance, the difference between 15+ seconds and 1-2 seconds.

#GNOMEFiles #Nautilus #GNOME

**James Yung** @pronoiac@mefi.social · Feb 8

Surely someone's looked into this: if I wanted to store millions or billions of files on a filesystem, I wouldn't store them in one single subdirectory / folder. I'd split them up into nested folders, so each folder held, say, 100 or 1000 or n files or folders. What's the optimum n for filesystems, for performance or space?
I've idly pondered how to experimentally gather some crude statistics, but it feels like I'm just forgetting to search some obvious keywords.
#BillionFileFS #linux #filesystems #optimization #benchmarking
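
One crude way to gather such statistics is sketched below. It is an assumption-laden starting point, not an answer to the question: the helper names (`sharded_path`, `time_layout`) are hypothetical, files are spread across nested directories by hex digits of a SHA-1 hash of the name, and the run uses only a few thousand empty files on a warm cache. A meaningful experiment would need far more files, a cold cache, and the target filesystem.

```python
import hashlib
import os
import tempfile
import time


def sharded_path(root, name, depth, width):
    """Map a file name to root/<d1>/<d2>/.../name using hex digits of its hash.

    Each level uses `width` hex digits, so a directory holds up to 16**width
    subdirectories. Both parameters are knobs for the experiment, not
    recommendations. depth=0 puts every file directly in root.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, name)


def time_layout(n_files, depth, width):
    """Create n_files empty files in a temp dir, then time os.stat on each.

    Returns (seconds spent creating, seconds spent in stat calls).
    """
    with tempfile.TemporaryDirectory() as root:
        paths = [sharded_path(root, f"file{i}.dat", depth, width)
                 for i in range(n_files)]

        start = time.perf_counter()
        for p in paths:
            os.makedirs(os.path.dirname(p), exist_ok=True)
            with open(p, "wb"):
                pass
        create_s = time.perf_counter() - start

        start = time.perf_counter()
        for p in paths:
            os.stat(p)
        stat_s = time.perf_counter() - start
    return create_s, stat_s


if __name__ == "__main__":
    # Compare a flat layout with one and two levels of 256-way sharding.
    for depth, width in [(0, 1), (1, 2), (2, 2)]:
        create_s, stat_s = time_layout(2000, depth, width)
        print(f"depth={depth} width={width}: "
              f"create {create_s:.3f}s, stat {stat_s:.3f}s")
```

Sweeping `depth` and `width` while scaling `n_files` upward gives a rough picture of how entries per directory affect creation and lookup time; disk space overhead would need a separate measurement.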

Continued thread

**Microsoft DevBlogs** @msftdevblogs@dotnet.social · Jan 31

Join the conversation and optimize your projects!

#VisualStudio #Benchmarking #PerformanceOptimization

This thread was auto-generated from the original post, which can be found here: https://devblogs.microsoft.com/visualstudio/benchmarking-with-visual-studio-profiler/.

Visual Studio Blog · Jan 7: Benchmarking with Visual Studio Profiler

We have updated BenchmarkDotNet diagnosers, allowing you to use more of the tools in the performance profiler to analyze benchmarks. With this change it is super quick to dig into CPU usage and allocations of benchmarks, making the measure, change, measure cycle quick and efficient.

**Leiden Madtrics** @leidenmadtrics@social.cwts.nl · Jan 30

New blogpost!

Benchmarking - an appropriate method for evaluating research units? Thed van Leeuwen and Frank van Vree explore possibilities and caveats, particularly in the context of the Dutch Strategy Evaluation Protocol (SEP).

You can read the bi-lingual post here:
𝘌𝘕𝘎 https://www.leidenmadtrics.nl/articles/benchmarking-in-research-evaluations-we-can-do-without-it
𝘕𝘓 https://www.leidenmadtrics.nl/articles/benchmarking-bij-onderzoeksevaluaties-we-kunnen-zonder

**#benchmarking** **#ResearchEvaluation**

www.leidenmadtrics.nl: Benchmarking in research evaluations: we can do without it

The Strategy Evaluation Protocol (SEP) 2021-2027 proposes benchmarking as a method for evaluating research units. But what exactly does this entail and what are the risks? Our authors dive deeper into this topic and show what is possible and what to be careful about.
