#compression


Both ZFS and Btrfs are advanced file systems available on Linux. They both offer features like RAID, compression, and snapshots.

But ZFS is generally considered more mature, reliable, and robust. It is also a popular solution in enterprise environments.

Btrfs is a more Linux-centric solution. It tends to be easier to integrate and use with the Linux kernel.
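
As a quick illustration of the compression feature on both sides (a minimal sketch; the device, mount point, and dataset names are placeholders):

# Btrfs: mount a filesystem with transparent zstd compression
mount -t btrfs -o compress=zstd /dev/sdb1 /mnt/data

# ZFS: enable lz4 compression on a dataset
zfs set compression=lz4 tank/data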

#ZFS #Btrfs #file

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

arxiv.org/abs/2505.17117

arXiv.org: From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
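
For context, the compression-vs-meaning trade-off the abstract refers to is usually formalized along the lines of the Information Bottleneck objective (textbook form, not necessarily the paper's exact notation): find a compressed representation T of the input X that stays informative about a target Y by minimizing

\[ \mathcal{L}_{\mathrm{IB}} \;=\; I(X;T) \;-\; \beta\, I(T;Y) \]

where I(·;·) is mutual information, I(X;T) is the compression rate paid, I(T;Y) is the preserved meaning, and beta sets the trade-off; rate-distortion theory plays the same role with a distortion constraint in place of I(T;Y).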

I thoroughly enjoyed Antonio Somaini’s lecture tonight on the Politics of Latent Spaces at the conference "Art in the Age of Average: The New AI-thoritarians".

His reflections on compression as a cultural and epistemic process were truly inspiring — and the sources cited were excellent, too ;)

#AI #LatentSpaces #DigitalCulture #Compression #Vectorization #NeuralNetworks #ArtAndAI #MachineVision #epistemiccompression #AIAesthetics @databasecultures

Fascinating.

tmp $ wc -c < somefile.xopp 
735772
tmp $ file somefile.xopp 
somefile.xopp: gzip compressed data, from Unix, original size modulo 2^32 2086031
tmp $ gunzip < somefile.xopp |file -
/dev/stdin: XML 1.0 document, ASCII text, with very long lines (12483)
tmp $ gunzip < somefile.xopp |wc
    937  204466 2086031
tmp $ gunzip < somefile.xopp |bzip2 -9 |wc -c
619543
tmp $ gunzip < somefile.xopp |bzip3 |wc -c
575115
tmp $ gunzip < somefile.xopp |xz -9e |wc -c
519764
tmp $ gunzip < somefile.xopp |grep -m1 "^.stroke" |cut -c 1-160
<stroke tool="pen" color="#3333ccff" width="2.26 0.72691752 0.73026261 0.73809079 0.74588449 0.74364294 0.72915908 0.71467521 0.71133013 0.70908858 0.7057435 0.
tmp $ gunzip < somefile.xopp |grep -oE "\<[0-9]+\.[0-9]+\>" |wc -l
201692
tmp $ echo "735772/201692" |bc -l
3.64799793744917993772
tmp $ echo "519764/201692" |bc -l
2.57701842413184459472
tmp $ echo "2086031/201692" |bc -l
10.34265612914741288697
tmp $ 
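
If you want to reproduce the comparison for other compressors in one go, a minimal sketch (assumes the same somefile.xopp and that all four tools are installed):

# compare how small the decompressed XML gets with different compressors
for c in "gzip -9" "bzip2 -9" "bzip3" "xz -9e"; do
    printf '%-10s %s bytes\n' "$c" "$(gunzip < somefile.xopp | $c | wc -c)"
done

Note that Xournal++ itself expects its files gzip-compressed, so this only measures potential savings; it doesn't produce a usable .xopp.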

#Compression #XML #Xournal #Xournalpp #Xournal++

So @rl_dane introduced me to #bzip3 to use instead of #bzip2. Let's turn some bz2 files into bz3 to see the difference.

First example: 90k Opus files

The "Hey Snips" wake word dataset: ~90k Opus files in a 3.1GB tar. bzip2 produces the same 3.1GB, which is expected since Opus audio is already compressed. bzip3 gets it down to 3.0GB but uses tons of computation power. Not worth the 100MB.

Second example: Windows 7 VirtualBox VM image

Windows7.vdi is a Windows 7 VM image kept around for the "special" days. I think I have to get rid of it, but while it is still there, let's see how each performs. It is 16GB uncompressed; bzip2 -9 gets it to 7.0GB, bzip3 to 6.3GB but at the expense of roughly 3x the CPU time. Deleting all of them anyway. Down with Windows.

Third example: Pure XML text file

A pure XML file with Persian and English characters. Uncompressed it is 1.7GB; bzip2 -9 gives 276MB while bzip3 gives 260MB.

Final example: Creating a simple bomb

So I did this:

dd if=/dev/zero of=./justzero bs=2G count=6

So now I have a ~12GB file (6 blocks of 2GB) containing only zero bytes. bzip2 -9 squeezes it to 672KB; bzip3 to 46KB.
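
If you want to repeat this kind of comparison with timing included, a minimal sketch (assumes GNU time is installed and the justzero file from the dd command above exists):

# print compressed size and elapsed seconds for each compressor
for c in "bzip2 -9" "bzip3"; do
    /usr/bin/time -f "$c: %e s" sh -c "$c < justzero | wc -c"
done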

Conclusion

Thank you @rl_dane

Real nice thing!

So I was short on storage on my archive drive. I saw the librewolf source code: it was a tar.gz of ~800MB. I decompressed it and recompressed it with bzip2 -9, and now it's ~600MB. Generally #bzip2 compresses this kind of data better than #gzip.

Edit: But don't use bzip2 -9 everywhere. Sometimes -4 compresses about as well as -9 while the latter is tons slower. Also, there is pbzip2 for using all your CPU cores.
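
A minimal sketch of that recompression, done in a pipe so the tar contents never get unpacked (librewolf-src.tar.gz is a placeholder file name):

# turn a .tar.gz into a .tar.bz2 without extracting the archive contents
gunzip < librewolf-src.tar.gz | bzip2 -9 > librewolf-src.tar.bz2

# or use all CPU cores with pbzip2
gunzip < librewolf-src.tar.gz | pbzip2 -9 > librewolf-src.tar.bz2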