Sixth 'Hutter Prize' Awarded for Achieving New Data Compression Milestone (hutter1.net)
Since 2006, Slashdot has been covering a contest CmdrTaco once summarized as "Compress Wikipedia and Win." It rewards progress on compressing a 1-billion-character excerpt of Wikipedia — approximately the amount that a human can read in a lifetime.
And today a new record was announced. The 1 billion characters have now been compressed to just 114,156,155 bytes — about 114 megabytes, or just 11.41% of the original size — by Saurabh Kumar, a New York-based quantitative developer for a high-frequency/algorithmic trading and financial services fund. The amount of each "Hutter Prize for Lossless Compression of Human Knowledge" increases based on how much compression is achieved (so if you compress the file x% better you receive x% of the prize). Kumar's compression was 1.04% smaller than the previous record, so they'll receive €5187.
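For those checking the arithmetic, here is a minimal sketch of the payout rule in Python (assuming the formula published on hutter1.net, payout = Z * (1 - S/L), with a 500,000 euro fund; the previous-record size is back-calculated for illustration, not quoted from the announcement):

# Sketch of the Hutter Prize payout arithmetic.
# Assumes payout = Z * (1 - S/L): Z = total fund, S = new size,
# L = previous record. Z = 500,000 euros per hutter1.net; L is
# inferred here from the stated 1.04% improvement, not quoted.
Z = 500_000                # total prize fund, euros
S = 114_156_155            # Saurabh Kumar's record, bytes
improvement = 0.0104       # "1.04% smaller than the previous record"

L = S / (1 - improvement)  # implied previous record, ~115.4 MB
payout = Z * (1 - S / L)   # equals Z * improvement by construction

print(f"implied previous record: {L:,.0f} bytes")
print(f"payout: {payout:,.0f} euros")  # ~5,200; the rounded 1.04% explains the small gap from 5187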
But "The intention of this prize is to encourage development of intelligent compressors/programs as a path to AGI," said Marcus Hutter (now a senior researcher at Google DeepMind) in a 2020 interview with Lex Fridman.
17 years after their original post announcing the competition, Baldrson (Slashdot reader #78,598) returns to explain the contest's significance to AI research, starting with a quote from mathematician Gregory Chaitin — that "Compression is comprehension."
But they emphasize that the contest also has one specific hardware constraint rooted in theories of AI optimization: The Hutter Prize is geared toward research in that it restricts computation resources to the most general-purpose hardware that is widely available. Why? As described in Sara Hooker's seminal paper "The Hardware Lottery", AI research is biased toward algorithms optimized for existing hardware infrastructure. While this hardware bias is justified for engineering (applying existing scientific understanding to the "utility function" of making money), to quote Hooker, it "can delay research progress by casting successful ideas as failures."
The complaint that this is "mere" optimization ignores the fact that this was done on general-purpose computing hardware, and is therefore in line with the spirit of Sara Hooker's admonition to researchers in "The Hardware Lottery". By showing how to optimize within the constraint of general-purpose computation, Saurabh's contribution may help point the way toward future directions in hardware architecture.
the amount that a human can read in a lifetime. (Score:2)
How did they work that out?
Re: (Score:2)
I'm just spit ballin', but it probably involved:
1. The rate at which an average human reads, in characters per unit of time.
2. The average waking life span of a human. (Knock off 6 years so you have time to learn to read)
Then multiply those two? Maybe I'm crazy though...
Re: (Score:2)
I don't know.
I read at about a page a minute (novel type). That's about 300 words per page, at maybe 4 characters per word on average.
The average man lives to 73. Take 3 off to learn to read. I can probably read about 14 hours a day. (You need to sleep, shower, eat.) That's:
70*365*14*60 = 21 * 10^6 minutes
So 300 * 4 * 21 Million = 25 Billion characters.
But that is a "I don't do anything else than read" type of estimate.
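The same back-of-the-envelope numbers in a few lines of Python (every input is this thread's guess, not a measured value):

# Back-of-the-envelope from the parent comment; all inputs are guesses.
words_per_page = 300
chars_per_word = 4        # rough average for English prose
pages_per_minute = 1
reading_years = 70        # 73-year lifespan minus 3 to learn to read
hours_per_day = 14        # the "do nothing but read" assumption

minutes = reading_years * 365 * hours_per_day * 60
chars = minutes * pages_per_minute * words_per_page * chars_per_word

print(f"{minutes:,} minutes, {chars:,} characters")
# -> 21,462,000 minutes, ~25.8 billion characters
# (the parent rounded to 21 million minutes and 25 billion characters)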
Re: (Score:2)
I don't know. I read at about a page a minute (novel type). That's about 300 words per page, at maybe 4 characters per word on average. The average man lives to 73. Take 3 off to learn to read. I can probably read about 14 hours a day. (You need to sleep, shower, eat.) That's: 70*365*14*60 = 21 * 10^6 minutes. So 300 * 4 * 21 million = 25 billion characters.
But that is a "I don't do anything else than read" type of estimate.
But no one in the history of humanity has ever had a life like that, spending 14 hours a day reading for 70 years. So the 1 billion characters being "approximately the amount that a human can read in a lifetime" is probably a lot closer to reality.
What happened to AI? (Score:1)
I thought AI was going to take all jobs. You would think AI would be able to find a compression algorithm much better than a human since it can run endless guess and test iterations.
Re:What happened to AI? (Score:5, Insightful)
The statistics are all there, in the file being compressed. No need to guess. The data is finite.
The whole tie-in to AI is that the data isn't arbitrary, it's statements in a language, and the algorithms they are developing are not for arbitrary data; they are for specific statements in that language.
It's correlation all the way down. Find useful correlations, encode them into a model, rinse, repeat, until you can find no more within the constraints. All the research is in the encoding of these models, not the exact encoding of the file (entropy compression is solved and now unpatented; entropy modelling isn't).
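To make the modelling-vs-coding split concrete, a toy sketch (nothing to do with any actual entry): an ideal coder always charges -log2(p) bits per character, so all improvement comes from the model that supplies p. A higher-order context model earns its keep by predicting better:

# Toy adaptive character models with add-one smoothing; the "coder"
# is just the ideal cost -log2(p), which is fixed. Only the model varies.
import math
from collections import defaultdict

text = "the cat sat on the mat. the cat sat on the hat. " * 20
alphabet = sorted(set(text))

def ideal_bits(order):
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    ctx = ""
    for ch in text:
        seen = counts[ctx]
        total = sum(seen.values()) + len(alphabet)   # add-one smoothing
        p = (seen[ch] + 1) / total
        bits += -math.log2(p)
        seen[ch] += 1
        ctx = (ctx + ch)[-order:] if order else ""
    return bits

for order in (0, 2):
    print(f"order-{order} model: {ideal_bits(order) / len(text):.2f} bits/char")
# Higher-order context -> higher probabilities on real text -> fewer bits.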
Re:Huh? (Score:5, Informative)
Singular they
Singular they, along with its inflected or derivative forms, them, their, theirs and themselves, is a gender-neutral third-person pronoun.
Re: (Score:3)
WOKE! /s Except, before some idiot says that in earnest: singular they dates back to Middle English, predating the culture wars and the use of the word "woke" by hundreds of years.
Somehow, with the rise in popularity of the word "woke", people seem to understand common English terms less and less.
How is that relevant? I mean do you really imagine that some cunt who insisted on being called "they" 400 years ago was any less insufferable and generally detested than some cunt who insists on being called "they" today?
Erm... I find it hard to debate English with a person who refuses to read and has spectacularly miscomprehended my post. And where does it say that they wanted to be called "they"?
Re: (Score:2)
The new thing is to use singular they with a specified antecedent. This is the problematic thing which some people have adopted in the last few years.
No, that's not 114 megabytes (Score:4, Informative)
The 1 billion characters have now been compressed to just 114,156,155 bytes — about 114 megabytes
No, that's not "about 114 megabytes". That's a little under 109 megabytes. Get it right.
Re: (Score:2)
At least they got the ratio right, since the original text was "1 billion characters" and not "1 GB".
Re: (Score:2)
1 character is not equal to one byte. 1 billion characters can be around 875MB or so if we are using ASCII, for example, by packing 8 characters into 7 bytes.
Or, if we want to be wasteful, they can be 4 bytes each using UTF-32 encoding, making it around 4GB or so. Wikipedia supports Unicode, after all.
Or, depending on the character set, 1 billion characters can be around 1GB using variable length
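The "8 characters in 7 bytes" packing mentioned above is easy to sketch (toy code for pure 7-bit ASCII input; any real compressor does far better, of course):

# Pack 7-bit ASCII by concatenating the low 7 bits of each character.
def pack7(s: str) -> bytes:
    acc = nbits = 0
    out = bytearray()
    for ch in s:
        code = ord(ch)
        assert code < 128, "7-bit ASCII only"
        acc = (acc << 7) | code
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:                        # flush any trailing bits
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

print(len(pack7("12345678")))        # 8 chars -> 7 bytes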
Re: (Score:2)
English Wikipedia is UTF-8, so you're right: 1 gigabyte holds fewer than 1 billion characters, assuming English Wikipedia has any characters at all that require more than 1 byte to represent, and that the contest organisers haven't made the effort to eliminate multi-byte characters from the test data.
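Easy to check in any language with Unicode strings (an illustration only; enwik9 is mostly ASCII, so its character and byte counts are close, but not equal):

# Characters vs bytes under UTF-8: ASCII is 1 byte/char, but anything
# outside ASCII takes 2-4 bytes, so byte count >= character count.
for s in ("plain ascii", "naïve café", "日本語"):
    print(f"{s!r}: {len(s)} chars, {len(s.encode('utf-8'))} bytes")
# 'plain ascii': 11 chars, 11 bytes
# 'naïve café': 10 chars, 12 bytes
# '日本語': 3 chars, 9 bytes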
The best type of correct (Score:5, Informative)
Re: (Score:3)
It's only "technically correct" in the context of standards written for the bodies that have standardised it to mean that. In general conversation words mean what the participants in the conversation agree that they mean.
Re: (Score:2)
Yeah, like how we all agreed that "pjt33" means "fun at parties".
I kid, I kid... but this isn't general conversation since it's intended for a wide (well, Slashdot-wide) audience, and we don't have to pretend to be linguists pontificating about the abuse of language on the mean streets.
(Meaning of a word != its standard definition) only works when nobody except the involved parties is going to see/hear it.
Also, the text to be compressed is 1,000,000,000 characters, or one gigabyte. Mixing the units between
Re: (Score:3)
Forget the mean streets: probably a substantial portion of the /. audience internalised gigabyte to mean 2^30 bytes before the IEC invented the gibi- prefix.
Re: (Score:2)
Unless and until base-10 computation becomes the standard and a byte becomes 10-somethings, a megabyte will always be 1,048,576 bytes.
The IEC is full of shit and bought off by the storage companies.
Re: (Score:1)
You are mixing up "bytes" with "words".
A byte cannot become a 10-bit unit. It is 8 bits, and traditionally could be either an 8-bit byte or a 7-bit character, which is now a byte too.
Words can be arbitrarily sized, but usually the length is divisible by 2.
Re: (Score:2)
The JEDEC standard defines 1 megabyte as 1024 kilobytes or 1,048,576 bytes. The IEC can fuck off with their mebibyte bullshit. In fact whenever possible I like to sow confusion by having 1MiB=1,000,000 bytes.
Re: (Score:2)
JEDEC doesn't rule the world or indeed have much influence on anything except RAM.
Memory buses make powers of two natural, but that doesn't apply to things like data storage or communications. It's a daft artefact of history that mega means different things in very close but context-dependent places.
As for confusion, why not just use the "megbibytes" used in floppy disks? 1024*1000 is the best possible compromise.
Re: (Score:2)
Not just RAM, all kinds of memories. Flash chips and EEPROMs too.
Windows uses proper megabytes for all file and disk sizes too. Microsoft did it right for once.
Re: (Score:2)
Memories such as flash and EEPROMs, sure, but not disks or tapes. Except flash now has so much dead space for wear levelling and discarding of dead blocks that, while internally the size is a nice near-power of two, externally you see somewhat less than that because it can't present the whole lot to you. Obviously talking about flash drives here, not raw flash chips.
But for data transfer, the normal 10 based system makes more sense, because you talk about buses in MHz (or GHz) and mega/giga transfers per s
Re: (Score:2)
Hmm, I don't know man. Bus speed is often given in gigabytes per second, and at least when I'm writing it I use powers of 2 for that. Baud has to be converted to bytes for many practical purposes, although granted for RS232 it's a division of 10.
I'm all for metric/SI but 34.359738368 """gigabytes""" of RAM?
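For anyone keeping score, the contested numbers both ways (a small helper; no position taken on which spelling is "right"):

# Render a byte count in SI (powers of 10) and IEC (powers of 2) units.
def si(n):
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:,.2f} {unit}"
        n /= 1000

def iec(n):
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:,.2f} {unit}"
        n /= 1024

for n in (114_156_155, 2**35):
    print(n, "->", si(n), "/", iec(n))
# 114156155   -> 114.16 MB / 108.87 MiB   (the disputed headline figure)
# 34359738368 -> 34.36 GB / 32.00 GiB     (the RAM example above)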
Re: (Score:2)
Just enjoying the low stakes nature of this by the way.
Bus speed is often given in gigabytes per second, and at least when I'm writing it I use powers of 2 for that.
Depends on the bus. PCIe is GT/s which is 4*10^9 kibibytes per second. Got some real 3.5" floppy energy there.
Baud has to be converted to bytes for many practical purposes, although granted for RS232 it's a division of 10.
yep, but then it's 1/10 of the baud rate, for bytes per second, so if the baud is 1,000,000, that's 100kib/s :)
I'm all for m
Re: (Score:2)
You know, I could have sworn that audio cassette storage capacity was given in powers of 2 back in the day.
Floppy disks too. While some used powers of 10, most of the early ones were powers of 2. For example, the Amiga 880k format stores 901,120 bytes, or exactly 880 real kilobytes. It made sense to express it that way because the disk sectors were 512 bytes.
Same with hard drives. The ST505 had 512 bytes per sector, 17 sectors per track, 306 tracks, and 4 heads, or 10,653,696 bytes of usable storage, which
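The geometry arithmetic above checks out (quick verification; the Amiga DD geometry of 2 sides, 80 cylinders, and 11 sectors of 512 bytes is from memory, so treat it as an assumption):

# CHS-style capacity: bytes/sector * sectors/track * tracks * heads.
def capacity(bps, spt, tracks, heads):
    return bps * spt * tracks * heads

amiga = capacity(512, 11, 80, 2)   # assumed Amiga DD geometry
print(amiga, amiga // 1024)        # 901120 bytes = exactly 880 "real" KB

st = capacity(512, 17, 306, 4)     # figures from the parent comment
print(st)                          # 10653696 bytes, as stated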
Re: (Score:2)
It was the 3.5" drive: 1.44 * 1024 * 1000 bytes. The pokemon approach to suffixes.
Re: (Score:2)
Indeed. They should really have invented two new words, rather than trying to redefine "kilobyte" to mean something new. Whenever you see megabyte or MB, you have no real way of knowing what it actually means. Is it the new power of 10 type, or is it JEDEC, or is it just old?
Re: (Score:2)
At least with "mebibyte" you know :D
Re: (Score:1)
Thanks for your comment. I truly laughed out loud and enjoyed it!
Re: (Score:2)
You didn't submit an executable archive of enwik9 purported to expand into a file that matched bit for bit. While you also failed in some other ways to qualify, that is now the first bar you must clear, before any further investment by the judging committee.
Or are you Yann LeCun out to pull my leg again?
https://twitter.com/ylecun/sta... [twitter.com]
Re: Yes it is (Score:2)
You are thinking of *mebibytes*
https://tecadmin.net/differenc... [tecadmin.net]
I can definitely beat that compression ratio (Score:2)
42
There, done!
That writing style is indicative of craziness (Score:1)
If you're not alarmed when you see text with so many words emphasized, you may be on the wrong side of that divide. People who sprinkle bold (and worse, color) all over their texts to make a point usually do so because they feel they're not being understood, and usually for good reason.
Very puzzling... (Score:2, Interesting)
There are obvious reasons to demand lossless compression for a variety of practical applications; but in the context of 'comprehension', requiring losslessness is essentially identical to forbidding comprehension, since losslessness implies that all bits are of equal importance (you have to be able to reproduce them all), while everything we've seen that looks l
Re: Very puzzling... (Score:2)
So what you are saying is we could still read wikipedia if th chratrs wer nt al der jus fin.
No one wants to do that even if we can read it.
Re: (Score:2)
At least two-thirds of your sentence is fully reconstructable by running it through a context-free spell checker. A context-aware spell checker would be able to reconstruct the entire sentence.
Basically, if you're able to understand what you just wrote, then we are capable of reconstructing it, so you don't need to worry about the missing characters: comprehension is preserved even though the form is objectively lossy.
The issue comes more with edge cases, e.g. where a missing comma could cha
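The context-free half of that claim fits in a few lines (a toy sketch; the vocabulary is hand-picked for this example, which is exactly the part a real contextual model would have to supply):

# Match each mangled token against a small vocabulary by edit similarity.
import difflib

vocab = ("so what you are saying is we could still read wikipedia "
         "if the characters were not all there just fine").split()

mangled = "th chratrs wer nt al der jus fin"
fixed = [
    (difflib.get_close_matches(w, vocab, n=1, cutoff=0.5) or [w])[0]
    for w in mangled.split()
]
print(" ".join(fixed))
# Most tokens snap back to the original; the stragglers are where
# context, not spelling distance, has to do the work.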
Re: (Score:3)
The very thing you think these compressors are missing, they are doing internally better than you can imagine.
Requiring replication of the file is a measurement of the efficiency of the model it is using. It's the step that turns the model probabilities into knowledge. If the probabilities the model is giving are wro
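That "probabilities into knowledge" step has an exact price tag: an ideal (arithmetic) coder spends -log2(p) bits on a symbol the model predicted with probability p. Nothing here is specific to the winning entry; it's just Shannon:

# Ideal code length for a symbol predicted with probability p.
import math

for p in (0.9, 0.5, 0.1, 0.001):
    print(f"p = {p:<6} -> {-math.log2(p):5.2f} bits")
# p = 0.9    ->  0.15 bits   (good prediction, nearly free)
# p = 0.5    ->  1.00 bits
# p = 0.1    ->  3.32 bits
# p = 0.001  ->  9.97 bits   (bad prediction, heavily punished)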
Re: (Score:2)
I believe that OP was getting at the difference between memorization/knowledge and understanding/comprehension.
If you memorize a sentence, you can repeat it many times.
If you understand a sentence, you can express it many different ways.
The most efficient encoding may capture concepts and require extra bits to maintain specific word choices and sentence structures.
Re: (Score:2)
The idea is that by finding more ways in which data correlates, and therefore can be reduced to just the correlations rather than the actual data, the amount of general intelligence needed to understand it can be reduced too.
One of the big problems in AGI is what we might call common sense. Humans have a lot of life experience to draw on; they know how the world works on a deep level, to the point where they mostly don't have to consciously think about it. For AGI to have that understanding it was thought th
Txt Me (Score:2)
1 billion characters
Characters? Am I the only one who uses Wikipedia exclusively for the pictures and audio samples?!
Silicon Valley anyone? (Score:3)
This reminds me of Pied Piper's compression algorithm in the excellent HBO comedy "Silicon Valley" - I wonder if that was inspired by the Hutter Prize? One big mistake they made early on with the Hutter Prize was not insisting that all contestants make their entries Open Source.
Considering that later winners often built on the work of earlier winners, this caused a slowdown in winning entries. Fortunately, they corrected this mistake and the latest winner did indeed build on the work of a previous winner thanks to the latter's source code being available. This is a classic example where being closed source severely hinders progress.
Re: (Score:2)
IIRC, only one entry was closed source. You may be thinking of Matt Mahoney's Large Text Compression Benchmark [mattmahoney.net] where the top contender is frequently closed source.
That the machine learning world has yet to recognize lossless compression as the most principled loss function is a tragedy, but it is due to a lot more than that entry. This failure stretches back to when Solomonoff's
Re: (Score:1)
Hi Baldrson,
I find that very interesting. It's a different take on epistemology & philosophy of science.
This is the first time I've read about this.
Are there other interesting reference works in this line of thinking?
Re: (Score:2)
Probably the best introduction would be Hutter's 2010 paper titled "A Complete Theory of Everything (Will Be Subjective) [arxiv.org]". From there you can follow its citations in google scholar [google.com].
Ya, but in the spirit of things (Score:2)
Shouldn't it really be "Huter Prize" -- which is 8.3% shorter? :-)
Middle out, obviously (Score:3, Funny)
... as we learned in Silicon Valley