New Hutter Prize Winner Achieves Milestone for Lossless Compression of Human Knowledge (mattmahoney.net) 60
Since 2006 Baldrson (Slashdot reader #78,598) has been part of the team verifying "The Hutter Prize for Lossless Compression of Human Knowledge," an ongoing challenge to compress a 100-MB excerpt of Wikipedia (approximately the amount a human can read in a lifetime).
"The intention of this prize is to encourage development of intelligent compressors/programs as a path to Artificial General Intelligence," explains the project's web site. 15 years ago, Baldrson wrote a Slashdot post explaining the logic (titled "Compress Wikipedia and Win AI Prize"): The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.
The amount of the prize also increases based on how much compression is achieved. (So if you compress the 1GB file x% better than the current record, you'll receive x% of the prize...) The first prize was awarded in 2006. And now Baldrson writes with the official news that this Spring another prize was claimed after reaching a brand new milestone: Artemiy Margaritov's STARLIT algorithm's 1.13% cleared the 1% improvement hurdle to beat the last benchmark, set by Alexander Rhatushnyak. He receives a bonus in proportion to the time since the last benchmark was set, raising his award by 60% to €9000. [$10,632 USD]
Congratulations to Artemiy Margaritov for his winning submission!
"The intention of this prize is to encourage development of intelligent compressors/programs as a path to Artificial General Intelligence," explains the project's web site. 15 years ago, Baldrson wrote a Slashdot post explaining the logic (titled "Compress Wikipedia and Win AI Prize"): The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.
The amount of the prize also increases based on how much compression is achieved. (So if you compress the 1GB file x% better than the current record, you'll receive x% of the prize...) The first prize was awarded in 2006. And now Baldrson writes with the official news that this Spring another prize was claimed after reaching a brand new milestone: Artemiy Margaritov's STARLIT algorithm's 1.13% cleared the 1% improvement hurdle to beat the last benchmark, set by Alexander Rhatushnyak. He receives a bonus in proportion to the time since the last benchmark was set, raising his award by 60% to €9000. [$10,632 USD]
Congratulations to Artemiy Margaritov for his winning submission!
Interesting concept (Score:5, Interesting)
The idea that you can use prediction (AI) to help improve compression is quite old but also quite promising. Essentially if you could train an AI to write like Dickens then it could reproduce the works of Dickens, or very nearly. If it's not 100% perfect you can include some additional correction data.
That's kinda what FLAC does for audio. Lossy compression and then compress the difference between the lossy and the real data. It all hinges on how good the lossy compression is at producing output where the difference between lossy and real data is highly compressible.
I see the winning entry is using Tensorflow for learning. The base cmix code it is built from includes over 2000 models for different types of data, but has been cut down to work optimally just with the specific text in the challenge. It's a shame there was not a greater mix of data types.
Re: (Score:2)
Then you find out that Dickens original work was full of off color humor and it was his editor who made the crap barely readable.
Re: (Score:3)
Then you find out that Dickens original work was full of off color humor and it was his editor who made the crap barely readable.
Perhaps so, but then it's also his editor that's responsible for the ongoing widespread appeal of Dickens over a hundred years later, rather than leaving him to be some niche author that stood out in his day for the wrong reasons and was soon forgotten.
Re: (Score:2)
Most (all?) of the best compression algorithms in terms of compression ratio work the same way, with only two steps
1- Try to guess the next symbol, a symbol can be a bit, a byte, or any element of information, assign a probability to each possible outcome.
2- Using an arithmetic coder write the actual result with a number of bits inversely proportional to how probable it was based on step 1. Note that by using an arithmetic coder, the number of bits doesn't have to be an integer, a symbol can take a quarter
Sorry, not any more. (Score:2)
That was indeed how some/many compressors used to work (along with string matching, which was also highly effective and quite different, usually the two are combined).
Now however things have progressed, as single token compression is highly limited (even with matching contexts as above) - bleeding edge is now much more about tuned generators that, with some small guidance, produce extended amounts of content - this is a much much more powerful but complex approach that statistical forward prediction.
And yes
Prize Algorithm Interesting Too (Score:2)
Re: Interesting concept (Score:2)
Re: (Score:2)
You've fallen into the pit of parochialism which plagues the entire field of hard AI. You see regularities that the program does not. The program sees regularities that you do not. But not to worry. You have a parochial type theory which places your regularities on a higher rung. No reason. But it feels good.
Egorithm:
* if the mechanical intelligence is similar enough to human biological intelligence, then it falls short of being entirely human
* if the mechanical intelligence is insufficiently similar to hum
Re: (Score:2)
Re: (Score:2)
The contest is even older that and is motivated by the idea that good compression is equivalent to learning. See https://en.wikipedia.org/wiki/... [wikipedia.org]
Of course, it's interesting that things are now backwards and they are using AI to get the good compression. This seems to coincide with the rise of deep learning.
Re: (Score:1)
But with GPUs! (Score:2)
This is 110M vs the 115 of the above.
Re: But with GPUs! (Score:2)
It uses Libnc which can run on CPU, but it's closed source. That's what disqualifies it.
Re: (Score:2)
Which seems very odd. Libnc says it's just an auto-differentiating tensor library. For 10k I think I'd be happy to translate my thing to use one of the many open source tensor math libraries.
Re: (Score:2)
Really? 100MB Lifietime? (Score:2)
One A4 page is about 2K of text. So 500 pages are one MB. That makes 100 MB about 50.000 pages. That doesn't sound much.
I have read (until now, death not reached yet) a lot more than 3.000 books (that's the ones I have owned in paper). Even if we assume rather normal books (80.000 chars) thats more than double that amount.
So that 100 MB lifetime amount seems to be rather pessimistic.
P.S. Switched to a Kindle several years back as the paper based books were too much of burden. My wife and I kept about 1.000
Re:Really? 100MB Lifietime? (Score:5, Informative)
Re: Really? 100MB Lifietime? (Score:2)
OK, just a bad summary then â¦
Re: (Score:3)
OK, just a bad summary then
Welcome to /.
Re: (Score:2)
There are two data sets, a 100MB one and a 1GB one. The smaller is designed to be similar enough to the larger one that you can use it for development and testing, because the larger one takes days to compress with the current top ranking software.
The ranking is done on the size of the larger one plus the size of the tool used to compress it. There are some other limits like maximum RAM that can be used, maximum temporary disk space etc.
Re: (Score:2)
Re: (Score:2)
It is, but the size of the executable is added to the compressed data when determining how successful it was at compressing the data set, almost certainly because they're aware someone would try to game the system in the manner you're suggesting.
Re: (Score:1)
Re: (Score:2)
You don't define it to be that 1GB of information, you have it identify as that 1GB of information, then no-one will dare challenge you over it.
I could also do the Compression by Citation (to use its technical name) with an application that's only a few hundred kB in size but that consumes 1GB of network traffic when run.
It's actually more fun coming up with hacks than it is to spend months or years shaving another 0.015% off the compression ratio.
Re: (Score:2)
Also, we're not normal. 140 characters is normal (Score:2)
Regardless of that, very few people read 30,000 books. Maybe 30,000 pages. Heck half of Slashdot doesn't even read the one paragraph summary and supposedly we're nerds.
Typical people have a limit of about 140 characters before they want a tldr; which is one reason Twitter has hundreds of millions of users.
Sure, I read a couple hundred pages of non-fiction a week, but I'm a freak. Few people do.
Re: (Score:2)
But if your A4 page has images on it, that will take a lot more space on a computer. ASCII/Unicode/whatever if you think about it is already a very highly effective form of compression.
Re: Really? 100MB Lifietime? (Score:2)
I was assuming the most efficient use as 100 MB of text. And even there it does not hold up. It is easy, even for an illiterate to read 100 MB in pictures ðY
Re: (Score:2)
"So that 100 MB lifetime amount seems to be rather pessimistic."
Not for millennials, for them it's optimistic.
Uses traveling salesman on Doc2Vec vectors (Score:1)
Text compression works by accumulating context information in a memory buffer. The accumulated context information is used for predicting the next symbol/bit. STARLIT makes better predictions by first ordering the articles according to similiarity. From the GitHub page:
> During the searching phase, firstly, each article is mapped to a feature vector using a Doc2Vec model. Secondly, considering each feature vector to be a point in a Euclidean space, the Traveling Salesman Problem is solved resulting in t
Re: (Score:1)
The next step to generalization is to analyze the vector space to see which dimensions (or vectors using PCA or something similar) are most predictive of compression ratios, and build a fast extractor for just those dimensions. Then run docs through your extractor, project it into a high-dimensional Hilbert space and BOOM! great compression done super-fast.
Re: (Score:1)
compress a 100-MB excerpt vs compress the 1GB file (Score:1)
"Editors"
Human Brains store lossy information. (Score:2)
I don't get why the big fuss on storing the data to be Lossless for AI progress. The human brain looses a lot of information, I am not talking about information that we have forgot, but the way we remember things in general. The reason why Loss compression algorithms for video and audio are used all the time, we are just wired to pick up that much information and we filter out ourselves a lot of stuff unless we are really focused on it.
You have an Audiophile walk into noisy bar to get get a drink and chat
Re: (Score:3)
Think the AI angle is a bit silly on this front, but also in general this is an oddly specific endeavor without much real world applicability (trying to losslessly compress a specific corpus of material that will compromise performance on other real-world text not in that corpus). It reminds me of https://www.dangermouse.net/es... [dangermouse.net] for best image compression as measured by ability to compress the Lenna reference image (if empty file, spit out lenna, else decode like normal image file). So Lenna compresses t
Re: (Score:2)
Re: (Score:2)
I don't know about intrinsically linked. The paper proposes that a good AI would be able to compress text better than a symbol compression scheme, because humans store at about 1 bit per character (I find this declaration specifically dubious, but that's not pertinent to this). So this challenge is proposed as some sort of litmus test for quality of an AI.
The paper back in 2007 you cited points out some concern that this particular parlor trick does not necessarily say anything about a 'general AI'. I think
Re: (Score:2)
this post [slashdot.org] describes loosely how lossy compression and lossless are functio
Lossy compression would be harder (Score:2)
You compress it so that it has most of the same meaning as before. The gist of it anyway. Now that would be impressive.
Video and audio compression lose heaps, but they produce something that sounds like the original.
A really smart compression might store the scene like it was in the script notes, e.g. "man rides horse into sunset". Then a scene like that is created. Looks completely different from the original, but has the same emotional impact...
Re: (Score:2)
While some people with photographic memory may remember a lot more details than others, there is still a fair amount of mental reduction going on.
True photographic memory definitely does not exist - at least, no one has ever demonstrated it in controlled tests, which is the same difference.
Eidetic memory, persistent visual images seen in children are not photographic, they are also reconstructions which can be shown by invented detail reported by the subject.
The extremely rare people with persistent hyperthymesia (60 subjects ever found to date) are "poor forgetters", for the first few days after personal experiences their memory is exactly like norm
Re: (Score:3)
A great lossless codec implies a great lossy codec, in all but the most trivial cases. That is because to achieve optimal compression, you need to have "lossy" compression do the bulk of the high-level representation of the data, and then to achieve lossless you fill in the gaps from the target with what amounts to small-amplitude noise. Because a better lossy codec will reduce the amplitud
Re: Human Brains store lossy information. (Score:2)
Lossy compression requires a utility function. That resides in the decision theoretic framework (engineering) not the prediction framework (science). The Hutter Prize is about the latter (Algorithmic Information Theory).
Re: (Score:2)
"As a kid often will ask a lot of redundant questions. Does a Dog Eat, Does a Cat Eat, Does a Cow Eat.... At some point their brain will compress that data to All Animals Eat, so when they get older ..."
It knows that it lives in a dog-eats-dog world.
Re: (Score:1)
and then someday you and the kids might learn that in fact there are animals that don't eat. Also ones that can go without eating for decades.
Making absolute statements in biology often is a mistake.
Re: (Score:2)
Making absolute statements for anything is often a mistake. However, we can shortcut our learning with a general rule, and apply the rare exceptions for the oddball case.
I can think All Mammals give live birth... Except for Mammals in the order of Monotreme. However if I run into a random Mammal, having me assume that it will give live birth would be a general idea. If it comes across me that it doesn't Ill just add it to the exception list (which is much smaller than the general rule algorithm).
I find
Re: (Score:1)
Educated people with proper attitude love for their world to get thrown out of whack with new discoveries. That biosphere of ours is a crazy and wild place, and many things about the universe are still a mystery and we know our best models are wrong.
Re: (Score:2)
Really, I can realistically condense every article on Wikipedia down to 10% of its size, losing not one single element of information.
Ok. Since Wikipedia is the "the free encyclopedia that anyone can edit", I suggest you get to work.
how does it compares to lzma2 (Score:2)
Or other general purpose compression algorithm?
Re: (Score:2)
I'd be careful with the numbers of the well known compresssors since for some they seem to have used pretty old versions although I'm not sure if newer versions modify the compression algorithms to cha
Fuck, is Baldrson still alive? (Score:2)
I had really hoped the universe would have corrected that bug by now.
This looks like a challenge for Fabrice Bellard (Score:1)
I would not be surprised of Fabrice could code up a compression system that stores 100MB of Wikipedia in 42 bytes, plus a checksum.
Fabrice Bellard,
The Chuck Norris of programming.
Cheers.
Middle out? (Score:2)
Gilfoyle:
Look, you have two guys on either side with their d*cks, tip to tip, so you're going full-length. Four, see?
Jared:
Oh... From the middle out. That does make sense.
Gilfoyle:
Like two Shake Weights.
https://youtu.be/Ex1JuIN0eaA [youtu.be]
Think of it as Ockham's Razor on steroids (Score:1)
Lots of people do thia already (Score:1)
They call it "God" or "Allah".
42 (Score:2)
You can't compress all of human knowledge much beyond that!
Well, binary would be more compact, 101010.