Become a fan of Slashdot on Facebook

 



Forgot your password?
typodupeerror
×
Data Storage AI Wikipedia

New Hutter Prize Winner Achieves Milestone for Lossless Compression of Human Knowledge (mattmahoney.net) 60

Since 2006 Baldrson (Slashdot reader #78,598) has been part of the team verifying "The Hutter Prize for Lossless Compression of Human Knowledge," an ongoing challenge to compress a 100-MB excerpt of Wikipedia (approximately the amount a human can read in a lifetime).

"The intention of this prize is to encourage development of intelligent compressors/programs as a path to Artificial General Intelligence," explains the project's web site. 15 years ago, Baldrson wrote a Slashdot post explaining the logic (titled "Compress Wikipedia and Win AI Prize"): The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.
The amount of the prize also increases based on how much compression is achieved. (So if you compress the 1GB file x% better than the current record, you'll receive x% of the prize...) The first prize was awarded in 2006. And now Baldrson writes with the official news that this Spring another prize was claimed after reaching a brand new milestone: Artemiy Margaritov's STARLIT algorithm's 1.13% cleared the 1% improvement hurdle to beat the last benchmark, set by Alexander Rhatushnyak. He receives a bonus in proportion to the time since the last benchmark was set, raising his award by 60% to €9000. [$10,632 USD]

Congratulations to Artemiy Margaritov for his winning submission!

This discussion has been archived. No new comments can be posted.

New Hutter Prize Winner Achieves Milestone for Lossless Compression of Human Knowledge

Comments Filter:
  • Interesting concept (Score:5, Interesting)

    by AmiMoJo ( 196126 ) on Monday July 19, 2021 @07:16AM (#61596857) Homepage Journal

    The idea that you can use prediction (AI) to help improve compression is quite old but also quite promising. Essentially if you could train an AI to write like Dickens then it could reproduce the works of Dickens, or very nearly. If it's not 100% perfect you can include some additional correction data.

    That's kinda what FLAC does for audio. Lossy compression and then compress the difference between the lossy and the real data. It all hinges on how good the lossy compression is at producing output where the difference between lossy and real data is highly compressible.

    I see the winning entry is using Tensorflow for learning. The base cmix code it is built from includes over 2000 models for different types of data, but has been cut down to work optimally just with the specific text in the challenge. It's a shame there was not a greater mix of data types.

    • Then you find out that Dickens original work was full of off color humor and it was his editor who made the crap barely readable.

      • Then you find out that Dickens original work was full of off color humor and it was his editor who made the crap barely readable.

        Perhaps so, but then it's also his editor that's responsible for the ongoing widespread appeal of Dickens over a hundred years later, rather than leaving him to be some niche author that stood out in his day for the wrong reasons and was soon forgotten.

    • by GuB-42 ( 2483988 )

      Most (all?) of the best compression algorithms in terms of compression ratio work the same way, with only two steps
      1- Try to guess the next symbol, a symbol can be a bit, a byte, or any element of information, assign a probability to each possible outcome.
      2- Using an arithmetic coder write the actual result with a number of bits inversely proportional to how probable it was based on step 1. Note that by using an arithmetic coder, the number of bits doesn't have to be an integer, a symbol can take a quarter

      • That was indeed how some/many compressors used to work (along with string matching, which was also highly effective and quite different, usually the two are combined).
        Now however things have progressed, as single token compression is highly limited (even with matching contexts as above) - bleeding edge is now much more about tuned generators that, with some small guidance, produce extended amounts of content - this is a much much more powerful but complex approach that statistical forward prediction.

        And yes

    • The prize algorithm is interesting too since they will never have to pay out 100% of the prize since this would mean shrinking the data to zero.
    • Nope. As far as I can see, the intelligence part of this assumption is fundamentally flawed: Language is a symbolic system, i.e. semantic & pragmatic relationship to it are semi-systematic but arbitrary. It's the differences between locution, illicution, & perlocution (speech act theory). P-hacking large datasets will reveal regularities in the signs & their configurations but not necessarily meaningful ones, i.e. as you & I would parse & understand them. Maybe there's more to it than TF
      • by epine ( 68316 )

        You've fallen into the pit of parochialism which plagues the entire field of hard AI. You see regularities that the program does not. The program sees regularities that you do not. But not to worry. You have a parochial type theory which places your regularities on a higher rung. No reason. But it feels good.

        Egorithm:
        * if the mechanical intelligence is similar enough to human biological intelligence, then it falls short of being entirely human
        * if the mechanical intelligence is insufficiently similar to hum

        • Nope. Just psycholinguistics 101. What they're describing is programming a machine to literally 'ape' human behaviour, i.e. to imitate it without understanding what it means or what social function(s) it serves. If that's your idea of intelligence, it's vacuous.
    • The idea that you can use prediction (AI) to help improve compression is quite old but also quite promising.

      The contest is even older that and is motivated by the idea that good compression is equivalent to learning. See https://en.wikipedia.org/wiki/... [wikipedia.org]

      Of course, it's interesting that things are now backwards and they are using AI to get the good compression. This seems to coincide with the rise of deep learning.

      Essentially if you could train an AI to write like Dickens then it could reproduce the w

    • What the Dickens did you just say?
  • I note as an honorable mention that http://mattmahoney.net/dc/text.html#1102 lists as the smallest output nncp, by the most excellent Fabrice Bellard (of Lzexe, and JSlinux fame). This seems to be disqualified perhaps by requiring GPUs.

    This is 110M vs the 115 of the above.
  • One A4 page is about 2K of text. So 500 pages are one MB. That makes 100 MB about 50.000 pages. That doesn't sound much.

    I have read (until now, death not reached yet) a lot more than 3.000 books (that's the ones I have owned in paper). Even if we assume rather normal books (80.000 chars) thats more than double that amount.

    So that 100 MB lifetime amount seems to be rather pessimistic.

    P.S. Switched to a Kindle several years back as the paper based books were too much of burden. My wife and I kept about 1.000

    • by canajin56 ( 660655 ) on Monday July 19, 2021 @07:34AM (#61596883)
      According to the link it's 1GB, not 100MB. It also doesn't call it "the amount a human can read in a lifetime", it calls it "hopefully representative".
      • OK, just a bad summary then â¦

      • by AmiMoJo ( 196126 )

        There are two data sets, a 100MB one and a 1GB one. The smaller is designed to be similar enough to the larger one that you can use it for development and testing, because the larger one takes days to compress with the current top ranking software.

        The ranking is done on the size of the larger one plus the size of the tool used to compress it. There are some other limits like maximum RAM that can be used, maximum temporary disk space etc.

      • If the corpus size is 1GB then I can compress it to 1 bit using an executable slightly over 1GB in size. This is allowed by the terms of the competition AFAICT.
        • It is, but the size of the executable is added to the compressed data when determining how successful it was at compressing the data set, almost certainly because they're aware someone would try to game the system in the manner you're suggesting.

          • Fine. I will just output a symbol that I define to be that particular 1GB of information.
            • You don't define it to be that 1GB of information, you have it identify as that 1GB of information, then no-one will dare challenge you over it.

              I could also do the Compression by Citation (to use its technical name) with an application that's only a few hundred kB in size but that consumes 1GB of network traffic when run.

              It's actually more fun coming up with hacks than it is to spend months or years shaving another 0.015% off the compression ratio.

        • by kmoser ( 1469707 )
          I can compress all of Wikipedia to just 13 bytes: wikipedia.org.
      • Regardless of that, very few people read 30,000 books. Maybe 30,000 pages. Heck half of Slashdot doesn't even read the one paragraph summary and supposedly we're nerds.

        Typical people have a limit of about 140 characters before they want a tldr; which is one reason Twitter has hundreds of millions of users.

        Sure, I read a couple hundred pages of non-fiction a week, but I'm a freak. Few people do.

    • But if your A4 page has images on it, that will take a lot more space on a computer. ASCII/Unicode/whatever if you think about it is already a very highly effective form of compression.

    • "So that 100 MB lifetime amount seems to be rather pessimistic."

      Not for millennials, for them it's optimistic.

  • Text compression works by accumulating context information in a memory buffer. The accumulated context information is used for predicting the next symbol/bit. STARLIT makes better predictions by first ordering the articles according to similiarity. From the GitHub page:

    > During the searching phase, firstly, each article is mapped to a feature vector using a Doc2Vec model. Secondly, considering each feature vector to be a point in a Euclidean space, the Traveling Salesman Problem is solved resulting in t

    • by I3OI3 ( 1862302 )
      I did some research in this area (and have a patent for something similar, but different enough), and it's pretty cool.

      The next step to generalization is to analyze the vector space to see which dimensions (or vectors using PCA or something similar) are most predictive of compression ratios, and build a fast extractor for just those dimensions. Then run docs through your extractor, project it into a high-dimensional Hilbert space and BOOM! great compression done super-fast.
  • I don't get why the big fuss on storing the data to be Lossless for AI progress. The human brain looses a lot of information, I am not talking about information that we have forgot, but the way we remember things in general. The reason why Loss compression algorithms for video and audio are used all the time, we are just wired to pick up that much information and we filter out ourselves a lot of stuff unless we are really focused on it.

    You have an Audiophile walk into noisy bar to get get a drink and chat

    • by Junta ( 36770 )

      Think the AI angle is a bit silly on this front, but also in general this is an oddly specific endeavor without much real world applicability (trying to losslessly compress a specific corpus of material that will compromise performance on other real-world text not in that corpus). It reminds me of https://www.dangermouse.net/es... [dangermouse.net] for best image compression as measured by ability to compress the Lenna reference image (if empty file, spit out lenna, else decode like normal image file). So Lenna compresses t

      • by DavenH ( 1065780 )
        Don't fall into a Dunning-Kruger trap here. Read the paper [arxiv.org] and understand why compression and AI are intrinsically linked.
        • by Junta ( 36770 )

          I don't know about intrinsically linked. The paper proposes that a good AI would be able to compress text better than a symbol compression scheme, because humans store at about 1 bit per character (I find this declaration specifically dubious, but that's not pertinent to this). So this challenge is proposed as some sort of litmus test for quality of an AI.

          The paper back in 2007 you cited points out some concern that this particular parlor trick does not necessarily say anything about a 'general AI'. I think

          • by DavenH ( 1065780 )
            I realized too late I'd googled the wrong paper, but it's a worthy read in any case. I meant to post the one describing AIXI the agent that uses optimal compression with Solomonoff Induction. In any case, it's mathematically proven to be asymptotically optimal -- insuperable over long time scales and across all environments -- for a general agent. That's a lot to unpack, but it does imply how instrumentally important compression is.

            this post [slashdot.org] describes loosely how lossy compression and lossless are functio

            • You compress it so that it has most of the same meaning as before. The gist of it anyway. Now that would be impressive.

              Video and audio compression lose heaps, but they produce something that sounds like the original.

              A really smart compression might store the scene like it was in the script notes, e.g. "man rides horse into sunset". Then a scene like that is created. Looks completely different from the original, but has the same emotional impact...

    • While some people with photographic memory may remember a lot more details than others, there is still a fair amount of mental reduction going on.

      True photographic memory definitely does not exist - at least, no one has ever demonstrated it in controlled tests, which is the same difference.

      Eidetic memory, persistent visual images seen in children are not photographic, they are also reconstructions which can be shown by invented detail reported by the subject.

      The extremely rare people with persistent hyperthymesia (60 subjects ever found to date) are "poor forgetters", for the first few days after personal experiences their memory is exactly like norm

    • by DavenH ( 1065780 )
      You should read the FAQ to the Hutter Prize. He answers this one - lossy and lossless compression are nearly the same.

      A great lossless codec implies a great lossy codec, in all but the most trivial cases. That is because to achieve optimal compression, you need to have "lossy" compression do the bulk of the high-level representation of the data, and then to achieve lossless you fill in the gaps from the target with what amounts to small-amplitude noise. Because a better lossy codec will reduce the amplitud

    • Lossy compression requires a utility function. That resides in the decision theoretic framework (engineering) not the prediction framework (science). The Hutter Prize is about the latter (Algorithmic Information Theory).

    • "As a kid often will ask a lot of redundant questions. Does a Dog Eat, Does a Cat Eat, Does a Cow Eat.... At some point their brain will compress that data to All Animals Eat, so when they get older ..."

      It knows that it lives in a dog-eats-dog world.

    • and then someday you and the kids might learn that in fact there are animals that don't eat. Also ones that can go without eating for decades.

      Making absolute statements in biology often is a mistake.

      • Making absolute statements for anything is often a mistake. However, we can shortcut our learning with a general rule, and apply the rare exceptions for the oddball case.

        I can think All Mammals give live birth... Except for Mammals in the order of Monotreme. However if I run into a random Mammal, having me assume that it will give live birth would be a general idea. If it comes across me that it doesn't Ill just add it to the exception list (which is much smaller than the general rule algorithm).

        I find

        • Educated people with proper attitude love for their world to get thrown out of whack with new discoveries. That biosphere of ours is a crazy and wild place, and many things about the universe are still a mystery and we know our best models are wrong.

  • Or other general purpose compression algorithm?

    • You can easily see it at the linked website: Classic ZIP compresses the 100 MB corpus to 36% of its original size. Bzip2 goes down to 29%. xz (lzma2) leaves things at 24%. Strangely, 7zip, which should use a similar algorithm as xz manages to get it to 21%. The entry mentioned in this story goes down to 15%.
      I'd be careful with the numbers of the well known compresssors since for some they seem to have used pretty old versions although I'm not sure if newer versions modify the compression algorithms to cha
  • I had really hoped the universe would have corrected that bug by now.

  • I would not be surprised of Fabrice could code up a compression system that stores 100MB of Wikipedia in 42 bytes, plus a checksum.

    Fabrice Bellard,
    The Chuck Norris of programming.

    Cheers.

  • Gilfoyle:
    Look, you have two guys on either side with their d*cks, tip to tip, so you're going full-length. Four, see?

    Jared:
    Oh... From the middle out. That does make sense.

    Gilfoyle:
    Like two Shake Weights.

    https://youtu.be/Ex1JuIN0eaA [youtu.be]

  • OK. I'll also think of it as Occam's Razor on steroids.
  • They call it "God" or "Allah".

  • You can't compress all of human knowledge much beyond that!

    Well, binary would be more compact, 101010.

Keep up the good work! But please don't ask me to help.

Working...