Data Storage Hardware Linux

Writing Linux Kernel Functions In CUDA With KGPU

An anonymous reader writes "Until today, GPGPU computing was a userspace privilege because of NVIDIA's closed-source policy and AMD's semi-open state. KGPU is a workaround to enable Linux kernel functionality written in CUDA. Instead of figuring out GPU specs via reverse-engineering, it simply uses a userspace helper to do CUDA-related work for kernelspace requesters. A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions. With the accelerated performance of a GPU AES cipher in the Linux kernel, eCryptfs can get a 3x uncached read speedup and near 4x write speedup on an Intel X25-M 80G SSD. However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. A CTR (counter mode) cipher may be much more secure, although the real vanilla eCryptfs uses CBC mode. Anyway, GPU vendors should think about opening their drivers and computing libraries, or at least providing a mechanism to make it easy to do GPU computing inside an OS kernel, given the fact that GPUs are so widely deployed and the potential future of heterogeneous operating systems."
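
The helper arrangement described in the summary can be pictured with a toy model. The sketch below is not KGPU's interface: every name in it (`call_service`, `helper_loop`, `gpu_stand_in`, the `xor_mask` service) is hypothetical, threads stand in for the kernel/userspace boundary, and a byte-wise XOR stands in for a CUDA kernel launch. It only illustrates the shape of the idea: a requester queues work and blocks, and a helper does the computation and replies.

```python
import queue
import threading

# Hypothetical names throughout: this models the request/response pattern
# only and is not KGPU's real interface.
_requests = queue.Queue()

def gpu_stand_in(data: bytes) -> bytes:
    """Placeholder for a CUDA kernel launch; here just a byte-wise XOR."""
    return bytes(b ^ 0x5A for b in data)

SERVICES = {"xor_mask": gpu_stand_in}

def helper_loop() -> None:
    """User-space helper: serve requests until a None sentinel arrives."""
    while True:
        req = _requests.get()
        if req is None:
            break
        name, payload, reply = req
        reply.put(SERVICES[name](payload))

def call_service(name: str, payload: bytes) -> bytes:
    """Kernel-side requester: enqueue work, then block for the result."""
    reply = queue.Queue(maxsize=1)
    _requests.put((name, payload, reply))
    return reply.get(timeout=5)

def demo() -> bytes:
    """Start a helper, issue one request, then shut the helper down."""
    helper = threading.Thread(target=helper_loop)
    helper.start()
    try:
        return call_service("xor_mask", b"kernel buffer")
    finally:
        _requests.put(None)
        helper.join()

if __name__ == "__main__":
    print(demo())
```

Every request in this model pays for a queue hand-off in each direction, which is the same overhead the comments raise about round trips between kernel space and user space.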

  • Wonder how this compares in performance to AES-NI [wikipedia.org], because it sure as hell sounds a lot more complex and fragile.
    • by adisakp ( 705706 )
      It might be more "complicated" but it's probably more useful, since currently a lot more systems have GPUs than AES-NI, given that AES-NI is only on a subset of Intel's most recent CPUs.
      • The problem is that parallelized encryption is not as secure as the other modes. Let me show you the difference between CBC, ECB and CTR ( block(i) means the i'th block of data)

        1) CBC
        CBC(pwd, block(i)) == encrypt(pwd, block(i)) xor block(i-1)
        * block(-1) = hash(pwd, 0) (sometimes half the password is used as block(-1))

        2) ECB
        ECB(block(i)) = encrypt(block(i))

        3) CTR
        CTR(block(i)) = encrypt(block(i)) xor i

        I hope it's obvious why ECB and CTR are the only candidates for parallelization. CBC

        • by slew ( 2918 )

          Just a couple small nits to pick..

          Although CBC encryption needs to be done in sequence, CBC decryption can be done mostly in parallel (don't have to wait until you do the AES part of the previous block)...

          Also security is better than other modes only in some cases. As a trivial example, in CBC it's easier to tamper with the plain-text: all you have to do to flip a bit in the plaintext of a CBC encrypted stream is to flip that same bit in the previous block's cipher-text. Although that kills that previou
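
Both CBC points above can be checked with a small sketch. It uses the standard CBC definition, in which each plaintext block is XORed with the previous ciphertext block before encryption. The block cipher is a toy 4-round Feistel network built on SHA-256, an assumption made only so the example runs on the standard library; it is not AES and not secure. All function names are illustrative.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _f(key: bytes, rnd: int, half: bytes) -> bytes:
    """Feistel round function (toy): keyed hash of one 8-byte half."""
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Toy invertible 16-byte block cipher. NOT AES, NOT secure."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        left, right = right, xor(left, _f(key, rnd, right))
    return left + right

def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _f(key, rnd, left)), left
    return left + right

def cbc_encrypt(key, iv, blocks):
    """Sequential: each block needs the previous ciphertext block."""
    out, prev = [], iv
    for plain in blocks:
        prev = toy_encrypt(key, xor(plain, prev))
        out.append(prev)
    return out

def cbc_decrypt_block(key, iv, cipher, i):
    """Needs only cipher[i-1] and cipher[i], so blocks are independent."""
    prev = iv if i == 0 else cipher[i - 1]
    return xor(toy_decrypt(key, cipher[i]), prev)

KEY, IV = b"demo key", bytes(16)
PLAIN = [b"A" * 16, b"pay alice $0100 ", b"C" * 16]
CIPHER = cbc_encrypt(KEY, IV, PLAIN)

# Decrypt in an arbitrary order to show there is no chain dependency.
recovered = {i: cbc_decrypt_block(KEY, IV, CIPHER, i) for i in (2, 0, 1)}

# Flip one bit in ciphertext block 0; the same bit flips in plaintext block 1.
tampered = list(CIPHER)
tampered[0] = xor(tampered[0], bytes(12) + b"\x08" + bytes(3))
forged = [cbc_decrypt_block(KEY, IV, tampered, i) for i in range(3)]
```

In `forged`, block 1 differs from the original in exactly the flipped bit, block 0 is garbled, and block 2 is untouched, which is the trade-off the comment describes.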

        • 3) CTR
              CTR(block(i)) = encrypt(block(i)) xor i

          Sorry, but what you describe is not CTR mode. Using your notation, CTR would look (roughly) like this:

              CTR(block(i)) = encrypt(counter) xor block(i)

          where "counter" is usually constructed by concatenating a nonce value with i
          (the block number). It is critical that the resulting counter never be re-used
          with the same key for a different block.
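
A minimal sketch of the corrected CTR definition, assuming an 8-byte nonce concatenated with an 8-byte big-endian block number as the counter. `toy_block_encrypt` is a SHA-256 based stand-in for the block cipher (CTR only ever uses the forward direction), so this illustrates the structure and is not a usable cipher.

```python
import hashlib

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for AES encryption of one 16-byte block. NOT secure."""
    return hashlib.sha256(key + block).digest()[:16]

def ctr_block(key: bytes, nonce: bytes, i: int, data: bytes) -> bytes:
    """CTR for block i: encrypt(nonce || i) xor data.
    The same call encrypts and decrypts."""
    counter = nonce + i.to_bytes(8, "big")
    keystream = toy_block_encrypt(key, counter)
    return bytes(d ^ k for d, k in zip(data, keystream))

KEY, NONCE = b"demo key", b"\x00\x01\x02\x03\x04\x05\x06\x07"
PLAIN = [b"same 16B block!!"] * 3 + [b"another block..."]

in_order = [ctr_block(KEY, NONCE, i, p) for i, p in enumerate(PLAIN)]

# Each block depends only on (key, nonce, i), so any order gives the same
# result; that independence is what makes the mode parallelizable.
out_of_order = {i: ctr_block(KEY, NONCE, i, PLAIN[i])
                for i in reversed(range(len(PLAIN)))}

decrypted = [ctr_block(KEY, NONCE, i, c) for i, c in enumerate(in_order)]
```

The three identical plaintext blocks come out as three different ciphertext blocks because each one is XORed with a different keystream block.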

        • by kasperd ( 592156 )

          CTR(block(i)) = encrypt(block(i)) xor i

          That's not how CTR works. Rather it works like

          CTR(block(i)) = encrypt(IV || i) xor block(i)

          However since most storage encryptions cheat and use an IV that is the same every time you write to the same logical sector, the CTR mode will actually turn into a pseudorandom one-time-pad. This means if you ever write to the same logical sector number twice, you are potentially leaking data. In the case of ecryptfs it is probably only a problem if you overwrite sectors in a
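
The leak from reusing a per-sector IV can be shown in a few lines. This is a sketch under assumptions: the keystream function is a SHA-256 stand-in rather than AES, and the sector number 42 and the two plaintexts are made up for the example.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def keystream(key: bytes, sector_iv: bytes, i: int) -> bytes:
    """Stand-in for encrypt(IV || i); a real system would use AES here."""
    return hashlib.sha256(key + sector_iv + i.to_bytes(8, "big")).digest()[:16]

def ctr_encrypt(key: bytes, sector_iv: bytes, data: bytes) -> bytes:
    out = bytearray()
    for off in range(0, len(data), 16):
        out += xor(data[off:off + 16], keystream(key, sector_iv, off // 16))
    return bytes(out)

KEY = b"demo key"
SECTOR_IV = (42).to_bytes(8, "big")   # same IV on every write to sector 42

old_plain = b"balance: 0000100 credits on file"
new_plain = b"balance: 9999999 credits on file"
old_cipher = ctr_encrypt(KEY, SECTOR_IV, old_plain)
new_cipher = ctr_encrypt(KEY, SECTOR_IV, new_plain)

# The keystream cancels out: an observer of both ciphertexts learns the XOR
# of the two plaintexts without ever touching the key.
leak = xor(old_cipher, new_cipher)

# Anyone who also knows (or guesses) the old contents recovers the new ones.
recovered_new = xor(leak, old_plain)
```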

          • I'm curious, would CTR be less vulnerable if one XORed before encryption? Call the operation CXR.
            Where ^ is the XOR operator
            CXR(block(i)) = encrypt(IV ^ i ^ block(i))

            I'm not sure if there is analysis that can be done on the block at that point that makes this undesirable. Methinks not because as far as I know having a well known IV in, say, CBC is not a vulnerability. That implies to me that the security still rests firmly in the key. At the very least it stops being vulnerable to bitwise changes and reinst

            • by kasperd ( 592156 )

              CXR(block(i)) = encrypt(IV ^ i ^ block(i))

              This is about as secure as ECB, but that's still better than what you get from incorrect use of CTR that degenerates to multiple use of a one-time-pad. What you want is a tweakable block cipher. Just use the block using i as tweak. That is how LRW mode works, with a specific construction for the tweakable block cipher.

              One of the constructions for the tweakable block cipher is encrypt(t ^ encrypt(plaintext)), a more efficient construction (but requires a larger ke
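
A rough sketch of why the proposed CXR stays close to ECB, next to the first tweakable construction mentioned above, `encrypt(t ^ encrypt(plaintext))`. The cipher is a SHA-256 stand-in that models only the forward direction, and the block numbers 3 and 5 are arbitrary. The point is structural: under CXR, two plaintext blocks that differ by exactly `i ^ j` feed the same input to the cipher.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Forward-only stand-in for a 16-byte block cipher. NOT secure."""
    return hashlib.sha256(key + block).digest()[:16]

def as_block(i: int) -> bytes:
    return i.to_bytes(16, "big")

def cxr(key: bytes, iv: bytes, i: int, block: bytes) -> bytes:
    """The proposal from the thread: encrypt(IV ^ i ^ block)."""
    return toy_encrypt(key, xor(xor(iv, as_block(i)), block))

def tweaked(key: bytes, i: int, block: bytes) -> bytes:
    """encrypt(t ^ encrypt(plaintext)) with the block number as tweak t."""
    return toy_encrypt(key, xor(as_block(i), toy_encrypt(key, block)))

KEY, IV = b"demo key", bytes(range(16))
p3 = b"sixteen byte msg"                      # plaintext stored at block 3
p5 = xor(p3, xor(as_block(3), as_block(5)))   # related plaintext at block 5

cxr_collides = cxr(KEY, IV, 3, p3) == cxr(KEY, IV, 5, p5)
tweaked_collides = tweaked(KEY, 3, p3) == tweaked(KEY, 5, p5)
```

Here `cxr_collides` is true by construction, because the XOR with the position cancels before encryption. In the tweaked variant the position is mixed in after a first encryption, so the same trick would require knowing the key.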

    • Re: (Score:2, Informative)

      by Anonymous Coward

      KGPU uses AES just as a demonstration; its architecture is general to any GPU-friendly algorithm.

    • by DarkOx ( 621550 )

      Well I am sure it compares very favorably if you have an old CPU or a CPU of a different architecture which does not feature those instructions.

    • It's for an entirely different application. AES-NI is one application specific set of instructions. While encryption and decryption is an application in which dedicated hardware can have tremendous gains, introducing dozens of application specific hardware modules into a CPU is going to fall to diminishing returns, and just result in an oversized, expensive, and power hungry CPU. It's an inherently limiting design methodology. Introducing GPU access to the kernel opens up a very powerful piece of hardwa

      • by makomk ( 752139 )

        introducing dozens of application specific hardware modules into a CPU is going to fall to diminishing returns, and just result in an oversized, expensive, and power hungry CPU

        More oversized, expensive and power-hungry than the GeForce GTX 480 they used for this benchmark? It's right at the limits of manufacturability in terms of chip size, costs hundreds of dollars, and has a 300W power consumption at load. You'd need an awful lot of application-specific hardware modules before you even got close to that.

  • by Anonymous Coward

            Hand off encryption routines to a closed source black box. Brilliant.

  • Question: (Score:4, Interesting)

    by Jaqenn ( 996058 ) on Friday May 06, 2011 @04:11PM (#36051338)
    (I have never written kernel level code, and the statement that follows is only from listening to what other people are doing)

    I thought that a tiny bit of kernel code reflecting calls into a user level process was old news, and has become established as the preferred development model. Is there a reason that it's undesirable?

    Because the summary makes it sound like we're sad to be following this model, and we're only doing it because we can't pull NVidia's driver source into the linux kernel.
    • by sockman ( 133264 )

      The NVIDIA extensions are only available in userland.
      So a call to the kernel level crypto system gets routed back out to user land, and back to kernel land via the GPU module. That's why we're sad.

      • by sjames ( 1099 )

        What I would like to know is since they're already taking the hit for downcalls into userspace, why not use fuse instead and let the userland filesystem daemon use the GPU. Why produce yet another mechanism to protect the kernel from the weirdness that can happen when it depends on userspace rather than the other way around?

    • I've never written kernel modules either so take this with a grain of salt: my understanding is there is a cost associated with the switching/passing back and forth between userspace and kernelspace and it's best to minimize that. I remember similar discussions going back as far as NT4 when Microsoft decided to implement the entire GDI in kernelspace, which is what led to a billion BSODs because video drivers are notoriously shitty code and you'd be way better off stability-wise having that code run in user
      • by blair1q ( 305137 )

        Context-switching is always expensive, but avoiding it without regard to the actual benefit leads to system bloat, so learning where it is and isn't significant is a good skill to have.

        The speedup from GPU hardware is so big that it's worth giving up a few hundred cycles of context switching to get a few thousand cycles of reduction in computing.

        But (not having read TFA yet) I wonder just how much kernel functionality is really that parallelizable. When does the context switching cost you more than the CUD
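
The trade-off raised here can be put into a back-of-the-envelope model: offloading wins once the per-byte saving outweighs the fixed cost of the round trip. Every number below is a made-up placeholder, not a measurement from KGPU or any real hardware, and the model ignores overlap between copying and computing.

```python
def cpu_time(n_bytes: float, cpu_rate: float) -> float:
    """Seconds to process n_bytes on the CPU at cpu_rate bytes/second."""
    return n_bytes / cpu_rate

def gpu_time(n_bytes: float, gpu_rate: float, bus_rate: float,
             fixed_overhead: float) -> float:
    """Fixed round-trip cost, a copy in each direction, then compute."""
    return fixed_overhead + 2 * n_bytes / bus_rate + n_bytes / gpu_rate

def break_even_bytes(cpu_rate: float, gpu_rate: float, bus_rate: float,
                     fixed_overhead: float) -> float:
    """Request size above which offloading is faster (inf if never)."""
    per_byte_gain = 1 / cpu_rate - 1 / gpu_rate - 2 / bus_rate
    if per_byte_gain <= 0:
        return float("inf")
    return fixed_overhead / per_byte_gain

# Hypothetical figures, for illustration only.
CPU_RATE = 150e6    # bytes/s of cipher throughput on the CPU
GPU_RATE = 4e9      # bytes/s of cipher throughput on the GPU
BUS_RATE = 5e9      # bytes/s across the bus, each direction
OVERHEAD = 50e-6    # seconds of context switches, launch, and wakeups

threshold = break_even_bytes(CPU_RATE, GPU_RATE, BUS_RATE, OVERHEAD)

if __name__ == "__main__":
    for size in (4096, 1 << 20):
        faster = "GPU" if size > threshold else "CPU"
        print(f"{size} bytes: {faster} path is faster in this model")
```

With these placeholder figures a single 4 KiB page stays on the CPU while a 1 MiB batch favors the GPU, which matches the intuition that only large or batched requests are worth the trip.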

        • Crypto stuff relying on gigundous keys would be a no-brainer, but where else could it be economical?

          Maybe RAID computations. Block-level data-deduplication is starting to catch on and that needs to hash every block written to disk. I bet that could benefit from a GPU but the userland overhead may be enough to kill the practicality, at least for anything but long streaming writes.
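
For readers unfamiliar with block-level deduplication, a minimal CPU-only sketch follows. The 4096-byte block size, the choice of SHA-256, and the function names are assumptions for illustration. The per-block hash inside the loop is the part that is independent across blocks and could in principle be batched onto a GPU.

```python
import hashlib

def dedup_store(data: bytes, block_size: int = 4096):
    """Split data into blocks and keep one copy of each distinct block.

    Returns (store, recipe): store maps digest -> block contents, and
    recipe lists the digests in write order so the data can be rebuilt.
    """
    store, recipe = {}, []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        digest = hashlib.sha256(block).digest()  # one hash per block written
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

def restore(store, recipe) -> bytes:
    return b"".join(store[d] for d in recipe)

data = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = dedup_store(data)
```

Four blocks are written but only two distinct ones are stored, and `restore(store, recipe)` reproduces the original bytes.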

    • by Hatta ( 162192 )

      There is overhead in a context switch from kernel space to user space.

    • by afidel ( 530433 )
      The reason it's undesirable is the hit you take when moving back and forth between kernel space and user space. The move in each direction requires the CPU to change ring levels which increases latency.
    • by Anonymous Coward
      Many developers feel that Nvidia's userspace driver workaround, only done to avoid licensing issues, shouldn't be permitted at all. This would be seen as validating Nvidia's actions.

      It's also a giant architectural hack so that won't help matters.
    • by Anonymous Coward

      a tiny bit of kernel code reflecting calls into a user level process

      You mean generally? This could be said of micro-kernels but the Linux kernel is monolithic; drivers for devices typically live entirely inside the kernel.

      That being said I don't think it's necessarily desirable to pull every conceivable hardware interaction into the kernel. There is an endless variety of hardware and APIs. Why must all of this churn live in the kernel? The kernel<-->user-space bridge that was built to make the GPU vendors' user-space API accessible by the kernel isolates the kernel

    • by emanem ( 1356033 )
      I've written kernel code in both OpenGL (GPGPU old school)/OpenCL.
      Main issue might be context switching? Or writing GPU binary code without having to compile via driver (i.e. a la math accelerator FPU?)
    • Re:Question: (Score:5, Interesting)

      by PoochieReds ( 4973 ) <jlayton AT poochiereds DOT net> on Friday May 06, 2011 @05:07PM (#36051836) Homepage

      There are also other concerns than the context switch overhead...particularly when dealing with filesystems or data storage devices.

      For instance, suppose part of your userspace daemon gets swapped out, and you now need to upcall to userspace. That part that got paged out then has to be paged back in. If memory is tight, then the kernel may have to free some memory, and it may decide to flush out dirty data to the filesystem or device that is dependent on the userspace daemon. At that point, you're effectively deadlocked.

      Most of those sorts of problems can be overcome with careful coding and making sure the important parts of the daemon are mlocked, but you do have to be careful and it's not always straightforward to do that.
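
One way a helper daemon might pin its memory, as suggested above, is the POSIX `mlockall` call. The sketch reaches it through `ctypes`; the numeric values of `MCL_CURRENT` and `MCL_FUTURE` are those used by Linux on x86 and ARM and are an assumption to verify against `<sys/mman.h>` elsewhere. The call commonly fails without sufficient privileges or a high enough `RLIMIT_MEMLOCK`, so the function reports failure instead of raising.

```python
import ctypes
import ctypes.util

# Assumed values (Linux on x86/ARM); check <sys/mman.h> on other platforms.
MCL_CURRENT = 1   # lock pages that are mapped right now
MCL_FUTURE = 2    # also lock pages mapped later

def lock_all_memory() -> bool:
    """Try to pin every page of this process in RAM.

    Returns True on success, False if the call is unavailable or refused.
    """
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        return libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0
    except (OSError, AttributeError, TypeError):
        return False

if __name__ == "__main__":
    if lock_all_memory():
        print("memory locked; the daemon will not be paged out")
    else:
        print("could not lock memory; check privileges and RLIMIT_MEMLOCK")
```

Locking memory removes only the page-in path from the deadlock scenario. A daemon that allocates, or blocks on the very filesystem it serves, while handling a request can still stall, which is why the comment calls for careful coding as well.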

  • by Anonymous Coward

    Until they open-source drivers, I refuse to buy them. Stuff like this is typically a nightmare to install and keep running anyway.

    • by blair1q ( 305137 )

      Just what are you using for graphics hardware, then? Intel's integrated core?

      • by jd ( 1658 )

        The Hercules graphics card. :)

      • I only used open source graphics drivers, including Intel's integrated, until about 6 months ago when I needed to run some OpenCL code on a Radeon. There is nothing wrong with Intel graphics and the opensource Radeon drivers, unless you are a gamer or need serious GPGPU power. Both are capable of plenty of 3D, for example molecular modelling in my case.

        I am posting this on a Powerbook running Linux, and for some strange reason AMD does not release binary drivers for PPC Linux ;) but the opensource Radeon

      • Just what are you using for graphics hardware, then? Intel's integrated core?

        Yes, why? I don't play 3D games, so it's fine and stable.

    • by gerddie ( 173963 )
      You might want to rethink your opinion on AMD, they are getting there: http://www.x.org/wiki/RadeonFeature [x.org]
    • by dbIII ( 701233 )
      It looks like the old SGI guys at Nvidia know that as soon as source is released they are going to get jumped on by patent trolls and have to spend a lot of time and money on pointless court cases that can do nothing of value to anyone apart from shifting money into patent troll pockets. They've been bitten once before and the closed drivers are the result.
  • I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.
    • I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.

      Be careful what you say, next we'll have a hockey game break out.

      I'm sorry, but am I the only one here who thinks this is, well, not a good way to go? Even if the code could be kernel-space code on the GPU? I mean, if I buy a CUDA GPU, I'm doing it because I have serious computing I want to do on it, not because I want my file system reads to be faster. I'd be rather miffed if I spent the time writing my CUDA code to speed things up and then found out it wasn't speeding things up because the GPU was alrea

      • They're racking their brains as to what to do next.

        I would aim for kernel threads running directly through CUDA and the Scheduler knowing the performance profile of suitable work for the GPU and the message-passing cost of moving work to the GPU^H^H^H parallelism co-processor. Make the interface right and you should be able to shift tasks across heterogeneous processing units. Do it perfectly and you can have a Linux Virtual Processor model which allows you to start running a task on your desktop, shuffle

      • by tibit ( 1762298 )

        You seem to be seriously overstating the impact of host-based printing. Obviously when you're not printing (and that's probably most of the time!), there's no overhead. And when you are printing, then the rasterizer consumes a little bit of memory and plenty of CPU, but that's transient. I would never venture as far as calling it "consuming" the computer.

        I haven't personally felt it to be a problem, and I'm using a host-based printer (HP LJ P1006). It spits out about 17 pages per minute, not too shabby if y

        • You seem to be seriously overstating the impact of host-based printing.

          Uhhh, no. I was there. Firsthand experience.

          Obviously when you're not printing (and that's probably most of the time!), there's no overhead.

          Other than the half a dozen monitor demons that tell you when there are updates for the drivers, when the printer is out of paper, when the printer is out of ink, when the printer is low on ink and would you like to buy official HP products now?, and whatever other things they had demons doing.

          then the rasterizer consumes a little bit of memory and plenty of CPU,

          The last 200Mb of disk is "a little bit"?

          I haven't personally felt it to be a problem,

          And thus it cannot have been a problem for me. Thanks.

          Having a CPU capable of rasterizing that fast in the printer itself would probably double its cost,

          Wow. A whole $100 for a printer. So the printer could actually be a print

          • by tibit ( 1762298 )

            The monitors and stuff are not a problem inherent in host-based printing. Not at all. For reasons better left to be explained by marketing types, HP's Windows printing support for home printer product line sucks donkey balls. Their support on Linux and Mac doesn't come with any of the overhead.

            So what you're complaining against is not host based printing per se, but broken drivers peddled by HP and others, bundled with bloatware. There's no inherent technical reason for it to be that way. And the problem is

          • by sjames ( 1099 )

            To be fair, most of that crap isn't actually the printer driver, it's the HP marketing trojan combined with REALLY bad design. A sane driver would only check paper and ink just before, during and just after a print job.

  • I wonder if this would be any faster than an implementation that took advantage of the hardware AES on the newer Intel CPUs? Latency should be lower for the CPU based version as would memory bandwidth.
  • by deadline ( 14171 ) on Friday May 06, 2011 @04:34PM (#36051568) Homepage
    Proof of concepts are nice, but when the GPU is firmly planted in the CPU, this will make more sense. The PCI bus can be a bottleneck in these types of situations. AMD fusion is a great example of this idea.
    • by cnettel ( 836611 )
      If you are indeed reading from something like an SSD, the data bandwidth shouldn't be a problem. The data pipe to any recent GPU is much wider than SATA, and quite favorable latency-wise as well. Of course, you are adding another layer of latency and transfers, but the situation is quite different from a case where you are offloading some computation whose data could otherwise stay in the CPU cache all the time.
  • Wow, the fragility of an encrypted file system plus the instability of a GPU, implemented in the kernel. Do not even read TFA without doing a full backup of your system.

  • As someone who's doing a lot of the same work, this is pretty spectacular! I'm surprised they get > 100MB/sec in software - but I guess that's due to using ECB mode vs. CBC. I think the real I/O limit here is probably in the user/kernel mem copies - context switch weight can be optimized with good buffer alignments.

    We did a lot of testing with CUDA under openssl 3-4 years ago - in the end it was better to just stick with software. The latencies are the real killers.

  • Is it a good idea for the protected kernel to rely on unprotected code for critical functions such as filesystem operations? I know that user-space code cannot directly interfere with the kernel, but it also doesn't have to do anything the kernel requests of it. Unless the kernel is designed to treat such user-space code as altogether untrustworthy, it seems to me a bad idea for the kernel to rely on user-space code in this manner.

  • by jasonwc ( 939262 ) on Friday May 06, 2011 @05:44PM (#36052172)

    I hope this is just a proof-of-concept design because ECB mode should not be used for this purpose. Wikipedia provides a pretty obvious example of the weakness of ECB mode:

    "The disadvantage of this method is that identical plaintext blocks are encrypted into identical ciphertext blocks; thus, it does not hide data patterns well. In some senses, it doesn't provide serious message confidentiality, and it is not recommended for use in cryptographic protocols at all. A striking example of the degree to which ECB can leave plaintext data patterns in the ciphertext is shown below; a pixel-map version of the image on the left was encrypted with ECB mode to create the center image, versus a non-ECB mode for the right image."

    http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Initialization_vector_.28IV.29 [wikipedia.org]
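
The effect in the quoted passage can be reproduced without an image. In the sketch below, an "image" is a list of 16-byte all-white or all-black blocks, `toy_encrypt` is a SHA-256 stand-in for AES (forward direction only, not secure), and `pattern` relabels blocks by order of first appearance so repeats are visible. Names and data are illustrative.

```python
import hashlib

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for AES on one 16-byte block. NOT secure."""
    return hashlib.sha256(key + block).digest()[:16]

def ecb(key: bytes, blocks):
    """ECB: every block is encrypted independently with no position input."""
    return [toy_encrypt(key, b) for b in blocks]

def ctr(key: bytes, nonce: bytes, blocks):
    """CTR: block i is XORed with encrypt(nonce || i)."""
    out = []
    for i, b in enumerate(blocks):
        ks = toy_encrypt(key, nonce + i.to_bytes(8, "big"))
        out.append(bytes(x ^ y for x, y in zip(b, ks)))
    return out

def pattern(blocks):
    """Label each block by order of first appearance to expose repeats."""
    seen = {}
    return [seen.setdefault(b, len(seen)) for b in blocks]

WHITE, BLACK = b"\xff" * 16, b"\x00" * 16
IMAGE = [WHITE, WHITE, BLACK, BLACK, WHITE, BLACK, WHITE, WHITE]
KEY, NONCE = b"demo key", bytes(8)

ecb_pattern = pattern(ecb(KEY, IMAGE))
ctr_pattern = pattern(ctr(KEY, NONCE, IMAGE))
```

`ecb_pattern` matches `pattern(IMAGE)` exactly, so the layout of the picture survives encryption, while in `ctr_pattern` every block is distinct.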

    • I hope so too, because I was excited by the idea of using my CUDA-capable GPU to do encryption, which might actually get me to use it. It's barely ticking over providing Compiz functions.

    • And because a picture straight from the horse's mouth is worth a thousand words, here's what NVidia has to say about it:

      http://http.developer.nvidia.com/GPUGems3/gpugems3_ch36.html [nvidia.com]

      Go to 36.5, figure 36-11 & 36-13.

  • it doesn't obscure patterns in your input data. Please take a look at the tux images here; http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Electronic_codebook_.28ECB.29 [wikipedia.org] (it may be faster, but it doesn't f---ing work.)
  • Why not OpenCL? (Score:4, Interesting)

    by gerddie ( 173963 ) on Friday May 06, 2011 @06:09PM (#36052388)
    They should go with OpenCL, then there would be a chance that at one point one can use it with free drivers (and other hardware), but I guess that's the price you pay for a graduate fellowship from NVIDIA.
    • Came here to say this. Why the hell are they writing things in CUDA instead of OpenCL? CUDA is closed and Nvidia-proprietary!

  • by voss ( 52565 ) on Friday May 06, 2011 @06:40PM (#36052640)

    Imagine mysql database GPU accelerated...

    GPU accelerated routers, gpu acceleration of anti-virus software.
    The use of gpus to accelerate search engines.

    • Imagine weather prediction and stock prediction using these. I am surprised that the guys in New York haven't used it already given the massive amount of gold they have in their coffers.
      • by makomk ( 752139 )

        These days, automated stock trading is in fashion, which depends on having really tiny latencies - the exact opposite of what you get from GPU acceleration. I believe companies are experimenting with implementing stock trading algorithms on FPGAs connected directly to network interfaces...

        • Hmm.. I am pretty much in GPU architecture, but here's why I thought it would be great in the stock and weather forecasting.

          1. They involve a lot of matrix multiplications and matrix inversion algorithms, which, from what I heard can be handled nicely by the GPU.

          2. This is a very naive thought, but TFA talked about easy parallelization using GPU. This can be harnessed by the multitude of parallel, machine learning algorithms out there.

          However, after some searching, I came across a white paper (dam

    • by smorken ( 990019 )
      The only time that you want to use a GPU is when your code has a high proportion of numerical operations, and when your problem can be executed in parallel. (modeling, graphics) If this is not the case then using a GPU is not going to speed things up. Code where you are mostly just moving data around with sparse calculations (routers, databases, webservers, AV) is not a good problem for video cards.
    • by nochez ( 1850334 )
      Imagine the day when someone finally implements a GPU accelerated "make me a sandwich" (http://xkcd.com/149/) ... that, would be pure awesomeness.
  • by tyrione ( 134248 ) on Friday May 06, 2011 @11:35PM (#36054334) Homepage
    Instead, one should use OpenCL. It's Platform Agnostic for a reason, but don't let Linux's chance to be hypocritical step in the way.
  • In former times, people made sure you knew they used Slackware, then LFS, then Gentoo, now Ubuntu.

    Distributions are like a penis and religion...

    Anyway, get off my lawn.

  • 4x speedup is nothing. Using the GPU correctly should bring much higher speedups.
    That kind of gain could simply be obtained by optimizing the CPU code.

    • Indeed. It has been my experience that when crypto writers move their libs from C to well optimized x86 assembly language they get at least 2x performance boost.

      These guys are getting 4x, but only on a fairly powerful GTX 480 GPU. How will typical mobile GPUs compare? Probably even slower than the CPU, right? This article makes me sad.
  • There are plenty of architecture-specific vector instruction sets on the CPU that the kernel could be taking advantage of instead; for example SSE and AltiVec for x86 and PPC respectively.
  • For the last ~8 years I've needed extremely fast encryption (and compression) in the project I use. A few years ago when CUDA began to gain traction, I got all excited and actually decided to see what was necessary to make it work and see how fast it was.

    Well at the time, I discovered that CUDA enabled encryption is quite fast. The problem is that copying the data segment to the GPU, doing the encryption and then copying the result back is painful. The copies and setup/interrupt/etc add so much latency that
