Writing Linux Kernel Functions In CUDA With KGPU
An anonymous reader writes "Until today, GPGPU computing was a userspace-only privilege because of NVIDIA's closed-source policy and AMD's semi-open state. KGPU is a workaround that enables Linux kernel functionality to be written in CUDA. Instead of figuring out GPU specs via reverse engineering, it simply uses a userspace helper to do CUDA-related work on behalf of kernelspace requesters. A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions. With a GPU-accelerated AES cipher in the Linux kernel, eCryptfs gets a 3x uncached read speedup and nearly a 4x write speedup on an Intel X25-M 80GB SSD. However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. A CTR (counter mode) cipher may be much more secure, though the real vanilla eCryptfs uses CBC mode. Anyway, GPU vendors should think about opening their drivers and computing libraries, or at least providing a mechanism that makes it easy to do GPU computing inside an OS kernel, given how widely GPUs are deployed and the potential future of heterogeneous operating systems."
AES-NI (Score:1)
Re: (Score:2)
Re: (Score:2)
The problem is that the easily parallelized modes are not as secure as the others. Let me show you the difference between CBC, ECB and CTR (block(i) means the i-th block of data):
1) CBC
CBC(pwd, block(i)) = encrypt(pwd, block(i) xor CBC(pwd, block(i-1)))
* CBC(pwd, block(-1)) = IV = hash(pwd, 0) (sometimes half the password is used as the IV)
2) ECB
ECB(block(i)) = encrypt(block(i))
3) CTR
CTR(block(i)) = encrypt(block(i)) xor i
I hope it's obvious why ECB and CTR are the only candidates for parallelization. CBC
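To make the parallelism concrete, here is a rough CUDA sketch of why ECB (and CTR) map so naturally onto a GPU. toy_encrypt_block is just a placeholder XOR, not real AES, and none of this is KGPU's actual code; it only shows that block i touches nothing but block i:

    #include <stddef.h>
    #include <cuda_runtime.h>

    // Placeholder "cipher": NOT real AES, just a per-byte XOR with the key
    // so the sketch stays self-contained.
    __device__ void toy_encrypt_block(const unsigned char *in, unsigned char *out,
                                      const unsigned char *key /* 16 bytes */)
    {
        for (int b = 0; b < 16; ++b)
            out[b] = in[b] ^ key[b];
    }

    // ECB: ciphertext block i depends only on plaintext block i, so one thread
    // per 16-byte block needs no communication at all. CTR has the same
    // property; CBC encryption does not.
    __global__ void ecb_encrypt(const unsigned char *pt, unsigned char *ct,
                                const unsigned char *key, size_t nblocks)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nblocks)
            toy_encrypt_block(pt + 16 * i, ct + 16 * i, key);
    }

    int main(void)
    {
        const size_t nblocks = 1 << 16;          // 1 MiB of data
        unsigned char *d_pt, *d_ct, *d_key;
        cudaMalloc((void **)&d_pt, 16 * nblocks);
        cudaMalloc((void **)&d_ct, 16 * nblocks);
        cudaMalloc((void **)&d_key, 16);
        // (in real use you would cudaMemcpy the plaintext and key in first)
        ecb_encrypt<<<(unsigned)((nblocks + 255) / 256), 256>>>(d_pt, d_ct, d_key, nblocks);
        cudaDeviceSynchronize();
        cudaFree(d_pt); cudaFree(d_ct); cudaFree(d_key);
        return 0;
    }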
Re: (Score:2)
Just a couple of small nits to pick...
Although CBC encryption needs to be done in sequence, CBC decryption can be done mostly in parallel (you don't have to wait for the AES part of the previous block)...
Also, its security is better than the other modes' only in some cases. As a trivial example, in CBC it's easier to tamper with the plaintext: all you have to do to flip a bit in the plaintext of a CBC-encrypted stream is flip that same bit in the previous block's ciphertext. Although that kills that previou
Re: (Score:2)
3) CTR
CTR(block(i)) = encrypt(block(i)) xor i
Sorry, but what you describe is not CTR mode. Using your notation, CTR would look (roughly) like this:
CTR(block(i)) = encrypt(counter) xor block(i)
where "counter" is usually constructed by concatenating a nonce value with i
(the block number). It is critical that the resulting counter never be re-used
with the same key for a different block).
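In CUDA terms the keystream generation looks roughly like this (kernel only; the host-side launch is the usual boilerplate, and toy_encrypt_block is a placeholder XOR, not real AES). Every thread builds its own counter block from (nonce, i), so nothing depends on the previous block:

    #include <stddef.h>

    // Placeholder "cipher": NOT real AES.
    __device__ void toy_encrypt_block(const unsigned char *in, unsigned char *out,
                                      const unsigned char *key)
    {
        for (int b = 0; b < 16; ++b)
            out[b] = in[b] ^ key[b];
    }

    // CTR: keystream block i is encrypt(nonce || i), so every thread can form
    // its own counter block and XOR independently of all the others.
    __global__ void ctr_crypt(const unsigned char *in, unsigned char *out,
                              const unsigned char *key,
                              unsigned long long nonce, size_t nblocks)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nblocks)
            return;

        unsigned char ctr[16], ks[16];
        for (int b = 0; b < 8; ++b) {
            ctr[b]     = (unsigned char)(nonce >> (8 * b));                   // nonce half
            ctr[8 + b] = (unsigned char)((unsigned long long)i >> (8 * b));   // counter half
        }
        toy_encrypt_block(ctr, ks, key);
        for (int b = 0; b < 16; ++b)
            out[16 * i + b] = in[16 * i + b] ^ ks[b];   // same code encrypts and decrypts
    }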
Re: (Score:3)
That's not how CTR works. Rather, it works like
CTR(block(i)) = encrypt(IV || i) xor block(i)
However, since most storage encryption schemes cheat and use an IV that is the same every time you write to the same logical sector, CTR mode effectively turns into a fixed pseudorandom pad for that sector. This means that if you ever write to the same logical sector number twice, you are potentially leaking data. In the case of eCryptfs it is probably only a problem if you overwrite sectors in a
Re: (Score:3)
I'm curious, would CTR be less vulnerable if one XORed before encryption? Call the operation CXR.
Where ^ is the XOR operator
CXR(block(i)) = encrypt(IV ^ i ^ block(i))
I'm not sure if there is analysis that can be done on the block at that point that makes this undesirable. Methinks not because as far as I know having a well known IV in, say, CBC is not a vulnerability. That implies to me that the security still rests firmly in the key. At the very least it stops being vulnerable to bitwise changes and reinst
Re: (Score:3)
This is about as secure as ECB, but that's still better than what you get from incorrect use of CTR, which degenerates into multiple use of a one-time pad. What you want is a tweakable block cipher: just use the block number i as the tweak. That is how LRW mode works, with a specific construction for the tweakable block cipher.
One of the constructions for the tweakable block cipher is encrypt(t ^ encrypt(plaintext)), a more efficient construction (but requires a larger ke
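A toy sketch of that first construction, encrypt(t ^ encrypt(plaintext)), using the block number i as the tweak (kernel only; toy_encrypt_block is a placeholder XOR, not real AES):

    #include <stddef.h>

    // Placeholder "cipher": NOT real AES.
    __device__ void toy_encrypt_block(const unsigned char *in, unsigned char *out,
                                      const unsigned char *key)
    {
        for (int b = 0; b < 16; ++b)
            out[b] = in[b] ^ key[b];
    }

    // Tweakable construction: C(i) = encrypt(tweak(i) ^ encrypt(P(i))).
    // Still fully parallel, but identical plaintext blocks at different
    // positions no longer produce identical ciphertext.
    __global__ void tweak_encrypt(const unsigned char *pt, unsigned char *ct,
                                  const unsigned char *key, size_t nblocks)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nblocks)
            return;

        unsigned char inner[16];
        toy_encrypt_block(pt + 16 * i, inner, key);                            // inner encryption
        for (int b = 0; b < 8; ++b)
            inner[b] ^= (unsigned char)((unsigned long long)i >> (8 * b));     // XOR in the tweak
        toy_encrypt_block(inner, ct + 16 * i, key);                            // outer encryption
    }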
Re: (Score:3)
Re: (Score:2)
They were comparing cycles per byte, not total run time, so the difference between CPU generations is less important. But the rest of your argument is still quite valid.
Re: (Score:3)
Re: (Score:2, Informative)
KGPU uses AES just as a demonstration; its architecture is general to any GPU-friendly algorithm.
Re: (Score:2)
Well, I am sure it compares very favorably if you have an old CPU or a CPU of a different architecture that does not feature those instructions.
Re: (Score:3)
It's for an entirely different application. AES-NI is one application-specific set of instructions. While encryption and decryption are an application in which dedicated hardware can bring tremendous gains, introducing dozens of application-specific hardware modules into a CPU is going to run into diminishing returns and just result in an oversized, expensive, and power-hungry CPU. It's an inherently limiting design methodology. Introducing GPU access to the kernel opens up a very powerful piece of hardwa
Re: (Score:2)
introducing dozens of application-specific hardware modules into a CPU is going to run into diminishing returns and just result in an oversized, expensive, and power-hungry CPU
More oversized, expensive and power-hungry than the GeForce GTX 480 they used for this benchmark? It's right at the limits of manufacturability in terms of chip size, costs hundreds of dollars, and draws 300W at load. You'd need an awful lot of application-specific hardware modules before you even got close to that.
Re:Did anyone else's brain switch off half way.. (Score:5, Informative)
GTFO!
This is what should be on Slashdot, not stories about the latest iPhone.
Re: (Score:1)
Re: (Score:2)
Re: (Score:1)
Completely off-topic, but I've been looking for a decent ssh client for my crapberry -- thanks!
Re: (Score:2)
Best possible example (Score:2, Interesting)
Hand off encryption routines to a closed source black box. Brilliant.
Re: (Score:3)
Yes, because the CPU isn't a closed-source black box too; we're all running open hardware. /s
Re:Best possible example (Score:5, Insightful)
Re: (Score:1)
Good point.
In fact, Intel CPUs are worse in this regard, as they contain special AES instructions. GPUs, as far as I know, don't do this yet, so you'll now have a higher level of confidence that the correct code is indeed running.
Re: (Score:2)
Yes, and those AES instructions are well documented [scribd.com].
Re: (Score:2)
How can you be sure that what's going on on the processor is the same thing as what's described in the documentation?
Question: (Score:4, Interesting)
I thought that a tiny bit of kernel code reflecting calls into a user level process was old news and had become established as the preferred development model. Is there a reason that it's undesirable?
Because the summary makes it sound like we're sad to be following this model, and we're only doing it because we can't pull NVIDIA's driver source into the Linux kernel.
Re: (Score:2)
The NVIDIA extensions are only available in userland.
So a call to the kernel-level crypto system gets routed back out to userland, and back to kernel land via the GPU module. That's why we're sad.
Re: (Score:2)
What I would like to know is, since they're already taking the hit for downcalls into userspace, why not use FUSE instead and let the userland filesystem daemon use the GPU? Why produce yet another mechanism to protect the kernel from the weirdness that can happen when it depends on userspace rather than the other way around?
Re: (Score:2)
Re: (Score:2)
Context-switching is always expensive, but avoiding it without regard to the actual benefit leads to system bloat, so learning where it is and isn't significant is a good skill to have.
The speedup from GPU hardware is so big that it's worth giving up a few hundred cycles of context switching to get a few thousand cycles of reduction in computing.
But (not having read TFA yet) I wonder just how much kernel functionality is really that parallelizable. When does the context switching cost you more than the CUD
Re: (Score:2)
Crypto stuff relying on gigundous keys would be a no-brainer, but where else could it be economical?
Maybe RAID computations. Block-level data deduplication is starting to catch on, and that needs to hash every block written to disk. I bet that could benefit from a GPU, but the userland overhead may be enough to kill the practicality, at least for anything but long streaming writes.
Re: (Score:2)
There is overhead in a context switch from kernel space to user space.
Re: (Score:3)
Re: (Score:1)
It's also a giant architectural hack so that won't help matters.
Re: (Score:1)
a tiny bit of kernel code reflecting calls into a user level process
You mean generally? This could be said of microkernels, but the Linux kernel is monolithic; drivers for devices typically live entirely inside the kernel.
That being said, I don't think it's necessarily desirable to pull every conceivable hardware interaction into the kernel. There is an endless variety of hardware and APIs. Why must all of this churn live in the kernel? The kernel<-->userspace bridge that was built to make the GPU vendors' userspace API accessible by the kernel isolates the kernel
Re: (Score:1)
Main issue might be context switching? Or writing GPU binary code without having to compile via driver (i.e. a la math accelerator FPU?)
Cheers!
Re:Question: (Score:5, Interesting)
There are also concerns other than the context-switch overhead... particularly when dealing with filesystems or data storage devices.
For instance, suppose part of your userspace daemon gets swapped out, and you now need to upcall to userspace. The part that got paged out then has to be paged back in. If memory is tight, the kernel may have to free some memory, and it may decide to flush out dirty data to the filesystem or device that depends on the userspace daemon. At that point, you're effectively deadlocked.
Most of those sorts of problems can be overcome with careful coding and making sure the important parts of the daemon are mlocked, but you do have to be careful and it's not always straightforward to do that.
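For example, something along these lines at the top of the userspace helper (hypothetical host-side code, not the actual KGPU daemon):

    /* Hypothetical skeleton of a userspace GPU helper: lock everything into RAM
     * so the kernel's upcall path can never stall waiting for this process to
     * be paged back in. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* ... initialize CUDA, then loop servicing requests from the kernel ... */
        return 0;
    }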
F*ck Nvidia AND AMD (Score:1)
Until they open-source their drivers, I refuse to buy them. Stuff like this is typically a nightmare to install and keep running anyway.
Re: (Score:2)
Just what are you using for graphics hardware, then? Intel's integrated core?
Re: (Score:2)
Uphill, both ways?
Re: (Score:2)
The Hercules graphics card. :)
Re: (Score:2)
I used only open-source graphics drivers, including Intel's integrated, until about 6 months ago, when I needed to run some OpenCL code on a Radeon. There is nothing wrong with Intel graphics and the open-source Radeon drivers unless you are a gamer or need serious GPGPU power. Both are capable of plenty of 3D, for example molecular modelling in my case.
I am posting this on a PowerBook running Linux, and for some strange reason AMD does not release binary drivers for PPC Linux ;) but the open-source Radeon
Re: (Score:2)
Just what are you using for graphics hardware, then? Intel's integrated core?
Yes, why? I don't play 3D games, so it's fine and stable.
Re: (Score:2)
Re: (Score:2)
Wow (Score:2)
Re: (Score:2)
I came to read a discussion of writing kernel functions in CUDA and a discussion about the vagaries of encryption methods broke out.
Be careful what you say, next we'll have a hockey game break out.
I'm sorry, but am I the only one here who thinks this is, well, not a good way to go? Even if the code could be kernel-space code on the GPU? I mean, if I buy a CUDA GPU, I'm doing it because I have serious computing I want to do on it, not because I want my file system reads to be faster. I'd be rather miffed if I spent the time writing my CUDA code to speed things up and then found out it wasn't speeding things up because the GPU was alrea
Re: (Score:1)
They're racking their brains as to what to do next.
I would aim for kernel threads running directly through CUDA, and the scheduler knowing the performance profile of work suitable for the GPU and the message-passing cost of moving work to the GPU^H^H^H parallelism co-processor. Get the interface right and you should be able to shift tasks across heterogeneous processing units. Do it perfectly and you could have a Linux Virtual Processor model which allows you to start running a task on your desktop, shuffle
Re: (Score:2)
You seem to be seriously overstating the impact of host-based printing. Obviously when you're not printing (and that's probably most of the time!), there's no overhead. And when you are printing, then the rasterizer consumes a little bit of memory and plenty of CPU, but that's transient. I would never venture as far as calling it "consuming" the computer.
I haven't personally felt it to be a problem, and I'm using a host-based printer (HP LJ P1006). It spits out about 17 pages per minute, not too shabby if y
Re: (Score:2)
You seem to be seriously overstating the impact of host-based printing.
Uhhh, no. I was there. Firsthand experience.
Obviously when you're not printing (and that's probably most of the time!), there's no overhead.
Other than the half a dozen monitor demons that tell you when there are updates for the drivers, when the printer is out of paper, when the printer is out of ink, when the printer is low on ink and would you like to buy official HP products now?, and whatever other things they had demons doing.
then the rasterizer consumes a little bit of memory and plenty of CPU,
The last 200MB of disk is "a little bit"?
I haven't personally felt it to be a problem,
And thus it cannot have been a problem for me. Thanks.
Having a CPU capable of rasterizing that fast in the printer itself would probably double its cost,
Wow. A whole $100 for a printer. So the printer could actually be a print
Re: (Score:2)
The monitors and stuff are not a problem inherent in host-based printing. Not at all. For reasons better left to be explained by marketing types, HP's Windows printing support for its home printer product line sucks donkey balls. Their support on Linux and Mac doesn't come with any of that overhead.
So what you're complaining about is not host-based printing per se, but broken drivers peddled by HP and others, bundled with bloatware. There's no inherent technical reason for it to be that way. And the problem is
Re: (Score:2)
To be fair, most of that crap isn't actually the printer driver, it's the HP marketing trojan combined with REALLY bad design. A sane driver would only check paper and ink just before, during and just after a print job.
AES speed (Score:2)
Re: (Score:2)
All in good time (Score:3)
Re: (Score:2)
Recipe for a corrupted filesystem (Score:1)
Wow, the fragility of an encrypted file system plus the instability of a GPU, implemented in the kernel. Do not even read TFA without doing a full backup of your system.
Re: (Score:3)
fragility of an encrypted file system [citation needed].
I've been using them since 2006. Never had any problems.
Cool test... (Score:2)
As someone who's doing a lot of the same work, this is pretty spectacular! I'm surprised they get > 100MB/sec in software - but I guess that's due to using ECB mode vs. CBC. I think the real I/O limit here is probably in the user/kernel memory copies - context-switch weight can be optimized with good buffer alignments.
We did a lot of testing with CUDA under OpenSSL 3-4 years ago - in the end it was better to just stick with software. The latencies are the real killers.
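For what it's worth, the usual CUDA-side tricks for trimming that latency are page-locked host buffers and asynchronous copies on a stream; a minimal sketch under those assumptions (not KGPU's or OpenSSL's actual code):

    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t bytes = 4 << 20;            // one 4 MB batch of requests
        unsigned char *h_buf, *d_buf;
        cudaStream_t stream;

        cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocDefault); // page-locked host memory
        cudaMalloc((void **)&d_buf, bytes);
        cudaStreamCreate(&stream);

        /* ... fill h_buf with a batch of blocks to encrypt ... */

        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        /* ... launch the cipher kernel on the same stream here ... */
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }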
Re: (Score:3)
That's a pretty cool project! But I do think they still suffer the same latency problems: in order to take advantage of the GPU's full throughput, they have to have a huge number of client connections (their chosen solution) or a very deep queue (hard to optimize; only works with larger file sizes).
Certainly this is a great solution for what it is, but it's not a general-purpose solution. And you can get much more reliable and better-supported solutions out there (e.g. the BIG-IP SSL Accelerator, which uses certif
Protection (Score:2)
Is it a good idea for the protected kernel to rely on unprotected code for critical functions such as filesystem operations? I know that user-space code cannot directly interfere with the kernel, but it also doesn't have to do anything the kernel requests of it. Unless the kernel is designed to treat such user-space code as altogether untrustworthy, it seems to me a bad idea for the kernel to rely on user-space code in this manner.
ECB Mode is totally insecure (Score:4, Interesting)
I hope this is just a proof-of-concept design because ECB mode should not be used for this purpose. Wikipedia provides a pretty obvious example of the weakness of ECB mode:
"The disadvantage of this method is that identical plaintext blocks are encrypted into identical ciphertext blocks; thus, it does not hide data patterns well. In some senses, it doesn't provide serious message confidentiality, and it is not recommended for use in cryptographic protocols at all. A striking example of the degree to which ECB can leave plaintext data patterns in the ciphertext is shown below; a pixel-map version of the image on the left was encrypted with ECB mode to create the center image, versus a non-ECB mode for the right image."
http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Initialization_vector_.28IV.29 [wikipedia.org]
Re: (Score:2)
I hope so too, because I was excited by the idea of using my CUDA-capable GPU to do encryption, which might actually get me to use it. It's barely ticking over providing Compiz functions.
Re: (Score:2)
According to the summary, the GPU enhanced version uses ECB:
"A demo in its current source repository is a modified eCryptfs, which is an encrypted filesystem used by Ubuntu and other distributions . . . .However, both the GPU cipher-based eCryptfs and the CPU cipher-based one are changed to use ECB cipher mode for parallelism. "
Re: (Score:1)
Writing parallel code is difficult. Writing parallel code which makes sense is even harder. Actually, if you have a quad-core CPU and do ECB instead of CBC, you can manage a 4x increase in performance ... no need to use a GPU!
(The reason is that ECB encryptions can be done in parallel, as each block is independent; for CBC you need to know the encryption of block i-1 in order to produce that of block i.)
A counter mode (CTR) might make sense for eCryptfs, but the security analysis is definitely non-tri
Re: (Score:2)
And because a picture straight from the horse's mouth is worth a thousand words, here's what NVidia has to say about it:
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch36.html [nvidia.com]
Go to 36.5, figure 36-11 & 36-13.
Do NOT use ECB mode (Score:1)
Why not OpenCL? (Score:4, Interesting)
Re: (Score:2)
Re: (Score:3)
Came here to say this. Why the hell are they writing things in CUDA instead of OpenCL? CUDA is closed and Nvidia-proprietary!
Encryption is not the main beneficiary (Score:3)
Imagine a GPU-accelerated MySQL database...
GPU-accelerated routers, GPU acceleration of anti-virus software.
The use of GPUs to accelerate search engines.
Re: (Score:1)
Re: (Score:2)
These days, automated stock trading is in fashion, which depends on having really tiny latencies - the exact opposite of what you get from GPU acceleration. I believe companies are experimenting with implementing stock trading algorithms on FPGAs connected directly to network interfaces...
Re: (Score:1)
Hmm.. I am pretty much in GPU architecture, but here's why I thought it would be great in the stock and weather forecasting.
1. They involve a lot of matrix multiplications and matrix inversion algorithms, which, from what I've heard, can be handled nicely by the GPU.
2. This is a very naive thought, but TFA talked about easy parallelization using the GPU. This could be harnessed by the multitude of parallel machine-learning algorithms out there.
However, after some searching, I came across a white paper (dam
Re: (Score:1)
"I am pretty much ignorant* in GPU architecture"
Fixed.
Re: (Score:1)
Re: (Score:1)
CUDA? That makes zero sense (Score:3)
I like the random reference to Ubuntu (Score:2)
In former times, people made sure you knew they used Slackware, then LFS, then Gentoo, now Ubuntu.
Distributions are like a penis and religion...
Anyway, get off my lawn.
4x speedup is nothing (Score:2)
4x speedup is nothing. Using the GPU correctly should bring much higher speedups.
That kind of gain could simply be obtained by optimizing the CPU code.
Re: (Score:2)
These guys are getting 4x, but only on a fairly powerful GTX 480 GPU. How will typical mobile GPUs compare? Probably even slower than the CPU, right? This article makes me sad.
SSE , AltiVec (Score:2)
Re: (Score:2)
open CUDA or give up. (Score:2)
For the last ~8 years I've needed extremely fast encryption (and compression) in the project I use. A few years ago, when CUDA began to gain traction, I got all excited and actually decided to see what was necessary to make it work and how fast it was.
Well, at the time, I discovered that CUDA-enabled encryption is quite fast. The problem is that copying the data segment to the GPU, doing the encryption, and then copying the result back is painful. The copies and setup/interrupts/etc. add so much latency that
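The usual mitigation nowadays is pinned buffers plus several streams, so the copy-in, cipher kernel, and copy-out for different chunks overlap instead of running back to back. A rough sketch with a trivial stand-in kernel (not tied to any real project):

    #include <cuda_runtime.h>

    // Trivial stand-in for the real cipher kernel.
    __global__ void xor_bytes(unsigned char *buf, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            buf[i] ^= 0xAA;
    }

    int main(void)
    {
        const int nstreams = 4;
        const size_t chunk = 1 << 20;            // 1 MB per stream
        unsigned char *h, *d;
        cudaStream_t s[nstreams];

        cudaHostAlloc((void **)&h, nstreams * chunk, cudaHostAllocDefault); // pinned
        cudaMalloc((void **)&d, nstreams * chunk);
        for (int k = 0; k < nstreams; ++k)
            cudaStreamCreate(&s[k]);

        // Each stream's copy-in, kernel, and copy-out can overlap with the
        // others', so the PCIe transfers stop serializing against the compute.
        for (int k = 0; k < nstreams; ++k) {
            size_t off = (size_t)k * chunk;
            cudaMemcpyAsync(d + off, h + off, chunk, cudaMemcpyHostToDevice, s[k]);
            xor_bytes<<<(unsigned)((chunk + 255) / 256), 256, 0, s[k]>>>(d + off, chunk);
            cudaMemcpyAsync(h + off, d + off, chunk, cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();

        for (int k = 0; k < nstreams; ++k)
            cudaStreamDestroy(s[k]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }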