Category Archives: gpu

Particle emitter (3)

Seems like I’m done with a proof-of-concept implementation of an OpenCL+OpenGL particle collision tester 🙂

For the curious, white line segments represent the direction in which a collision would have been considered for the frame (without necessarily affecting the particle movement). Blue lines represent velocity vectors (unnormalized).

Particle emitter (1)

Experimenting a little bit with particle emitters..

The particles you can see above bounce off an infinite plane and a sphere placed at the origin of the scene. This is simple Transform Feedback at work – it costs literally nothing (as the numbers tell) and the renderer is capped at 60 FPS.
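For readers who want to play with the collision response on the CPU, here is a minimal sketch in plain Python of what one simulation step could look like. It is not the Transform Feedback shader itself; the plane height, sphere radius, restitution factor and the `step`/`reflect` names are all assumptions made for illustration.

```python
import math

def reflect(vel, normal, restitution=0.8):
    """Bounce velocity off a surface with the given unit normal (hypothetical helper)."""
    d = sum(v * n for v, n in zip(vel, normal))
    if d >= 0.0:
        return vel  # already moving away from the surface
    return tuple(v - (1.0 + restitution) * d * n for v, n in zip(vel, normal))

def step(pos, vel, dt, gravity=(0.0, -9.81, 0.0),
         plane_height=-1.0, sphere_radius=1.0, restitution=0.8):
    """Advance one particle by dt; plane_height and sphere_radius are assumed values."""
    vel = tuple(v + g * dt for v, g in zip(vel, gravity))
    pos = tuple(p + v * dt for p, v in zip(pos, vel))

    # Infinite horizontal plane with a +Y normal.
    if pos[1] < plane_height:
        pos = (pos[0], plane_height, pos[2])
        vel = reflect(vel, (0.0, 1.0, 0.0), restitution)

    # Sphere centred at the origin: push the particle back to the surface.
    dist = math.sqrt(sum(p * p for p in pos))
    if 0.0 < dist < sphere_radius:
        normal = tuple(p / dist for p in pos)
        pos = tuple(n * sphere_radius for n in normal)
        vel = reflect(vel, normal, restitution)
    return pos, vel

# A particle falling straight onto the top of the sphere (gravity disabled for clarity).
new_pos, new_vel = step((0.0, 1.05, 0.0), (0.0, -10.0, 0.0), 0.01,
                        gravity=(0.0, 0.0, 0.0))
```

In a Transform Feedback setup the same per-particle logic would live in a vertex shader, with positions and velocities captured into buffers for the next frame.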

For the next stop, I’m planning to have the particles bounce over a mesh represented with a Kd tree.

Spherical Harmonics (4): Battle-field report

Finally got my GPU-based Kd-tree traversal implementation to work! With the tool at hand, there’s nothing stopping me from calculating visibility for model vertices in an off-line manner, converting the data to SH representation and then using it for AO calculation within my engine 🙂 Unfortunately, the GTX 570 is far too slow to do the calculations in real-time: a rather typical test scene presented below, consisting of ~25k vertices, takes about 1 minute to process (with 2500 rays shot from every point). I estimate that optimisation could cut that by roughly 50%, which leaves us with 30 seconds needed for precalculations; okay for a demo loader, but nowhere near real-time. I’m using OpenCL for the purpose. Sticking to a shader-based approach would probably have been a bit faster, but I’d lose the neat extensibility I can afford when using CL.
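To give an idea of what "converting visibility to SH representation" means for a single vertex, here is a rough CPU sketch in Python. It is not my OpenCL kernel: it only projects onto the first two SH bands (4 coefficients), the 50x50 stratified direction grid is an assumption chosen to match the 2500 rays mentioned above, and `is_visible` is a hypothetical stand-in for the actual Kd-tree ray cast.

```python
import math

# Real SH basis constants for bands 0 and 1.
Y00 = 0.5 * math.sqrt(1.0 / math.pi)
Y1 = math.sqrt(3.0 / (4.0 * math.pi))

def sh_basis(d):
    """Evaluate the first four real SH basis functions for unit direction d."""
    x, y, z = d
    return (Y00, Y1 * y, Y1 * z, Y1 * x)

def sphere_directions(grid=50):
    """Yield grid*grid directions spread uniformly (by area) over the unit sphere."""
    for i in range(grid):
        z = 1.0 - 2.0 * (i + 0.5) / grid
        r = math.sqrt(max(0.0, 1.0 - z * z))
        for j in range(grid):
            phi = 2.0 * math.pi * (j + 0.5) / grid
            yield (r * math.cos(phi), r * math.sin(phi), z)

def project_visibility(is_visible, grid=50):
    """Project a binary visibility function onto 4 SH coefficients.

    is_visible(direction) -> bool is a placeholder for a ray cast
    against the scene from the vertex being processed.
    """
    coeffs = [0.0] * 4
    count = 0
    for d in sphere_directions(grid):
        count += 1
        if is_visible(d):
            for k, y in enumerate(sh_basis(d)):
                coeffs[k] += y
    weight = 4.0 * math.pi / count  # each sample covers an equal share of the sphere
    return [c * weight for c in coeffs]

# Two toy cases: nothing occludes the vertex, and only the +Z hemisphere is open.
open_sky = project_visibility(lambda d: True)
upper_half = project_visibility(lambda d: d[2] > 0.0)
```

Running this for ~25k vertices is exactly the kind of embarrassingly parallel workload that maps well to one work-item per vertex (or per ray) on the GPU.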

Compared to my previous engine (which laid the groundwork for Three Fourth, Futuricon and Suxx), the scene importing layer in Emerald currently lacks nearly everything 😉 Having finished the visibility precalc work, I can now move on to support for indirect bounces, so expect some more interesting screenshots in a week or two!




Spherical Harmonics: Fooling around (1)

I’m currently doing some experiments with Spherical Harmonics lighting, with all sample coefficients calculated (or rather: pre-calculated, as it – sadly! – turns out) on the GPU. I’ll be posting some screenshots over the next couple of days, until I get fed up with the technique and decide to move on 🙂
The screenshots will be posted without any post-processing applied, so please forgive the lack of any proper anti-aliasing / color grading / tone mapping / what-not. You have been warned.

(..and if you ask, yes: I’m aware of various artifacts in the shots.. 🙂 )

Tips and tricks: OpenGL context sharing

I’m still polishing the CamMuxer software for WeCan 2012. Up to now, data buffers reported by DirectShow graphs running intermittently were cached in the process space and then passed to VRAM by means of glTexSubImage2D() calls, executed from within the main rendering pipeline handler. This was far from perfect, for at least the following reasons:

  • it introduced unnecessary lag into the main rendering pipeline – this might seem like a minor issue but there is a certain pitfall here. CamMuxer supports showing a slider (containing up to two lines of text) which slides in and out in a smoothstep manner. Any lag introduced into the pipeline delays the buffer swap operation, and hence causes clearly noticeable jerkiness we do not want.
  • it stalled the pipeline – glTex*Image*() calls are synchronous (unless you operate on buffer objects bound to pixel unpack targets instead of client-space pointers, in which case the operation can take place without crossing GPU borders. This unfortunately is not the case in my software 🙂 )

The next step, which I’ve been working on for the last couple of days, was rearranging things a bit by introducing worker threads. Each camera stream is assigned one worker thread. The thread creates an invisible window and binds a new OpenGL context to the window. The new context shares its namespace with the main context. This allows us to move the hectic glTexImage*() calls to separate per-camera threads, effectively off-loading the main rendering pipeline and – in the end – providing a smoother watching experience.

I’ve already done some work with context sharing during my recent commercial activities, so I knew most of the wiles OpenGL could throw at me. However, there was one thing that caught me off-guard and that I thought was worth putting up on the blog:

If you happen to work with multiple contexts running in multiple threads, and those threads run in parallel to render off-screen textures using a shared program object, do make sure you serialize the process! No, really: double-check.

The not-so-obvious catch in this case is that context sharing does not really mean you can access all OpenGL objects in all contexts that use a common namespace. There are some object types that are not shared (like FBOs, for instance), no matter how hard you try. Right, so far so good.
What I had forgotten is that program objects (which are shared) also store the values of their uniforms. Before running a program, you often update at least one of those values. Boom! Good luck if you happen to be doing that from within multiple contexts running in multiple threads 🙂 Wrapping the program execution routines in process-wide critical sections did the trick for me.

SITUATION UPDATE: Turns out wrapping all the code using a program object inside a CS does not solve the problem, and neither do forced glFlush() and glFinish() calls. Looks like a program object must explicitly be created in and used within a single context. If you don’t play ball, you get nasty uniform value leakages that appear to be caused by inexplicable forces. This might be NVIDIA-specific; I don’t recall seeing an explanation for this kind of behaviour in the GL 3.3 core profile specification 🙁
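If the hazard sounds abstract, here is a toy model in plain Python. It makes no OpenGL calls at all: `Program`, `set_uniform`, `draw` and the "tint" uniform are made-up names that only mimic the fact that uniform values live inside the program object. It replays one unlucky interleaving of two per-camera threads, and does not attempt to explain why the critical section was not enough on my driver.

```python
class Program:
    """Toy stand-in for a GL program object: uniform values are stored inside it."""

    def __init__(self):
        self.uniforms = {}

    def set_uniform(self, name, value):
        self.uniforms[name] = value

    def draw(self):
        # A draw call uses whatever the uniforms hold at that very moment.
        return self.uniforms["tint"]


def render_interleaved(program_a, program_b):
    """Replay one unlucky schedule: B updates its uniform before A gets to draw."""
    program_a.set_uniform("tint", "camera A")  # thread A prepares its draw
    program_b.set_uniform("tint", "camera B")  # thread B is scheduled in between
    return program_a.draw(), program_b.draw()


# One program object shared by both contexts: A ends up drawing with B's value.
shared = Program()
leaked = render_interleaved(shared, shared)

# One program object per context: each draw sees its own uniform value.
isolated = render_interleaved(Program(), Program())

print(leaked)    # ('camera B', 'camera B')
print(isolated)  # ('camera A', 'camera B')
```

The second variant is what I ended up with: every worker thread builds its own program object in its own context, at the cost of compiling the same shaders more than once.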

CamMuxer – still on it!

I’m continuing to work on the software for muxing images from various sources into one output using the GPU, adding some additional image processing along the way to make use of the horsepower my video card continues to offer.

What I’ve finished plugging in this evening is per-camera feed content equalization. What this means is that each texture update triggers a histogram calculation, followed by a summed-area table calculation for the single-dimensional histogram texture generated in the previous pass, ending with the actual linearization of the texture.
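The three passes map onto textbook histogram equalization, so here is a small CPU sketch in Python of the same idea for a single-channel image. This is not the shader code: the `equalize` name, the flat pixel list and the exact remapping formula are assumptions made for illustration, and the GPU version works on textures rather than lists.

```python
def equalize(pixels, levels=256):
    """Histogram-equalize a flat list of integer pixel values in [0, levels)."""
    if not pixels:
        return []

    # Pass 1: histogram of the input values.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Pass 2: running sum of the histogram (a summed-area table in one dimension).
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    # Pass 3: remap every value so that the cumulative distribution becomes linear.
    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return list(pixels)  # single-valued image, nothing to stretch
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [lut[p] for p in pixels]


# A low-contrast 4-level example: values 1 and 2 get stretched to the full range.
stretched = equalize([1, 1, 1, 2], levels=4)
```

On the GPU the third pass is just a texture lookup into the summed histogram, which is why the whole chain stays cheap even when it runs on every texture update.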

In terms of bus usage, that’s at least 30+30+15 frames per second, which gives roughly 33.5MB * 3 = 100.5MB pumped into VRAM per second, with each frame being equalized right afterward, converted to YUV and then pumped back over the bus to RAM and sent to another process for streaming.

Statistics? DPCs never exceed 15%, CPU usage stays at 40% (quad-core) and GPU remains idle 60% of the time.

It’s just incredible how powerful today’s desktops are!