Having rewritten all the low-level parts of the engine with a completely new approach (details available at beer gardens at demo-scene parties I’ll be attending in the near future 🙂 ), I finally started working on a new set of effects I intend to use in Elude‘s next PC demo.
My current area of interest is a GPU-only, Spherical Harmonics-based lighting implementation. Since the approach more or less boils down to approximating a gazillion+1 integrals, the pre-calc shaders tend to run in long loops. The problem I ran into today was that, after issuing a transform feedback-enabled glDrawArrays() call with the “rasterizer discard” mode enabled, the user-mode layer of the OpenGL implementation started sucking in lots of memory (>1 GB and increasing..) and blocked on the call. This is quite unusual, given the rather asynchronous nature of NVIDIA’s OpenGL implementation, which tends to enqueue incoming commands in the command buffer and return control to the caller in an instant!
Let me interrupt at this point and shed some light on a different aspect that made the problem even weirder: Emerald stashes GL program blobs on the hard drive after linking finishes successfully, so that future linking requests will not (theoretically) require full recompilation of the attached shaders – assuming their hashes have not changed, of course, in which case the engine goes the usual path.
Coming back to the problem. I played around with the shader and tried commenting out various bits and pieces, checking how the modifications affected the driver’s behaviour. It wasn’t long before I discovered the culprit was the large number of iterations I was doing in the loop. One way around this would be to rework the approach and break the loop down into multiple passes, but that didn’t sound like a good idea. I expect the implementation to be rather demanding in terms of hardware, and introducing CPU interventions where they are not strictly necessary is something I’d prefer to avoid.
After doing some hunting on them nets, I found a solution that appears to be NVIDIA-only. It works for me, but I have no AMD hardware at hand to check whether they also support the pragma that altered the compiler’s behaviour, so I can’t tell if it’s portable – I can’t see any OpenGL extension describing the feature, which is probably a bad omen..
Turns out that what needs to be done is to use:
#pragma optionNV(unroll none)
right before the loop in question. It prevents the shader compiler from endlessly trying to unroll the loop, forcing it into the “Just don’t!” line of thinking. The default behaviour has a very large OOM potential, and I’m really curious to learn why NVIDIA decided to go with such deep unrolls. I can only imagine how painful that decision must be to all the folks working on fractal visualisations 🙂
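For context, this is where the pragma lands – a skeleton of a transform feedback vertex shader, not the actual pre-calc code; the uniform name and the loop body are placeholders of mine:

```glsl
#version 330 core

// Placeholder uniform - the real shader drives the loop with its own counters.
uniform int n_samples;

// Captured via transform feedback (rasterizer discard enabled on the GL side).
out vec3 sh_coeff;

void main()
{
    vec3 sum = vec3(0.0);

#pragma optionNV(unroll none)
    for (int n = 0; n < n_samples; ++n)
    {
        // ..per-sample integral approximation goes here..
        sum += vec3(float(n));
    }

    sh_coeff    = sum;
    gl_Position = vec4(0.0);
}
```

Without the pragma, the compiler attempts to unroll the loop; with it, the loop is emitted as an actual loop and the compile finishes promptly.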
That’s just a “once-over” kind of a call stack, if you ask me 🙂
Have you ever heard of Optimus technology? It’s a software/hardware solution where both an Intel and an NVIDIA GPU sit on the same board. From a strictly practical point of view, the main idea is to let the user enjoy the passive operation of the Intel GPU (and.. err, that’s about it 🙂 ), with the NVIDIA GPU kicking in when needed. The kicking-in is handled by an additional routing layer on top of the GL and DX stacks, which is responsible for forwarding all the calls to either of the implementations, depending on how resource-demanding the user’s actions are.
As far as theory goes, that’s about it.
The trickiest part for the driver is discovering when it is the right time to switch to the other GPU, without having the user cast his laptop onto the floor due to the rather unpleasant excessive heat suddenly starting to blow on his balls. And that’s where things get ugly. To cut the story short – if you expected complex heuristics or neural networks trying to work out where the threshold lies, you should go back to the drawing board. There is no such thing, and this can lead to a gazillion hours of your development time literally going down the drain, as there’s a HUGE difference between the capabilities of the two GPUs (not to mention the quality of their OpenGL implementations, but that’s another story..).
So. If you ever find yourself trying to resolve a bug occurring on one specific laptop only, with the laptop theoretically being capable of handling a given OpenGL or DirectX version (“At least that’s what the specs say, m’kay!“), you’d better check if it’s Optimus-enabled. If it is, either go to NVIDIA Control Panel and disable the devil OR read the following link. It really could save you a man-day 🙂
Looking for a solution for the following problem?:
fatal error LNK1112: module machine type 'X86' conflicts with target machine type 'AMD64'
Guess what: it’s brain-dead simple to solve! Just head over to Project Properties/Linker/Command Line. See the “Additional Options” box at the bottom? It’s got a “/MACHINE:X86” entry which shouldn’t be there. Just get rid of it and you’re set.
I’ve hit this problem so many times now that I think this post might actually save some of those precious hours you don’t have to spend hunting down such trivial bugs..
I’m still polishing the CamMuxer software for WeCan 2012. Up to now, data buffers reported by the intermittently running DirectShow graphs were cached in the process space and then passed to VRAM by means of glTexSubImage2D() calls, executed from within the main rendering pipeline handler. This was far from perfect, for at least the following reasons:
- it introduced unnecessary lag into the main rendering pipeline – this might seem like a minor issue, but there is a certain pitfall here. CamMuxer supports showing a slider (containing up to two lines of text) which slides in and out in a smoothstep manner. Any lag introduced into the pipeline delays the buffer swap operation, and hence causes a clearly noticeable jerkiness we do not want.
- it stalled the pipeline – glTex*Image*() calls are synchronous (unless you operate on buffer objects bound to the pixel unpack target instead of client-space pointers, in which case the transfer can take place without stalling the caller. This, unfortunately, is not the case in my software 🙂 )
The next step that I’ve been working on the last couple of days was rearranging things a bit by introducing worker threads. Each camera stream is assigned one worker thread. The thread creates an invisible window and binds a new OpenGL context to the window. The new context shares namespace with the main context. This allows us to move the hectic glTexImage*() calls to separate per-camera threads, effectively off-loading the main rendering pipeline and – in the end – providing a smoother watching experience.
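The shape of that per-camera worker thread, with the GL-specific parts (hidden window, shared context, the glTexSubImage2D() upload) replaced by stubs – a minimal sketch, and all names below are mine, not CamMuxer’s:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One of these per camera stream. The DirectShow graph thread pushes frames
// into the queue and returns immediately; the worker drains the queue and
// performs the (stubbed-out) texture upload off the main rendering thread.
struct CameraUploader
{
    std::queue<std::vector<unsigned char> > pending;
    std::mutex                              lock;
    std::condition_variable                 wake;
    bool                                    done = false;
    std::atomic<int>                        n_uploaded{0};
    std::thread                             worker;

    CameraUploader() : worker(&CameraUploader::run, this) {}

    ~CameraUploader()
    {
        {
            std::lock_guard<std::mutex> guard(lock);
            done = true;
        }
        wake.notify_one();
        worker.join();
    }

    // Called from the capture callback: hand the frame over and return at once.
    void push_frame(std::vector<unsigned char> frame)
    {
        {
            std::lock_guard<std::mutex> guard(lock);
            pending.push(std::move(frame));
        }
        wake.notify_one();
    }

private:
    void run()
    {
        // The real thing creates an invisible window here and binds a new GL
        // context sharing its namespace with the main one.
        for (;;)
        {
            std::unique_lock<std::mutex> guard(lock);

            wake.wait(guard, [this] { return done || !pending.empty(); });

            if (pending.empty() && done) break;

            std::vector<unsigned char> frame = std::move(pending.front());
            pending.pop();
            guard.unlock();

            // Stand-in for the glTexSubImage2D() call.
            (void)frame;
            ++n_uploaded;
        }
    }
};
```

The main rendering loop never touches the queue directly – it only samples the texture the worker keeps up to date, so a slow upload can no longer delay the buffer swap.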
I’d already done some work with context sharing during my recent commercial activities, so I knew most of the wiles OpenGL could throw at me. However, there was one thing that caught me off-guard and that I thought would be worth putting up on the blog:
If you happen to work with multiple contexts in multiple threads, and those threads render to off-screen textures in parallel using a shared program object, do make sure you serialize the process! No, really – double-check.
The not-so-obvious catch here is that context sharing does not really mean you can access all OpenGL objects from all contexts that use the common namespace. Some object types (like FBOs, for instance) are not shared, no matter how hard you try. Right, so far so good.
The thing I had forgotten myself was that program objects (which are shared) also carry the values of the uniforms associated with the program. Before running a program, you will often update at least one of those values. Boom! Good luck if you happen to be doing that from multiple contexts running in multiple threads 🙂 Wrapping the program execution routines in process-wide critical sections did the trick for me.
SITUATION UPDATE: Turns out that wrapping all the code using a program object inside a CS does not solve the problem, nor do forced glFlush() and glFinish() calls. It looks like a program object must explicitly be created in, and used within, a single context. If you don’t play ball, you get nasty uniform value leaks that appear to be caused by inexplicable forces. This might be NVIDIA-specific – I don’t recall seeing an explanation for this kind of behaviour in the GL 3.3 core profile specification 🙁
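A toy model of the gotcha, stripped of anything GL – none of these names are real API, it is just the shape of the problem: uniform values live inside the (shared) program object, not inside the context, so two threads doing set-uniform-then-draw on one instance trample each other. The per-thread arrangement modelled below is the one that finally worked for me:

```cpp
#include <cassert>
#include <thread>

// Stand-in for a GL program object: the uniform state is stored *inside* it.
struct FakeProgram
{
    int uniform_value = 0; // analogue of state set via glUniform*()

    int draw() const { return uniform_value; } // the "draw" reads the uniform
};

// One program instance per thread (in real code: created in that thread's own
// context), so the set-uniform-then-draw sequence cannot be corrupted by
// other threads.
int render_pass(int my_value, int n_draws)
{
    FakeProgram program; // per-thread / per-context instance
    int         n_errors = 0;

    for (int n = 0; n < n_draws; ++n)
    {
        program.uniform_value = my_value; // glUniform*() analogue

        if (program.draw() != my_value) // would fail intermittently if shared
        {
            ++n_errors;
        }
    }

    return n_errors;
}
```

With a single FakeProgram shared between the threads, the check inside the loop would fail sporadically – which is exactly the “uniform value leakage” described above.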
I’m really, really trying my best, but it goes beyond the scope of my understanding why the Microsoft LifeCam Cinema drivers have this odd custom of respawning 2 threads per second when accessed by means of the DirectShow capture filter they implement! (That might actually be somehow connected to the number of cores my CPU has, which is 4.)
It’s just spawn, spawn, kill, kill, spawn, spawn, kill, kill… Pew, pew, pew..
It only happens once you launch streaming, and it goes away the moment you pause or bring the stream to a full stop. The behaviour does not change whether or not you stream from other cameras in the same process.
So what goes on inside these weird threads? Apparently, they spend quite a lot of time plotting behind my back in the _vcomp_atomic_div_r8() function, nested within vcomp90.dll, which happens to be part of the OpenMP runtime. Could it be distributed computing gone wrong..?
Does atexit() / _onexit() happen not to work for you? Same in my case. Most of the explanations for this situation I could find on the web can be wrapped up in these two bullet-points:
* Things done wrong in the DLL_PROCESS_ATTACH handler;
* Not all library handles closed;
Guess what: there’s more to it! In my case (64-bit Windows 7, dynamically-linked library automatically pulled in at start-up by the application), the solution turned out to be very odd. Here’s an outline of what I was doing:
* Main thread spawns a child thread, which is used for message pumping purposes;
* (..all kinds of the usual application sorcery go in here..)
* Main thread decides it’s high time for the process to die.
* Main thread requests the window to close (sends a message to the window, then waits until a system event saying “Message pump thread waves good bye” is set by the pumping thread right before it quits)
* Main thread deinitializes stuff, calls ExitProcess()
Now, if you do a little bit of reading, you’d expect ExitProcess() to call back your handler(s) of choice in an instant. Nope, nada. In my case, what I was seeing was.. well, the process just killed itself for no apparent reason 🙂 No debugging output, no crashes, no breakpoints, no feedback from the CRT library, no nothing.
So what was the reason? As simple as this: you have to do a WaitForSingleObject() call on the main thread to make sure the message pump thread is already dead before you attempt to exit the main thread. What’s funnier – it doesn’t matter if you have other spawned threads running in the background – it’s the window-hosting threads that must no longer be present for the “on exit” callback to work.
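A portable sketch of the fix – the original code is Win32 (message pump thread, system event, WaitForSingleObject(), ExitProcess()), so std::thread::join() plays the role of the WaitForSingleObject() call here, and all names are mine. The rule survives the translation: the thread that hosted the window must be completely gone before the process starts exiting, or the atexit() handler may silently never run:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <thread>

static void on_exit_handler()
{
    // Without the join() below, on the setup described above this line was
    // never reached.
    std::puts("atexit handler reached");
}

static void message_pump()
{
    // ..pump window messages until told to quit (omitted)..
}

int run()
{
    std::atexit(on_exit_handler);

    std::thread pump(message_pump);

    // ..the usual application sorcery goes in here..

    pump.join(); // the WaitForSingleObject() equivalent - do NOT skip this

    return 0;
}
```

Only after run() returns (and the pump thread is provably dead) does the process begin its orderly shutdown, with the handler firing as documented.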
Hope this saves somebody a few hours I had to spend investigating this.
A quick thought for today evening:
Pushing 36,000 KB of data from VRAM to RAM per second by means of 180 glReadPixels() calls is a bad, bad idea..
*gasp*. I’ve always known that reading pixels off a color buffer of either type of FBO is not the gentlemanly way of behaving, but – to be honest – it’s the first time I’ve hit the dreaded bus throughput problem. The issue I’m getting is that the rendering output appears jerky, and the jittering happens at rather random intervals. Yes, random enough to ruin the whole smooth experience 🙂 Oh well, with 60 FPS set as the desired frame-rate, it was bound to happen.
Problem is, CamMuxer is already complex enough for the little pet project I intended it to be at the beginning, but it looks like there’s no other way to overcome the performance bottlenecks I’m seeing than by introducing PBOs into the pipeline, making the solution even more twisted than it already is.
The funny thing about it is that the jerky updates only come into play if you start moving the cursor around. My guess is that it could have something to do with more frequent, system-enforced context switching occurring due to the necessary window repaints. Could it be that Microsoft has finally started to hardware-accelerate GDI with the advent of Windows 7?
I’m continuing to work on the software for muxing images from various sources into one output using the GPU, adding some additional image processing along the way to make use of the horsepower my video card continues to offer.
What I finished plugging in this evening is per-camera feed content equalization. What this means is that each texture update triggers a histogram calculation, followed by a summed-area table calculation for the one-dimensional histogram texture generated in the previous pass, ending with the actual linearization of the texture.
In terms of bus usage, that’s at least 30+30+15 frames per second, which gives roughly 33.5 MB * 3 = 100.5 MB pumped into VRAM per second, with each frame being equalized right afterwards, converted to YUV, and then pumped back over the bus to RAM and sent to another process for streaming.
Statistics? DPCs never exceed 15%, CPU usage stays at 40% (quad-core), and the GPU remains idle 60% of the time.
It’s just incredible how powerful today’s desktops are!
Ever since I started working on various projects, be they commercial or not, I’ve always wanted to have a place to share the thoughts that emerge in the course of the process. A small blog that I could use for posting both short messages and some larger posts on 3D programming, and sometimes a few screen-shots. A place where I could publish a few pieces of information about myself, in case there is somebody out there up for a read, or where I could share a few apps I wrote, just for the sake of sharing.
With some hosting and set-up help from Ubik, here we go. Not much for starters, I admit, but feel free to visit the few pages I’ve prepared prior to the launch of this WordPress!