Q: What are some of the potential causes for this error reported verbosely by NViDiA drivers using context notification call-backs, apart from the obvious catches mentioned in OpenCL 1.1 specification?
A: Well, if you ever meet the funny fellow, please make double sure you:
– meet type alignment requirements (no, really, single floats must be aligned to 4 in your memory buffer!)
– do not cast global-space pointers of type A to private-space ones (which is the default behaviour if you skip the __global keyword!) of type B.
– edit: initialize all private variables. Yes. I have just fixed an issue that was causing awfully weird and totally unexpected side effects, just because a variable was not initialized and I had the audacity to query its value.. until now!
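The alignment rule from the first bullet can be checked on the host before a buffer is handed to the driver. What follows is only a minimal sketch in plain C, not anything taken from the OpenCL API: the helper names `is_aligned` and `align_up` are made up for illustration, and the check assumes the base address of the device buffer is itself suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of alignment.
 * alignment must be a non-zero power of two (4 for a single float). */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: rounds a byte offset up to the next multiple
 * of alignment, e.g. when packing a float after a 1-byte header. */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For instance, a float that would otherwise land at byte offset 5 of a packed buffer should be moved to `align_up(5, sizeof(float))`, which is 8.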
It’s just so EASY to ignore these rules, the driver acts like an attercop, being nothing but helpful, and then you suddenly have that “Oh shit, no!” moment out of nowhere when you suddenly learn the lesson and feel as depicted below.
The last couple of days were quite fun, thanks to a rather weird quirk in Lightwave SDK. Apparently, the software does its best to convince you it’s never a good idea to have a wish about being able to throw a couple of rays throughout the scene, unless you’re into volumetrics or you happen to be writing a shader implementation. If that’s the case – congratulations, you may stop holding your breath now 🙂
However, if you ever happen to be in a position where you’d like to cast a few rays around the scene from within that export plug-in of yours, especially the one that you have already spent a couple of hours on, coding it in native C instead of LWScript in order to get access to lower-level stuff, let me cut to the chase and get to the rotten core.
The question is: will it happen? Even if I’m only interested in “has it intersected with any kind of geometry?” information, and I don’t really give a single penny about color of the triangle that I have bombed with my ray? Well, the answer is:
And no, the logical trick as depicted below:
does not work either 😉 Ray-casting in Lightwave appears to be reserved for a very narrow set of plugins only and there’s nothing you can do about it. That is, unless you are desperate enough to hack your way through by writing yet another plug-in that is completely outside the scope of your original interest, but that would share the functionality with your export plug-in by some sort of inter-plugin request queueing. Let’s be honest though – it’s really last resort stuff that I personally would only reserve for commercial projects. It’s very likely to be a time-consuming, dull and painful experience and pet projects are supposed to be fun, right?
The limitation was a very nasty surprise and, frankly, Lightwave had me caught off guard for a moment. I simply needed the visibility information on a per-vertex basis in order to be able to do SH-based AO for meshes that I’d like to export using the tool and the last thing that would have crossed my mind is that such functionality is a no-go for export plug-ins.
Given that the triangle count for some of the scenes is very likely to exceed a million or two and I need to be able to cast at least 100,000 rays for every single point in a reasonable amount of time, I certainly didn’t want to go with a naive, brute-force O(N^2) approach.
So, the cunning plan for the next week that I came up with is:
* Implement CPU-based Kd tree generation for meshes;
* Implement parallelized GPU-based ray-triangle intersection kernel that makes use of the filled structure and allows you to throw a lot of rays in one go, filling a hit/didn’t hit cell for each ray;
* Using the two features above, implement visibility information exporting in the plug-in.
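To make the second bullet a bit more concrete, here is a rough CPU-side sketch in C of the per-ray test such a kernel could run. It uses the well-known Möller–Trumbore ray-triangle test, which is my pick for illustration and not necessarily what the final GPU kernel uses. The names `vec3`, `ray_hits_triangle` and `cast_rays` are hypothetical, and the loop walks every triangle by brute force instead of traversing the Kd tree.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } vec3;

static vec3 vec3_sub(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static vec3 vec3_cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static float vec3_dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Moller-Trumbore test: returns 1 if the ray (origin, dir) hits the
 * triangle (v0, v1, v2) in front of the origin, 0 otherwise. */
int ray_hits_triangle(vec3 origin, vec3 dir, vec3 v0, vec3 v1, vec3 v2)
{
    const float eps = 1e-7f;
    vec3 e1 = vec3_sub(v1, v0);
    vec3 e2 = vec3_sub(v2, v0);
    vec3 p = vec3_cross(dir, e2);
    float det = vec3_dot(e1, p);
    float inv_det, u, v, dist;
    vec3 t, q;

    if (det > -eps && det < eps)
        return 0;                      /* ray is parallel to the triangle */
    inv_det = 1.0f / det;

    t = vec3_sub(origin, v0);
    u = vec3_dot(t, p) * inv_det;      /* first barycentric coordinate */
    if (u < 0.0f || u > 1.0f)
        return 0;

    q = vec3_cross(t, e1);
    v = vec3_dot(dir, q) * inv_det;    /* second barycentric coordinate */
    if (v < 0.0f || u + v > 1.0f)
        return 0;

    dist = vec3_dot(e2, q) * inv_det;  /* distance along the ray */
    return dist > eps;
}

/* Fills one hit/didn't-hit cell per ray. triangle_vertices holds three
 * vertices per triangle. Brute force; a Kd tree would prune this loop. */
void cast_rays(const vec3 *origins, const vec3 *dirs, size_t ray_count,
               const vec3 *triangle_vertices, size_t triangle_count,
               unsigned char *hits)
{
    size_t r, t;

    assert(hits != NULL || ray_count == 0);
    for (r = 0; r < ray_count; ++r) {
        hits[r] = 0;
        for (t = 0; t < triangle_count && !hits[r]; ++t) {
            const vec3 *tri = &triangle_vertices[3 * t];
            hits[r] = (unsigned char)ray_hits_triangle(
                origins[r], dirs[r], tri[0], tri[1], tri[2]);
        }
    }
}
```

Since only the "did it hit anything?" answer matters for visibility, the inner loop bails out on the first hit instead of looking for the closest one.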
The first bullet-point is now ready 🙂 To prove it, here’s a quick screen-shot:
More are yet to come, of course – I’m really interested in seeing what the GPU performance is going to be like for this task!
Have you ever heard about Optimus technology? It’s a software/hardware solution where you have both Intel and NVIDIA GPUs located on the same board. From a strictly practical point of view, the main idea is to let the user enjoy the passive work of the Intel GPU (and.. err, that’s about it 🙂 ), with the NVIDIA GPU kicking in when needed. The kicking-in is handled by introducing an additional routing layer on top of the GL and DX stacks, which is responsible for forwarding all the calls to either of the implementations, depending on how resource-demanding the user’s actions are.
As far as theory is concerned, that’s about it.
The trickiest part for the driver is to discover when it is the right time to switch to the other GPU, without having the user cast his laptop on the floor due to rather unpleasant excessive heat suddenly starting to blow on his balls. And that’s where things get ugly. To cut the story short – if you expected complex heuristics or neural networks trying to work out where the threshold is, you should go back to the drawing board. There is no such thing and this can lead to a gazillion hours of your development time literally going down the drain, as there’s a HUGE difference between the capabilities of the two GPUs (not to mention the quality of the OpenGL implementation, but that’s another story..).
So. If you ever find yourself trying to resolve a bug occurring on a specific laptop only, with the laptop theoretically being capable of handling a given OpenGL or DirectX version (“At least that’s what the specs say, m’kay!”), you’d better check if it’s Optimus-enabled. If it is, either go to NVIDIA Control Panel and disable the devil OR read the following link. It really could save your man-day 🙂
I’m still polishing CamMuxer software for WeCan 2012. Up to now, data buffers reported by DirectShow graphs running intermittently were cached in the process space and then passed to VRAM by means of glTexSubImage2D() calls, executed from within the main rendering pipeline handler. This was far from perfect, for at least the following reasons:
* introduced unnecessary lags into the main rendering pipeline – this might seem like a minor issue but there is a certain pitfall here. CamMuxer supports showing a slider (containing up to two lines of text) which slides in and out in a smoothstep manner. Any lag introduced into the pipeline delays the buffer swap operation, and hence causes clearly noticeable jerkiness we do not want.
* stalled the pipeline – glTex*Image*() calls are synchronous (unless you operate on buffer objects bound to pixel unpack targets instead of client-space pointers, in which case the operation can take place without crossing GPU borders. This unfortunately is not the case in my software 🙂 )
The next step that I’ve been working on the last couple of days was rearranging things a bit by introducing worker threads. Each camera stream is assigned one worker thread. The thread creates an invisible window and binds a new OpenGL context to the window. The new context shares namespace with the main context. This allows us to move the hectic glTexImage*() calls to separate per-camera threads, effectively off-loading the main rendering pipeline and – in the end – providing a smoother watching experience.
I’ve already done some work with context sharing during my recent commercial activities, so I knew most of the wiles OpenGL could throw at me. However, there was one thing that caught me off-guard and that I thought would be worth putting up on the blog:
If you happen to work with multiple contexts working in multiple threads and you use threads that run in parallel to render off-screen textures using a shared program object, do make sure you serialize the process! No, really double-check.
The not-so-obvious catch in this case is that context sharing does not really mean you can access all OpenGL objects in all contexts that use common namespace. There are some object types that are not shared (like FBOs, for instance), no matter how hard you try. Right, so far so good.
The thing I’ve forgotten myself was that program objects (which are shared) are also assigned information regarding values of the uniforms associated with the program. Before running it, many times you happen to update at least one of the associated values. Boom! Good luck if you happen to be doing that from within multiple contexts running in multiple threads 🙂 Wrapping the program execution routines in process-wide critical sections did the trick for me.
SITUATION UPDATE: Turns out wrapping all the code using a program object inside a critical section does not solve the problem, nor do forced glFlush() and glFinish() calls. Looks like a program object must explicitly be created in and used within a single context. If you don’t play ball, you get nasty uniform value leakages that appear to be caused by inexplicable forces. This might be NVIDIA-specific; I don’t recall seeing an explanation for this kind of behaviour in the GL 3.3 core profile specification 🙁
I’m really, really trying my best, but it is beyond my understanding why the Microsoft LifeCam Cinema drivers have this odd custom of respawning 2 threads per second when accessed by means of the DirectShow capture filter they implement! (That actually might be somehow connected with the number of cores my CPU has, which is 4)
It only happens after you launch streaming and stops the moment you pause the stream or bring it to a full stop. The behavior does not change whether or not you stream from other cameras in the same process.
So what goes on beyond the weird threads’ trunk? Apparently, they seem to spend quite a lot of time plotting behind my back in _vcomp_atomic_div_r8() function, nested within vcomp90.dll, which happens to be a part of OpenMP. Could it be distributed computing gone wrong..?
Does atexit() / _onexit() accidentally happen not to work for you? Same here. Most of the explanations I could find for this situation on the web can be wrapped up in these two bullet-points:
* Things done wrong in the DLL_PROCESS_ATTACH handler;
* Not all library handles closed;
Guess what: there’s more to it! In my case (64-bit Windows 7, dynamically-linked library automatically sucked in on start-up by application) the solution turned out to be very odd. Here’s an outline of what I was doing:
* Main thread spawns a child thread, which is used for message pumping purposes;
* (..all kinds of the usual application sorcery go in here..)
* Main thread decides it’s high time for the process to die.
* Main thread requests the window to close (sends a message to the window, waits till a system event telling “Message pump thread waves good bye” is set by the pumping thread right before it quits)
* Main thread deinitializes stuff, calls ExitProcess()
Now, if you do a little bit of reading, you’d expect ExitProcess() to call back your handler(s) of choice in an instant. Nope, nada. In my case, what I was seeing was.. well, the process just killed itself without any apparent reason 🙂 No debugging output, no crashes, no breakpoints, no feedback from the CRT library, no nothing.
So what was the reason? As simple as that: you have to do a WaitForSingleObject() call on the main thread to make sure the message pump thread is already dead before you attempt to exit the main thread. What’s funnier – it doesn’t matter if you have other threads you spawned running in the background – window hosting threads must be no longer present in order for the “on exit” callback to work.
Hope this saves somebody a few hours I had to spend investigating this.
I’m continuing to work on the software for muxing image from various sources into one output using the GPU, adding some additional image processing along the way to make use of the horsepower my video card continues to offer.
What I finished plugging in this evening is per-camera feed content equalization. What this means is that each texture update triggers a histogram calculation, followed by a summed-area table calculation for the single-dimensional histogram texture generated in the previous pass, ending with the actual linearization of the texture.
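For the curious, the three passes boil down to something like the CPU-side sketch below. It is a plain C illustration only, assuming a single 8-bit channel; the real thing runs as GPU passes on textures, and the function name `equalize_u8` is made up. For a one-dimensional histogram, the summed-area table is simply a running (prefix) sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: equalizes a single 8-bit channel in place. */
void equalize_u8(uint8_t *pixels, size_t count)
{
    size_t histogram[256] = { 0 };
    size_t cdf[256];
    uint8_t lut[256];
    size_t i, running = 0, cdf_min = 0;

    assert(pixels != NULL || count == 0);
    if (count == 0)
        return;

    /* Pass 1: histogram of the intensity levels. */
    for (i = 0; i < count; ++i)
        histogram[pixels[i]]++;

    /* Pass 2: 1D summed-area table (prefix sum) of the histogram. */
    for (i = 0; i < 256; ++i) {
        running += histogram[i];
        cdf[i] = running;
        if (cdf_min == 0 && running != 0)
            cdf_min = running;         /* count of the darkest level present */
    }

    /* A frame with a single intensity level has nothing to stretch. */
    if (cdf_min == count)
        return;

    /* Pass 3: remap every level so the cumulative distribution becomes
     * (roughly) linear over 0..255, rounding to the nearest integer. */
    for (i = 0; i < 256; ++i) {
        if (cdf[i] < cdf_min) {
            lut[i] = 0;                /* level darker than anything present */
        } else {
            lut[i] = (uint8_t)(((cdf[i] - cdf_min) * 255 + (count - cdf_min) / 2)
                               / (count - cdf_min));
        }
    }
    for (i = 0; i < count; ++i)
        pixels[i] = lut[pixels[i]];
}
```

The look-up table is the part that maps nicely onto a GPU: once the prefix sum is available as a 256-texel texture, the last pass is a single dependent fetch per pixel.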
In terms of bus usage, that’s at least 30+30+15 frames per second, which gives roughly 33.5MB * 3 = 100.5MB pumped into VRAM per second, with each frame being equalized right afterward, converted to YUV and then pumped back over the bus to RAM and sent to another process for streaming.
Statistics? DPCs never exceed 15%, CPU usage stays at 40% (quad-core) and GPU remains idle 60% of the time.
It’s just incredible how powerful today’s desktops are!