Having completely rewritten the low-level parts of the engine with a brand-new approach (details available at beer gardens at demo-scene parties I’ll be attending in the near future 🙂 ), I finally started working on a new set of effects I intend to use in Elude’s next PC demo.
My current area of interest is a GPU-only, Spherical Harmonics-based lighting implementation. Since the approach more or less boils down to approximating a gazillion+1 integrals, the pre-calc shaders tend to run long loops. The problem I ran into today was that after issuing a transform feedback-enabled glDrawArrays() call with the “rasterizer discard” mode enabled, the user-mode layer of the OpenGL implementation started eating lots of memory (>1 GB and climbing…) and blocked on the call. This is quite unusual, given the largely asynchronous nature of NVIDIA’s OpenGL implementation, which tends to enqueue incoming commands in the command buffer and return control to the caller in an instant!
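For readers who haven’t used this path before, the kind of pre-calc pass described above looks roughly like the sketch below: rasterization is disabled, so only the vertex shader runs, and its outputs are captured straight into a buffer object. This is a hedged illustration, not Emerald’s actual code – `tf_buffer_object` and `n_sh_samples` are hypothetical names, and program/buffer setup plus error checking are omitted.

```c
/* Sketch of a transform feedback pre-calc pass (assumes a linked program
 * with TF varyings is already bound and tf_buffer_object is allocated). */
glEnable(GL_RASTERIZER_DISCARD);                  /* no fragments needed   */
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER,    /* capture target        */
                 0, tf_buffer_object);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, n_sh_samples);         /* one VS run per sample */
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
```

With this setup the draw call normally returns almost immediately – which is exactly why the blocking behaviour described above stood out.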
Let me interrupt at this point and shed some light on a different aspect that made the problem even weirder: Emerald stashes GL program blobs on the hard drive after linking finishes successfully, so that future linking requests will not (theoretically) require full recompilation of the attached shaders – assuming their hashes have not changed, of course, in which case the engine takes the usual path.
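For the curious: this kind of blob stashing is typically built on ARB_get_program_binary. The sketch below is a hedged approximation of the scheme, not Emerald’s implementation – file I/O, hashing and error handling are left out, and `program` / `blob` are illustrative names.

```c
/* After a successful link, pull the binary out and persist it: */
GLint  length = 0;
GLenum format = 0;

glProgramParameteri(program, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE);
glLinkProgram(program);

glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
void* blob = malloc(length);
glGetProgramBinary(program, length, NULL, &format, blob);
/* ...store (format, length, blob) alongside the shaders' hashes... */

/* On a later run, if the hashes still match, skip recompilation: */
glProgramBinary(program, format, blob, length);
free(blob);
```

Note that the retrievable hint has to be set before linking, and a driver is free to reject a stale blob (e.g. after a driver update), in which case the engine falls back to a full recompile.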
Coming back to the problem: I played around with the shader, commenting out various bits and pieces and checking how each modification affected the driver’s behaviour. It wasn’t long until I discovered the culprit was the large number of iterations the loop performed. One way to get around this would be to rework the approach and break the loop down into multiple passes, but that didn’t sound like a good idea. I expect the implementation to be rather demanding in terms of hardware, and introducing CPU interventions where they are not strictly necessary is something I’d prefer to avoid.
After doing some hunting on them nets, I found a solution that appears to be NVIDIA-only. It works for me, but I have no AMD hardware at hand to check whether their compiler also honours the pragma that altered the behaviour, so I can’t tell if it’s portable – I can’t find any OpenGL extension that would describe the feature, which is probably a bad omen…
Turns out that what needs to be done is to use:
#pragma optionNV(unroll none)
right before the loop in question. This prevents the shader compiler from endlessly trying to unroll the loop, forcing it into the “Just don’t!” line of thinking. The default behaviour has a very large OOM potential, and I’d really like to learn why NVIDIA decided to go with such deep unrolls. I can only imagine how painful that decision must be for all the folks working on fractal visualisations 🙂
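In context, the pragma ends up looking something like the sketch below. This is a hypothetical shape of such a pre-calc vertex shader, not the actual one – the loop bound and the `sh_coeffs` output are illustrative names:

```glsl
#version 330

out vec4 sh_coeffs; // captured via transform feedback

void main()
{
    vec4 sum = vec4(0.0);

// Tell the NVIDIA compiler to leave the loop rolled,
// instead of trying to unroll all the iterations:
#pragma optionNV(unroll none)
    for (int n = 0; n < 100000; ++n)
    {
        // ...accumulate one sample of the integral into sum...
    }

    sh_coeffs = sum;
}
```

The pragma only affects the loop that follows it, so other, shorter loops in the same shader can still be unrolled as usual.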