OP Performance Progress

I think that this next release will probably be ready in a week or so. I am running out of things I wanted to change or ideas I wanted to try within the realm of "performance", so it is probably time to wrap this up and move on to the next release.

So, how does it look? Well, there have been some substantial improvements. It is now at the point where I do profile runs against a server with 40 local clients modifying state to see how it performs and, while it does sometimes drop a tick, these cases are pretty rare and it performs reasonably and quite consistently.

As mentioned in the previous update, a lot of changes were made to the core TickRunner merge logic, and that did improve some things. On top of that, more detailed performance timers have been added to show why a tick was slow, when one is (not enough to give you a smoking gun, but at least enough to identify the specific part of a phase).

Much of the block read/write performance was improved through the introduction of batching interfaces. For reading, this mostly improved anything related to movement or collision. For writing, this dramatically improved cuboid generation.
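As a rough illustration of why a batching interface can help block reads, here is a minimal sketch. Everything in it is hypothetical (the class name, the 32-block cuboid edge, the map-based storage) and is not the project's actual API. The idea shown is only that a batch call can reuse the last cuboid it resolved instead of repeating the lookup for every block.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names, sizes, and storage layout are assumptions for illustration.
final class BatchReadSketch {
    static final int EDGE = 32; // assumed cuboid edge length, in blocks
    private final Map<Long, short[]> cuboids = new HashMap<>();
    int lookups = 0; // counts cuboid resolutions so the saving is visible

    // Packs the cuboid address containing (x, y, z) into one long (21 bits per axis).
    private static long key(int x, int y, int z) {
        long cx = Math.floorDiv(x, EDGE);
        long cy = Math.floorDiv(y, EDGE);
        long cz = Math.floorDiv(z, EDGE);
        return ((cx & 0x1FFFFF) << 42) | ((cy & 0x1FFFFF) << 21) | (cz & 0x1FFFFF);
    }

    // Index of the block within its cuboid's flat array.
    private static int index(int x, int y, int z) {
        return (Math.floorMod(x, EDGE) * EDGE + Math.floorMod(y, EDGE)) * EDGE
                + Math.floorMod(z, EDGE);
    }

    private short[] lookup(long key) {
        lookups += 1;
        return cuboids.computeIfAbsent(key, k -> new short[EDGE * EDGE * EDGE]);
    }

    void writeBlock(int x, int y, int z, short value) {
        lookup(key(x, y, z))[index(x, y, z)] = value;
    }

    // Per-block read: resolves the cuboid on every call.
    short readBlock(int x, int y, int z) {
        return lookup(key(x, y, z))[index(x, y, z)];
    }

    // Batch read: resolves a cuboid only when the address differs from the previous one.
    short[] readBatch(int[][] locations) {
        short[] out = new short[locations.length];
        long lastKey = 0;
        short[] last = null;
        for (int i = 0; i < locations.length; i++) {
            int[] l = locations[i];
            long k = key(l[0], l[1], l[2]);
            if (last == null || k != lastKey) {
                last = lookup(k);
                lastKey = k;
            }
            out[i] = last[index(l[0], l[1], l[2])];
        }
        return out;
    }
}
```

Movement and collision checks tend to read clusters of neighbouring blocks, so most reads in a batch would land in the same cuboid and skip the lookup.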

Additionally, the only "big design change" which has been made is that the storage layer has been completely redone, which should improve on-disk file organization and dramatically reduce the "lots of small files" problem, which punishes the filesystem.

The default thread count is now auto-chosen based on configuration and how many resources the machine has (before, it was a hard-coded default of "4 TickRunner threads").
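For illustration only, auto-choosing a thread count could look something like the sketch below. The class, the method, and the "leave one core free" rule are all assumptions of mine, not the project's actual policy. The one real API relied on is `Runtime.getRuntime().availableProcessors()`, which a caller would use to supply the core count.

```java
// Hypothetical sketch: the policy and names here are assumptions, not the project's logic.
final class ThreadCountChooser {
    // Returns the configured count when it is positive; otherwise derives one from the cores.
    static int chooseThreadCount(int configuredCount, int availableCores) {
        if (configuredCount > 0) {
            // An explicit configuration value always wins.
            return configuredCount;
        }
        // Assumed rule: leave one core for network and storage work, but never go below 1.
        return Math.max(1, availableCores - 1);
    }
}
```

A caller would typically pass `Runtime.getRuntime().availableProcessors()` as the second argument and treat a configured value of 0 as "choose for me".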

Currently, I am looking into what client-side changes are worth making for the graphics or event handling system. Events seem fine, so far, and I have made some small changes to how cuboids are selected for rendering, which provided a nice improvement. Still, I have a few other ideas I should try, but I am not sure how far I should go before I end up with diminishing returns or something "too exotic", such that maintenance or understanding becomes a pain.

One thought is to combine all entity rigged limbs back into a single vertex array, just with additional binding data, so that all the positioning can be done on the GPU. This will reduce the number of draw calls per entity from 5 to 1, at the cost of another uniform and much more vertex data. I am not sure if this will help much, if it is even worth fixing, or if this will just cause flexibility issues in the future. Note that this would still not include combining all of the entities into a single vertex array (as I want them to animate at the frame rate, not some other cadence).

Another idea would be to maybe fold all passive items (or at least the items on the ground) into a single vertex array and just push the updates to the GPU when things are added/removed. This wouldn't be too difficult but the benefit is also uncertain.

A last idea I have been toying with is to dynamically combine cuboid meshes when they seem to not change much, so that there are fewer draw calls or state change calls (since each cuboid mesh is its own array, it requires its own bind and vertex attrib calls). This wouldn't be too difficult and would probably provide a nice performance boost, but it may increase memory usage since it would require that the system storage of each cuboid mesh be kept alive (although I think that this is off-heap, so it is a waste, but wouldn't increase on-heap GC pressure).
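To make the mesh-combining idea a bit more concrete, here is a minimal sketch of the selection and merge step. It assumes each cuboid mesh is a flat float array already in world coordinates, plus a counter of frames since it last changed. All of these names and the threshold are hypothetical, and the actual GPU upload is left out entirely.

```java
import java.util.List;

// Hypothetical sketch: data layout, names, and the stability rule are assumptions.
final class MeshCombineSketch {
    static final class CuboidMesh {
        final float[] vertices;      // assumed to already be in world coordinates
        final int framesSinceChange; // how long this mesh has gone unmodified

        CuboidMesh(float[] vertices, int framesSinceChange) {
            this.vertices = vertices;
            this.framesSinceChange = framesSinceChange;
        }
    }

    // Concatenates the vertex data of every mesh which has been stable for at least
    // stableThreshold frames. Meshes which changed recently are added to stillSeparate
    // so that they keep being drawn (and rebuilt) individually.
    static float[] combineStable(List<CuboidMesh> meshes, int stableThreshold,
            List<CuboidMesh> stillSeparate) {
        int total = 0;
        for (CuboidMesh mesh : meshes) {
            if (mesh.framesSinceChange >= stableThreshold) {
                total += mesh.vertices.length;
            }
        }
        float[] combined = new float[total];
        int offset = 0;
        for (CuboidMesh mesh : meshes) {
            if (mesh.framesSinceChange >= stableThreshold) {
                System.arraycopy(mesh.vertices, 0, combined, offset, mesh.vertices.length);
                offset += mesh.vertices.length;
            } else {
                stillSeparate.add(mesh);
            }
        }
        return combined;
    }
}
```

The source arrays have to stay alive so that the combined array can be rebuilt when one of its members changes, which is the extra memory cost mentioned above.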

Measuring these benefits is also a tricky thing, since the client is interactive and using real graphics hardware. I suspect that I might need to create a hidden set of "profile worlds" which just set up an interesting environment, just to draw the frames for profile measurement, non-interactively.

So, close to the end of the plan, and looking pretty good, but still a few things to consider,
Jeff.