It has been a breath of fresh air, getting to play in the world of performance profiling and optimization again.
While most of my initial ideas (more aggressive data caching) bore little fruit (the benefit only slightly outweighed the cost), the work has at least improved the organization of the code in some of the more complicated areas. For example, much of the previous state-merging logic between parallel phases had evolved into something nigh-on incomprehensible after several iterations of design changes and newly introduced concepts. This investigation became the impetus to dramatically refactor it into distinct phases, which actually did improve performance slightly: a bunch of redundant data traversals were coalesced.
Now, however, I am on to the next phase, where I need to start thinking about more exotic ways of solving problems. One such problem is the ever-present cost of reading a block type from storage. The octree storage representation is incredibly compact and ideal for walking the entire cuboid (as the client does for mesh building), but it is very slow for reading or writing a specific block somewhere in the middle or at the end of the cuboid. On top of that, the path length down to that point further smears the profile data.
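To make that cost concrete, here is a toy sketch (not the engine's actual storage code; all names are mine) of why random reads hurt: a whole-cuboid walk visits each node once, but every single-block read pays a full root-to-leaf descent.

```java
// Toy octree sketch (hypothetical, simplified): a node is either a uniform
// leaf or has 8 children. Walking the whole cuboid amortizes the tree
// traversal; reading one block repeats the descent from the root every time.
final class Octree {
    final int blockType;      // valid when children == null (uniform leaf)
    final Octree[] children;  // 8 octants, or null for a leaf

    Octree(int blockType) { this.blockType = blockType; this.children = null; }
    Octree(Octree[] children) { this.blockType = -1; this.children = children; }

    // Read one block: O(depth) pointer chases per call.
    int get(int x, int y, int z, int size) {
        Octree node = this;
        while (node.children != null) {
            size >>= 1;
            int octant = (x >= size ? 1 : 0)
                       | (y >= size ? 2 : 0)
                       | (z >= size ? 4 : 0);
            if (x >= size) x -= size;
            if (y >= size) y -= size;
            if (z >= size) z -= size;
            node = node.children[octant];
        }
        return node.blockType;
    }
}
```

The per-read descent is what smears the profile: every caller of `get` pays the same pointer-chasing tax.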
So now I have started to wonder whether the multiple levels of high-level caches and the clever cache reuse have been mostly exhausted as optimization opportunities. Maybe I need to start treating this more like a database: query it in batches rather than reading it like a basic data map.
This raises the question of how to implement it. Ideally, I could fit this into the existing Java Streams API, but I am not sure I would benefit from that, since I am the only consumer and I can probably do better in my specific use case. Either way, this could be fun.
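As a rough sketch of what I mean (the names and the flat-array backing store are hypothetical stand-ins, not my real code), the batch interface might look something like this: accept many positions at once, put them into a coherent order, and answer them in one pass.

```java
import java.util.Arrays;

// Hypothetical batch-read API. The per-block path pays a full lookup per
// call; the batch path sorts the requests (in a real octree, something like
// Morton order) so one coherent pass can answer all of them.
interface BlockSource {
    int getBlock(int index);                 // per-block read: one lookup each

    default int[] getBlocks(int[] indices) { // batch read: sorted, one pass
        int[] sorted = indices.clone();
        Arrays.sort(sorted);                 // coherent access order
        // A real implementation would walk the octree once here, sharing
        // descent prefixes between adjacent requests.
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) out[i] = getBlock(sorted[i]);
        return out;
    }
}
```

The Streams equivalent would be roughly `IntStream.of(indices).sorted().map(src::getBlock).toArray()`, which is part of why I suspect Streams buys little here: the interesting work is inside the traversal, not in the pipeline plumbing, and a bespoke loop keeps control over ordering and allocation.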
Further, there is the bigger question of how much of the system's high-level logic would need to be rewritten to capitalize on this. On one hand, I could just introduce a pre-pass that uses this query system to pre-warm one of the high-level caches (then proceed with the existing logic); on the other, I could look at a more aggressive rewrite that pushes more logic into the query itself. Of course, I could do the first in order to open the door to the second.
More thought is needed on this, but I am happy to find my thoughts turning to this more complete shift in approach at this point,
Jeff.