| Task | #1 |
|---|---|
| Position | 9 |
| Score | 4,788 |
| Best Score | 2,911 |
| Points | 0.608 |
Problem Statement
You have a data stream of numbers in string representation being sent to STDIN. Each line of the stream has one number in the interval [0; 2147483647]. For example:
629871117
2024562523
1372689083
1021777120
2111176472
Try to find the sum of these numbers in the shortest time. In case of an integer overflow, just ignore it.
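For reference, here is a completely naive baseline (my own sketch, assuming "ignore overflow" means letting the unsigned sum wrap); everything below is about beating per-character loops like this one:

```cpp
#include <cstdio>
#include <cstdint>

int main() {
    uint32_t sum = 0;               // unsigned wrap-around: overflow is "ignored"
    char line[16];                  // max value is 10 digits + '\n' + NUL
    while (std::fgets(line, sizeof line, stdin)) {
        uint32_t n = 0;
        for (const char *p = line; *p >= '0' && *p <= '9'; ++p)
            n = n * 10 + static_cast<uint32_t>(*p - '0');
        sum += n;
    }
    std::printf("%u\n", sum);
}
```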
Input Generator
The input is uniformly random, and there are 50,000,000 integers.
Solution Sketch
This was the first problem I solved. After I hit a performance wall with the movemask + tzcnt and shift approach, I gave Matt Stuchlik’s solution a try. It took me a while to understand and work through it, and I’m glad it’s so well documented, because I didn’t know any SIMD programming at the time.
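For context, this is roughly the shape of that first approach (a sketch, not my contest code; needs SSE2 + BMI1): compare a 16-byte chunk against `'\n'`, movemask the result, and tzcnt to find the digit count.

```cpp
#include <immintrin.h>
#include <cstdint>

// One step of the movemask + tzcnt idea. Since a number is at most
// 10 digits plus '\n', a newline always falls inside the 16-byte
// window when p points at the start of a number.
inline const char *consume_one(const char *p, uint32_t &sum) {
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
    __m128i nl    = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\n'));
    unsigned len  = _tzcnt_u32(static_cast<unsigned>(_mm_movemask_epi8(nl)));
    uint32_t n = 0;
    for (unsigned i = 0; i < len; ++i)      // scalar digit fold
        n = n * 10 + static_cast<uint32_t>(p[i] - '0');
    sum += n;
    return p + len + 1;                     // skip past the '\n'
}
```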
The key part of Matt’s solution is how the LUT is generated. One of the comments in it says:
> Now we dereference the location to get the two mappings. This is a point where you should be a little confused:
>
> - We’re dereferencing the index by itself, not as an offset into an array base.
> - The total range of the `lut_index` variable is very roughly 0 to 2^42, a 42-bit address space or some 4 TB. That’s more than the 500 MB RAM we have available and more than you could create with a normal array literal.
>
> On the positive side, the lookup table is very sparse: thanks to the specific input distribution, we only have O(1,000s) chunks of mappings spread over the whole 42-bit address space. This is the part that I’ll explain in more detail in a follow-up post, but for now you can imagine it as if we `mmap` and `memcpy` all the mappings to the right addresses at the start of the program. Let me know if you think you know how to do it ~zero-cost! :)
>
> Before I came up with this approach I used a derived index based on the size of the numbers in the chunk using chained `tzcnt`. This shrinks the index space to (barely) fit in a normal lookup table, but the index computation was one long dependency chain with high-latency instructions and ended up being almost 50% of the runtime.
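Here is a minimal sketch of the mmap + memcpy variant (Linux-specific; the `LutChunk` descriptor is hypothetical, and real code would coalesce chunks that share a page before mapping):

```cpp
#include <sys/mman.h>
#include <cstring>
#include <cstdint>
#include <cstddef>

// Hypothetical chunk descriptor produced offline: the raw mapping bytes
// and the absolute address the index arithmetic expects them at.
struct LutChunk { uintptr_t addr; const unsigned char *data; size_t size; };

// Back only the pages that actually hold mappings; the rest of the
// ~4 TB index space stays unmapped.
void install_lut(const LutChunk *chunks, size_t n) {
    const uintptr_t page = 4096;
    for (size_t i = 0; i < n; ++i) {
        uintptr_t base = chunks[i].addr & ~(page - 1);
        uintptr_t end  = (chunks[i].addr + chunks[i].size + page - 1) & ~(page - 1);
        // MAP_FIXED_NOREPLACE (Linux >= 4.17) pins the pages exactly where
        // the dereference expects them. If this fails, an earlier chunk
        // already mapped the page, and the memcpy below still works.
        mmap(reinterpret_cast<void *>(base), end - base,
             PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
        std::memcpy(reinterpret_cast<void *>(chunks[i].addr),
                    chunks[i].data, chunks[i].size);
    }
}
```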
I ended up doing the mmap+memcpy. The shuffle masks themselves were precomputed by generating the input ~1000 times into a ring buffer, striding through it in reverse, and recording which masks came up and how the bytes would need to be shuffled. That, combined with some aggressive unrolling and storing only part of the LUT to memory, got me to 4.8k.
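To give a flavor of what one stored mapping does, here is my reconstruction (hypothetical names, SSE4.1, and only the up-to-8-digit case; the full [0; 2147483647] range needs up to 10 digits): the LUT’s shuffle mask right-aligns a number’s ASCII digits into fixed lanes, and two fixed multiply-add steps fold them into the value.

```cpp
#include <immintrin.h>
#include <cstdint>

// lut_shuffle right-aligns the number's ASCII digits in the low 8 lanes
// (0x80 control bytes zero the leading lanes), so the same fixed folding
// sequence works for any digit count up to 8.
uint32_t parse_with_mapping(__m128i chunk, __m128i lut_shuffle) {
    __m128i d = _mm_shuffle_epi8(chunk, lut_shuffle);
    d = _mm_subs_epu8(d, _mm_set1_epi8('0'));   // ASCII -> digit; padding stays 0
    // Fold adjacent digit pairs: d0*10 + d1, d2*10 + d3, ...
    __m128i pairs = _mm_maddubs_epi16(d, _mm_set1_epi16(0x010A));
    // Fold adjacent pairs again: p0*100 + p1 and p2*100 + p3.
    __m128i quads = _mm_madd_epi16(pairs, _mm_set1_epi32(0x00010064));
    uint32_t hi = static_cast<uint32_t>(_mm_cvtsi128_si32(quads));     // digits 1-4
    uint32_t lo = static_cast<uint32_t>(_mm_extract_epi32(quads, 1));  // digits 5-8
    return hi * 10000 + lo;
}
```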
Thoughts
LUT construction does take up a nontrivial amount of time, because you can’t mmap more than 40 GB of memory at a time. My zero-cost guess would be to somehow embed the lookup-table data into the binary using the linker.
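One way I could imagine realizing that (a sketch I never tried, GNU toolchain assumed; `lut.bin` is a hypothetical dump of the mappings) is pulling the raw bytes into the binary with the assembler’s `.incbin` directive:

```cpp
// Top-level asm embeds the raw table into a read-only section; the
// loader then pages it in lazily from the ELF image at startup.
asm(R"(
    .section .rodata.lut, "a"
    .balign 4096
    .global lut_begin
lut_begin:
    .incbin "lut.bin"
    .global lut_end
lut_end:
)");

extern const unsigned char lut_begin[], lut_end[];
```

This would remove the copy, but the dereference-by-index trick also needs the data at specific absolute addresses, which would take a linker script on top.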
The final lookup table had to be stored entirely in source, since I didn’t have a nice way of generating it with constexpr. This led to a 2 MB header file I had to include, which made compilation times spiral out of control. On top of that, clangd would crash every time I tried to edit this solution, and clang-format would delete chunks of my code to save itself from formatting that mountain of excrement. Since this was my best solution, I didn’t want to throw it away and start from scratch; it felt like way too much effort. In hindsight, I should have started afresh for a chance to do much better.