| Task | #1 |
|---|---|
| Position | 9 |
| Score | 4,788 |
| Best Score | 2,911 |
| Points | 0.608 |
Problem Statement
You have a data stream of numbers in string representation being sent to STDIN. Each line of the stream has one number in the interval [0; 2147483647]. For example:
629871117
2024562523
1372689083
1021777120
2111176472
Try to find the sum of these numbers in the shortest time. In case of an integer overflow, just ignore it.
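For reference, here is a completely naive baseline (my own sketch, assuming "ignore overflow" means letting the unsigned sum wrap); everything below is about beating per-character loops like this one:

```cpp
#include <cstdio>
#include <cstdint>

int main() {
    uint32_t sum = 0;               // unsigned wrap-around: overflow is "ignored"
    char line[16];                  // max value is 10 digits + '\n' + NUL
    while (std::fgets(line, sizeof line, stdin)) {
        uint32_t n = 0;
        for (const char *p = line; *p >= '0' && *p <= '9'; ++p)
            n = n * 10 + static_cast<uint32_t>(*p - '0');
        sum += n;
    }
    std::printf("%u\n", sum);
}
```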
Input Generator
The input is uniformly random, and there are 50,000,000 integers.
Solution Sketch
This was the first problem I solved. After I hit a performance wall with the movemask + tzcnt and shift approach, I gave Matt Stuchlik’s solution a try. It took me a while to understand and work through it, and I’m glad it’s so well documented, because I didn’t know any SIMD programming at the time.
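For context, this is roughly the shape of that first approach (a sketch, not my contest code; needs SSE2 + BMI1): compare a 16-byte chunk against `'\n'`, movemask the result, and tzcnt to find the digit count.

```cpp
#include <immintrin.h>
#include <cstdint>

// One step of the movemask + tzcnt idea. Since a number is at most
// 10 digits plus '\n', a newline always falls inside the 16-byte
// window when p points at the start of a number.
inline const char *consume_one(const char *p, uint32_t &sum) {
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
    __m128i nl    = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\n'));
    unsigned len  = _tzcnt_u32(static_cast<unsigned>(_mm_movemask_epi8(nl)));
    uint32_t n = 0;
    for (unsigned i = 0; i < len; ++i)      // scalar digit fold
        n = n * 10 + static_cast<uint32_t>(p[i] - '0');
    sum += n;
    return p + len + 1;                     // skip past the '\n'
}
```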
The key part of Matt’s solution is how the LUT is generated. One of the comments in it says:
> Now we dereference the location to get the two mappings. This is a point where you should be a little confused:
>
> - We’re dereferencing the index by itself, not as an offset into an array base.
> - The total range of the `lut_index` variable is very roughly 0 to 2^42, a 42-bit address space or some 4 TB. That’s more than the 500 MB RAM we have available and more than you could create with a normal array literal.
>
> On the positive side, the lookup table is very sparse: thanks to the specific input distribution, we only have O(1,000s) chunks of mappings spread over the whole 42-bit address space. This is the part that I’ll explain in more detail in a follow-up post, but for now you can imagine it as if we `mmap` and `memcpy` all the mappings to the right addresses at the start of the program. Let me know if you think you know how to do it ~zero-cost! :)
>
> Before I came up with this approach I used a derived index based on the size of the numbers in the chunk using chained `tzcnt`. This shrinks the index space to (barely) fit in a normal lookup table, but the index computation was one long dependency chain with high-latency instructions and ended up being almost 50% of the runtime.
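Here is a minimal sketch of the mmap + memcpy variant (Linux-specific; the `LutChunk` descriptor is hypothetical, and real code would coalesce chunks that share a page before mapping):

```cpp
#include <sys/mman.h>
#include <cstring>
#include <cstdint>
#include <cstddef>

// Hypothetical chunk descriptor produced offline: the raw mapping bytes
// and the absolute address the index arithmetic expects them at.
struct LutChunk { uintptr_t addr; const unsigned char *data; size_t size; };

// Back only the pages that actually hold mappings; the rest of the
// ~4 TB index space stays unmapped.
void install_lut(const LutChunk *chunks, size_t n) {
    const uintptr_t page = 4096;
    for (size_t i = 0; i < n; ++i) {
        uintptr_t base = chunks[i].addr & ~(page - 1);
        uintptr_t end  = (chunks[i].addr + chunks[i].size + page - 1) & ~(page - 1);
        // MAP_FIXED_NOREPLACE (Linux >= 4.17) pins the pages exactly where
        // the dereference expects them. If this fails, an earlier chunk
        // already mapped the page, and the memcpy below still works.
        mmap(reinterpret_cast<void *>(base), end - base,
             PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
        std::memcpy(reinterpret_cast<void *>(chunks[i].addr),
                    chunks[i].data, chunks[i].size);
    }
}
```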
I ended up doing the mmap+memcpy. The shuffle masks themselves were precomputed by generating the input ~1000 times into a ring buffer, striding through it in reverse, and recording which masks came up and how the bytes would need to be shuffled. That, combined with some aggressive unrolling and storing only part of the LUT to memory, got me to 4.8k.
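To give a flavor of what one stored mapping does, here is my reconstruction (hypothetical names, SSE4.1, and only the up-to-8-digit case; the full [0; 2147483647] range needs up to 10 digits): the LUT’s shuffle mask right-aligns a number’s ASCII digits into fixed lanes, and two fixed multiply-add steps fold them into the value.

```cpp
#include <immintrin.h>
#include <cstdint>

// lut_shuffle right-aligns the number's ASCII digits in the low 8 lanes
// (0x80 control bytes zero the leading lanes), so the same fixed folding
// sequence works for any digit count up to 8.
uint32_t parse_with_mapping(__m128i chunk, __m128i lut_shuffle) {
    __m128i d = _mm_shuffle_epi8(chunk, lut_shuffle);
    d = _mm_subs_epu8(d, _mm_set1_epi8('0'));   // ASCII -> digit; padding stays 0
    // Fold adjacent digit pairs: d0*10 + d1, d2*10 + d3, ...
    __m128i pairs = _mm_maddubs_epi16(d, _mm_set1_epi16(0x010A));
    // Fold adjacent pairs again: p0*100 + p1 and p2*100 + p3.
    __m128i quads = _mm_madd_epi16(pairs, _mm_set1_epi32(0x00010064));
    uint32_t hi = static_cast<uint32_t>(_mm_cvtsi128_si32(quads));     // digits 1-4
    uint32_t lo = static_cast<uint32_t>(_mm_extract_epi32(quads, 1));  // digits 5-8
    return hi * 10000 + lo;
}
```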
Thoughts
LUT construction does take up a nontrivial amount of time, because you can’t mmap more than 40 GB of memory at a time. My zero-cost guess would be to somehow embed the lookup-table data into the binary using the linker.
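One way I could imagine realizing that (a sketch I never tried, GNU toolchain assumed; `lut.bin` is a hypothetical dump of the mappings) is pulling the raw bytes into the binary with the assembler’s `.incbin` directive:

```cpp
// Top-level asm embeds the raw table into a read-only section; the
// loader then pages it in lazily from the ELF image at startup.
asm(R"(
    .section .rodata.lut, "a"
    .balign 4096
    .global lut_begin
lut_begin:
    .incbin "lut.bin"
    .global lut_end
lut_end:
)");

extern const unsigned char lut_begin[], lut_end[];
```

This would remove the copy, but the dereference-by-index trick also needs the data at specific absolute addresses, which would take a linker script on top.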
The final lookup table had to be stored entirely in source, since I didn’t have a nice way of generating it with constexpr. This led to a 2 MB header file I had to include, which made compilation times spiral out of control. On top of that, clangd would crash every time I tried to edit this solution, and clang-format would delete chunks of my code to save itself from formatting that mountain of excrement. Since this was my best solution, I didn’t want to throw it away and start from scratch; it felt like way too much effort. In hindsight, I should have started afresh for a chance to do much better.