Basics

Rules

There are 22 challenges (20 when I started). Each challenge is like a LeetCode problem, with some constraints:

You have to process a large amount of input (couple hundred MB to a GB), and generate a large amount of output. Both stdin/stdout are hugepage backed in-memory files mounted on a tmpfs. The only exception is Sort UUIDs, where I/O goes through disk.
Your solution is run 9 times on 9 freshly-generated inputs, and has to provide a correct output each time to be valid.

For each valid solution, your score is the median time taken by your runs. Your best median time determines your leaderboard position.

mmap’ing the input

This is the bare minimum you have to do to be competitive. The wiki shows how to do this. In openload, the environment does this for you by default.

    #include <unistd.h>
    #include <sys/mman.h>
    ...
    off_t fsize = lseek(0, 0, SEEK_END);
    char* buffer = (char*)mmap(0, fsize, PROT_READ, MAP_PRIVATE | MAP_POPULATE, 0, 0);

Microarchitecture basics

I learnt this from various scattered sources and Hennessy/Patterson’s architecture book, but I’d see Matt Godbolt’s Jane Street Microarchitecture talk now, I wish this existed when I started. Covers almost everything you need to know. It’s about skylake so has some AVX512 stuff as well.

Data Parallelism and SIMD

Sergey Slotin, The Art of SIMD programming - Good primer if you don’t know what SIMD is. Sergey is also the author of https://en.algorithmica.org/hpc/ which is an amazing online resource on programming for performance.
OneLoneCoder, Intrinsic Functions - I like this because it shows the workflow and thought process for reasoning in SIMD, and how to write good SIMD programs. The fractal program he’s optimizing was written here (also a good video for general microarch stuff)
Intel Intrinsics Guide - Very handy when writing SIMD code.

There are more resources on this, some of which are linked in the wiki:

Daniel Lemire - SimdJSON author, lot of stuff on fast parsing
Wojciech Mula
John McCalpin
Agner Fog

Profiling and Benchmarking

https://easyperf.net/ - Denis Bakhvalov’s webpage. Has a bunch of articles on performance, an amazing free book and some exercises to practice
uICA - helpful for looking at code and finding long dependency chains, or for optimizing small snippets of assembly. Goes hand in hand with godbolt.

The Humble Timer

Michael Abrash’s Zen of Assembly Language introduces the Zen Timer in Chapter 2 with these words:

The Zen timer will be our primary tool for the remainder of The Zen of Assembly Language, so it’s essential that you learn what the Zen timer can do and how to use it.

I read this after I’d done most of the perf work on HighLoad, and realized the most used performance tool in my codebase was neither perf, nor chipmunk (the userspace PMU library I made). it was this 14-line timer:

class scoped_tsc_printer {
public:
    explicit scoped_tsc_printer(const std::string _name) noexcept
        : cycles(__rdtsc())
        , name(_name) { }
 
    ~scoped_tsc_printer() noexcept {
        LOG_INFO(name, ": ", __rdtsc() - cycles, " cyc");
    }
 
private:
    uint64_t cycles;
    std::string name;
};

You can find it in openload, I added the name print for a bit of clarity. This timer makes it incredibly easy to time code. All you have to do is this:

{
	scoped_tsc_printer tscp("code I want timed");
	
	// code here
}

and it will print the number of cycles taken according to your timestamp counter. To get seconds, you divide by the frequency, which for highload is approximately 3.6 GHz. It doesn’t work for timing things that take roughly the same time as the print itself, and you should definitely not use it in a loop. However, this is the best tool for finding out where you can cut fat. A profiler will only show bottlenecks with your current solution, but a timer will help you intuit a better solution.

Conclusion

Mastering these three concepts and solving every problem with them will break you into the top 5. Memory access optimizations are not as important because most of the time you’ll be striding linearly through memory. Other optimization ‘tricks’ are just problem solving, and have nothing to do with concepts. Most of the

Aniruddha Deb

Explorer