| Task | #9 |
|---|---|
| Position | 6 |
| Score | 1,588,324 |
| Best Score | 1,278,416 |
| Points | 0.805 |
Problem Statement
You have a data stream of UUIDs as strings on STDIN. Each line of the stream contains one UUID. For example:
01c39d6f-a905-41b5-b960-55f5be445bf8
790455cb-9813-4298-87ec-9ede1dc38e10
8ff38a09-f995-467b-9d07-5536f621402e
...
There are 18,000,000 UUIDs. Try to sort them in the shortest time possible.
Input Generator
The UUIDs are UUIDv4, so they are uniformly randomly generated.
Solution Sketch
Sorting the string UUIDs themselves is expensive, but a UUID can be compressed to 128 bits (__m128i) using a few SIMD instructions. There should be a reference implementation in the repo. When writing the output, we format the UUIDs from __m128i back to strings.
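For illustration, here is a scalar sketch of the pack/unpack step (the fast version would do the hex decoding with SIMD, but the byte layout is the same; the helper names are mine, not from the reference implementation):

```cpp
// Scalar sketch of UUID string <-> 16-byte key conversion. Bytes are stored in
// string order, so memcmp on the keys matches lexicographic order of the strings.
#include <cstdint>
#include <cstdio>

// Decode one hex digit ('0'-'9', 'a'-'f', 'A'-'F').
static inline uint8_t hex_val(char c) {
    return (c <= '9') ? c - '0' : (c | 0x20) - 'a' + 10;
}

// "01c39d6f-a905-41b5-b960-55f5be445bf8" -> 16 bytes.
static void parse_uuid(const char *s, uint8_t out[16]) {
    int j = 0;
    for (int i = 0; i < 36; ) {
        if (s[i] == '-') { ++i; continue; }            // skip the 4 dashes
        out[j++] = (uint8_t)((hex_val(s[i]) << 4) | hex_val(s[i + 1]));
        i += 2;
    }
}

// 16 bytes -> "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" (plus a trailing NUL).
static void format_uuid(const uint8_t in[16], char out[37]) {
    static const char *hex = "0123456789abcdef";
    int pos = 0;
    for (int i = 0; i < 16; ++i) {
        if (i == 4 || i == 6 || i == 8 || i == 10) out[pos++] = '-';
        out[pos++] = hex[in[i] >> 4];
        out[pos++] = hex[in[i] & 0xF];
    }
    out[pos] = '\0';
}

int main() {
    const char *s = "01c39d6f-a905-41b5-b960-55f5be445bf8";
    uint8_t key[16];
    char back[37];
    parse_uuid(s, key);
    format_uuid(key, back);
    std::puts(back);   // prints the original UUID back
}
```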
After the UUIDs are loaded in their 128-bit representation, partition the data and sort it in two ska_sort passes. Use mlock/munlock and madvise(..., MADV_WILLNEED/MADV_DONTNEED) before/after each sorting pass to control which pages are resident: pin the chunk you are actively sorting, and drop pages you no longer need. The whole set of UUIDs takes ~288 MB in binary form, which does not fit within the memory limit. I parsed the first half of the input and sorted it, wrote the second half of that sorted run (i.e. the second quarter of the data) to disk, then parsed and sorted the second half of the input and kept it in memory.
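A sketch of what the page management around one sorting pass can look like. std::sort on unsigned __int128 stands in for ska_sort so the example compiles on its own, the keys are assumed to already be in a compare-friendly byte order, and the buffer is assumed to be page-aligned (e.g. from mmap):

```cpp
// Page management around one sorting pass: prefetch and pin the active chunk,
// sort it, then release it so it does not crowd out the data we still need.
#include <algorithm>
#include <cstddef>
#include <sys/mman.h>

using u128 = unsigned __int128;   // same size as the __m128i representation

void sort_chunk(u128 *buf, size_t n) {
    const size_t bytes = n * sizeof(u128);

    // Ask the kernel to fault the pages in ahead of time, then pin them so the
    // chunk we are actively sorting cannot be swapped out underneath us.
    madvise(buf, bytes, MADV_WILLNEED);
    mlock(buf, bytes);

    std::sort(buf, buf + n);      // ska_sort(buf, buf + n) in the real solution

    // ... write the sorted chunk to disk here if it is not staying resident ...

    // Unpin and tell the kernel it can drop these pages immediately. Note that
    // MADV_DONTNEED discards anonymous pages, so only do this after saving.
    munlock(buf, bytes);
    madvise(buf, bytes, MADV_DONTNEED);
}
```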
The rationale behind the memory dance above is to reduce I/O between disk and memory. That means keeping as much useful data in memory as possible at any moment, and only reading/writing when necessary. Also, do NOT let the OS swap your pages. Swapping is an order of magnitude worse than managing memory yourself: the OS only has heuristics (Linux uses MG-LRU for swapping) and does not understand the problem structure, so you pay a lot of cost for letting it decide. The nature of disk I/O also favours a few large, chunked reads/writes over many small single-page reads/writes.
Once this is done, do a two-way merge. Be aggressive with page management; I did some manual paging in/out with the functions above to keep the resident memory manageable.
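A sketch of such a two-way merge, with one sorted run streaming from disk in large chunks and the other held in memory; the names, the 8 MB chunk size, and the emit callback (which would format each key back to a string and write it out) are illustrative, not taken from the actual submission:

```cpp
// Two-way merge of a sorted on-disk run with a sorted in-memory run.
// The disk side is refilled in large chunks whenever its buffer runs dry.
#include <cstddef>
#include <cstdio>
#include <vector>

using u128 = unsigned __int128;   // 16-byte binary UUID key

void merge_runs(const char *disk_run_path, const std::vector<u128> &mem_run,
                void (*emit)(u128)) {
    FILE *f = std::fopen(disk_run_path, "rb");
    if (!f) return;

    std::vector<u128> chunk(1 << 19);                  // 8 MB of keys per read
    size_t avail = std::fread(chunk.data(), sizeof(u128), chunk.size(), f);
    size_t di = 0, mi = 0;

    // Standard merge loop while both runs still have elements.
    while (avail > 0 && mi < mem_run.size()) {
        if (chunk[di] <= mem_run[mi]) {
            emit(chunk[di++]);
            if (di == avail) {                         // refill the disk buffer
                avail = std::fread(chunk.data(), sizeof(u128), chunk.size(), f);
                di = 0;
            }
        } else {
            emit(mem_run[mi++]);
        }
    }
    // Drain whichever run is left over.
    while (avail > 0) {
        emit(chunk[di++]);
        if (di == avail) {
            avail = std::fread(chunk.data(), sizeof(u128), chunk.size(), f);
            di = 0;
        }
    }
    while (mi < mem_run.size()) emit(mem_run[mi++]);

    std::fclose(f);
}
```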
Since I/O goes through disk, use a buffered reader/writer with a large write buffer (I used 8 MB). You can find fast buffered writers online or write your own; the writer itself matters far less than the buffering. Buffered writes made my solution about 1.5x faster.
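A minimal buffered writer along these lines (8 MB buffer flushed with write(2)); a sketch, not the exact writer used in the solution:

```cpp
// Accumulate output in an 8 MB buffer and flush it with a single write(2)
// call, so the kernel sees a few large writes instead of millions of tiny ones.
#include <cstddef>
#include <cstring>
#include <unistd.h>
#include <vector>

struct BufferedWriter {
    static constexpr size_t CAP = 8u << 20;    // 8 MB buffer
    std::vector<char> buf;
    size_t len = 0;
    int fd;

    explicit BufferedWriter(int fd) : buf(CAP), fd(fd) {}
    ~BufferedWriter() { flush(); }

    // Append n bytes (assumes n <= CAP, true for one 37-byte UUID line).
    void write_bytes(const char *p, size_t n) {
        if (len + n > CAP) flush();
        std::memcpy(buf.data() + len, p, n);
        len += n;
    }

    void flush() {
        size_t off = 0;
        while (off < len) {
            ssize_t w = ::write(fd, buf.data() + off, len - off);
            if (w <= 0) break;                 // real code should handle errors
            off += static_cast<size_t>(w);
        }
        len = 0;
    }
};

// Usage: BufferedWriter out(STDOUT_FILENO); out.write_bytes(line, 37);
```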
Thoughts
This entire problem is a disk I/O optimization problem, which is awesome, because HighLoad needs more problems like this. Most real-world workloads (e.g. databases, training datasets) still revolve around optimizing disk I/O and reading/writing huge datasets from disk; performance engineering is not just about improving your cache hit rate and filling your execution slots.