| Task | #3 |
|---|---|
| Position | 2 |
| Score | 39,606 |
| Best Score | 38,681 |
| Points | 0.977 |
Problem Statement
You have a data stream of records on STDIN. Each record is encoded in JSON. The structure is:
```json
{
  "user_id": 0,       // Max id is 10 000
  "currency": "",     // One of "GBP", "USD", "RUB", "JPY", "CHF"
  "transactions": [   // Max count is 10
    {
      "amount": 0,        // Max value is 1000
      "to_user_id": 0,
      "canceled": false   // May be omitted when false
    }
  ]
}
```
The field order is not guaranteed.
Find the total amount of external transactions (record.user_id != record.transaction.to_user_id) in USD in the shortest time.
Input Generator
The output of probe.cpp is:
```text
Amount range: [0,999]
User id range: [0,9999]
Txn user id range: [0,9999]
Num records: 1000000
Num USD records: 199503
Total Transactions: 4496378
  External: 4495932
  Explicitly cancelled: 2473701
  Explicitly cancelled to true: 450056
Total USD transactions: 898189
  External: 898104
Txn len counts:
  0: 100309
  1: 99810
  2: 100319
  3: 100283
  4: 99994
  5: 99769
  6: 99995
  7: 99901
  8: 99597
  9: 100023
  10: 0
```
So:
- USD records make up ~20% of records
- Most USD transactions are external
- Transaction counts per record are uniformly distributed between 0 and 9 (despite the stated max of 10, a count of 10 never occurs)
Solution Sketch
Like most HighLoad tasks, this is not a JSON parsing task. SimdJSON gets a score of ~250k, so you need to find a way to eliminate most of the input without parsing it.
The key trick is to notice that USD records can be uniquely identified by the S character: of the five currency codes, only "USD" contains an S, and nothing else in a record (the keys, the digits, true/false) contains one either. So every S in the input maps uniquely to one USD record, which lets us eliminate 80% of the input.
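A minimal sketch of that filter, assuming the whole input sits in one buffer; `handle_usd_record` is a hypothetical placeholder for the per-record parser:

```cpp
#include <cstddef>
#include <cstring>

void handle_usd_record(const char* buf, const char* s); // hypothetical record parser

// Scan the buffer for 'S'. Every hit is the middle byte of a "USD" literal,
// so it identifies exactly one USD record; everything between hits is
// skipped without being parsed at all.
void scan(const char* buf, size_t len) {
    const char* p = buf;
    const char* end = buf + len;
    while ((p = static_cast<const char*>(memchr(p, 'S', size_t(end - p))))) {
        handle_usd_record(buf, p); // p points at the 'S' inside "USD"
        ++p;
    }
}
```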
Once you find an S, there are six cases:
1. {"currency": "USD", "user_id": 123, "transactions": [...]}
2. {"currency": "USD", "transactions": [...], "user_id": 123}
3. {"user_id": 123, "currency": "USD", "transactions": [...]}
4. {"transactions": [...], "currency": "USD", "user_id": 123}
5. {"user_id": 123, "transactions": [...], "currency": "USD"}
6. {"transactions": [...], "user_id": 123, "currency": "USD"}
Depending on where you encounter the S, you need to parse the transactions either forward or backward. Special-case the fast paths (e.g. empty transaction lists), and keep two separate code paths for the forward and backward cases; a dispatch sketch follows.
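A hypothetical sketch of the dispatch, assuming the exact spacing shown in the six cases above; `parse_backward` and `parse_mixed` stand in for the two real code paths:

```cpp
void parse_backward(const char* buf, const char* s); // everything precedes "USD"
void parse_mixed(const char* buf, const char* s);    // at least one field follows

// s points at the 'S' inside "USD": s[1] is 'D', s[2] is the closing quote,
// and s[3] is either '}' (currency was the last field, cases 5-6) or ','
// (more fields ahead, cases 1-4).
inline void handle_usd_record(const char* buf, const char* s) {
    if (s[3] == '}')
        parse_backward(buf, s); // user_id and transactions are both behind us
    else
        parse_mixed(buf, s);    // parse forward, backtracking for what's behind
}
```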
For transactions, I used a switch with jumps and didn’t hardcode the six cases. Maybe you can do even more casework and build a lookup table for parsing: there are <100 possible lengths of a record, and maybe it’s possible to encode some extraction logic in each LUT cell. My hypothesis is it’ll be as fast as, or slower than, pure record parsing.
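The shape of that LUT idea might look like the following; this is entirely hypothetical (the author didn’t build it), with every cell pointing at a stub so the sketch stays runnable:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical LUT dispatch: with fewer than 100 distinct record lengths,
// pre-specialize one extractor per length and dispatch on length alone.
using Extractor = void (*)(const char* rec, size_t len);

void generic_extract(const char* rec, size_t len) {
    std::printf("fallback parse of %zu-byte record\n", len);
}

Extractor lut[128];

int main() {
    // Real cells would hold generated, layout-specific extractors.
    for (auto& cell : lut) cell = generic_extract;
    const char rec[] = "{\"currency\": \"USD\", \"user_id\": 1, \"transactions\": []}";
    size_t len = sizeof(rec) - 1;
    lut[len](rec, len);
}
```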
Push the parsed user IDs, amounts, and canceled flags into arrays. Once you’ve processed the entire record and extracted the user ID, use SIMD with some masking to isolate the amounts we need to add. Since the amounts are small enough, you can accumulate them in parallel using maddubs, with the multiplier being 1 if the amount is valid and 0 if it is invalid (canceled or internal).
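A minimal sketch of the masked accumulation. For simplicity it uses pmaddwd (`_mm256_madd_epi16`) on 16-bit amounts rather than the byte-wise maddubs described above, and all names are illustrative:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sum amounts[i] where valid[i] == 1, zeroing the rest via the multiply.
// madd multiplies 16 lanes pairwise and adds adjacent products into 8
// 32-bit accumulator lanes; totals here stay well under INT32_MAX.
int64_t sum_valid(const int16_t* amounts, const int16_t* valid, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i a = _mm256_loadu_si256((const __m256i*)(amounts + i));
        __m256i m = _mm256_loadu_si256((const __m256i*)(valid + i));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(a, m));
    }
    alignas(32) int32_t lanes[8];
    _mm256_store_si256((__m256i*)lanes, acc);
    int64_t sum = 0;
    for (int k = 0; k < 8; ++k) sum += lanes[k];
    for (; i < n; ++i) sum += amounts[i] * valid[i]; // scalar tail
    return sum;
}
```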
Thoughts
Like most parsing state machines, this was also a pain to debug, and not something I want to touch again. There are so many edge cases everywhere.
Some tips from writing state machines: I use `consteval int cslen(std::string_view s) { return s.size(); }` to avoid magic numbers when doing pointer jumps while parsing. `i += cslen(" }")` is more understandable than `i += 2`, and helps immensely when debugging.
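A self-contained illustration (the record snippet is made up):

```cpp
#include <cstdio>
#include <string_view>

// Compile-time length of a string literal, so each pointer jump spells out
// exactly the text it skips instead of a bare offset.
consteval int cslen(std::string_view s) { return static_cast<int>(s.size()); }

int main() {
    const char rec[] = "\"amount\": 123, ...";
    const char* p = rec;
    p += cslen("\"amount\": "); // self-documenting, unlike p += 10
    std::printf("%.3s\n", p);   // prints 123
}
```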
If you don’t want to do this crazy optimization, just write a tight parser with a bunch of labels and gotos, as in the toy skeleton below. It’s good enough for ~60k.
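A toy skeleton of that style, obviously simplified: one label per state, transitions as jumps, and only integer extraction instead of the real field logic:

```cpp
#include <cstdio>

int main() {
    const char* p = "{\"amount\": 123, \"to_user_id\": 45}";
scan:                                         // state: looking for a number
    if (*p == '\0') return 0;
    if (*p < '0' || *p > '9') { ++p; goto scan; }
    {                                         // state: consuming a digit run
        unsigned value = 0;
        while (*p >= '0' && *p <= '9') value = value * 10 + unsigned(*p++ - '0');
        std::printf("%u\n", value);           // prints 123, then 45
    }
    goto scan;
}
```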