Keeping track of useful things that I’ve read or watched.
Useful in the sense of “I can use this for databases”.
Lord of the io_uring tutorial. Link.
- Key idea: avoid syscall and user/kernel copy overhead with shared ring buffers.
io_uring shares two ring buffers (submission queue SQ, completion queue CQ) between kernel and user space.
- Submissions can be batched before a notify (io_uring_enter); the kernel can poll the SQ directly to avoid submission overhead for high-frequency submissions.
- Completions are added to CQ by kernel (1:1 submissions:completions, tagged with identifier) in no particular guaranteed order.
- Most applications should just use liburing, but that requires kernel 5.6+. Ubuntu 20.04 LTS only ships 5.4.
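To make the SQ/CQ shape concrete, here is a toy single-threaded model of the idea. It is not the real io_uring ABI or the liburing API: Ring, ToyRing, Submission, Completion, and enter() are hypothetical names, and the "kernel work" is faked by doubling a value. It only shows batching several submissions behind one notify and matching completions back by tag.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical entry types; user_data is the tag that links a completion
// back to its submission.
struct Submission { std::uint64_t user_data; int value; };
struct Completion { std::uint64_t user_data; int result; };

// Fixed-size ring indexed by free-running head/tail counters.
template <typename T, std::size_t N>
class Ring {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& item) {
        if (full()) return false;
        slots_[tail_ & (N - 1)] = item;
        ++tail_;
        return true;
    }
    bool pop(T& out) {
        if (head_ == tail_) return false;  // empty
        out = slots_[head_ & (N - 1)];
        ++head_;
        return true;
    }
    std::size_t size() const { return tail_ - head_; }
    bool full() const { return size() == N; }
private:
    std::array<T, N> slots_{};
    std::size_t head_ = 0;
    std::size_t tail_ = 0;
};

struct ToyRing {
    Ring<Submission, 8> sq;
    Ring<Completion, 8> cq;
    int enter_calls = 0;  // counts simulated notifications

    // Stand-in for the notify step: one call drains every queued submission.
    // This toy completes in submission order; the real CQ guarantees no order.
    void enter() {
        ++enter_calls;
        Submission s;
        while (!cq.full() && sq.pop(s)) {
            cq.push(Completion{s.user_data, s.value * 2});
        }
    }
};
```

Pushing three submissions and calling enter() once leaves three tagged completions in cq, with enter_calls still at 1. That is the batching win in miniature.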
- Key idea: function, design, why, teacher, checklist, and guide comments are useful. Specifically, “what” comments are not necessarily harmful.
- Excellent example comments.
- Give a component a DescriptiveName. Use DescriptiveName to refer to DescriptiveName even if repetitive.
- Words to be banned or strongly reconsidered prior to use:
- Too much potential for ambiguity, allows for too much sloppiness in coding: this, that, it.
- Too much potential for run-on sentences: more than two commas, semicolon (make a new sentence instead), any sentence longer than two lines of comments.
- Just capitalize and punctuate every sentence. Much better looking.
2018 Nir Understanding Optimizers: Helping the Compiler Help You. Link.
- Compiler basic block = one entry, MUST run through all the code, one exit.
- Function inlining.
- Common perception: saves call overhead.
- Real superpower: increase basic block size.
- Constant propagation.
- Gotcha: rarely happens across function boundaries.
std::find_if won’t inline the predicate function at
- Pass by reference.
- Gotcha: it can be a global, e.g. const T &foo where T = bool.
- Now the compiler can’t optimize bools into a simple test.
- Extracting tests from loops.
- Specializing the loop calling function with
- Pay for binary size but little else in many use-cases.
- e.g., if only reading bools on startup.
std::variant generates some cursed assembly.
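A minimal sketch of the "extracting tests from loops" point above, assuming the specialization is done with a bool template parameter (the notes do not say which mechanism the talk used). The function names are made up for illustration.

```cpp
#include <vector>

// Generic version: the flag is tested on every iteration unless the
// optimizer manages to hoist the test out of the loop.
inline long sum_runtime_flag(const std::vector<int>& xs, bool negate) {
    long total = 0;
    for (int x : xs) {
        if (negate) total -= x; else total += x;
    }
    return total;
}

// Specialized version: Negate is a compile-time constant, so each
// instantiation can be compiled with the branch folded away.
template <bool Negate>
long sum_specialized(const std::vector<int>& xs) {
    long total = 0;
    for (int x : xs) {
        if (Negate) total -= x; else total += x;
    }
    return total;
}

// The runtime test happens once, outside the loop. The cost is two copies
// of the loop in the binary.
inline long sum_dispatch(const std::vector<int>& xs, bool negate) {
    return negate ? sum_specialized<true>(xs) : sum_specialized<false>(xs);
}
```

Whether this beats the generic version depends on the compiler and the loop body, so it is worth checking the generated assembly before committing to it.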
2019 Andrei Speed is Found in the Minds of People. Link.
- Quicksort trivia.
- 32 on VS, 16 on GNU, 30 on clang for trivial types and 6 otherwise.
- Know your metrics.
- Binary search has fewer comparisons than linear but takes longer.
- Linear search: one fail per search (less info), happy branch predictor.
- Binary search: one bit per search (more info), very sad branch predictor.
- Textbooks and research minimize C(n), but entropy per comparison matters more for performance.
- Sort algorithms are better measured with
- If performance matters, integrate tests with math.
right_idx = ... + (size & 1).
- Neat tricks.
- Unguarded insertion sort / presence of sentinel values for speed.
- Good mindset to have.
- But if you stare into the abyss..
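The unguarded insertion sort trick can be sketched as follows. This is my own minimal version on std::vector<int>, not code from the talk: the minimum is swapped to the front so it acts as a sentinel, which lets the inner loop drop its bounds check.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Guarded: the inner loop checks both the bound (j > 0) and the order.
inline void insertion_sort_guarded(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}

// Unguarded: a[0] holds the minimum, so the scan always stops by j == 1.
inline void insertion_sort_unguarded(std::vector<int>& a) {
    if (a.size() < 2) return;
    // Plant the sentinel. The swap makes the sort unstable, which does not
    // matter for plain ints.
    std::iter_swap(a.begin(), std::min_element(a.begin(), a.end()));
    // a[0] <= a[1] already holds, so start inserting from index 2.
    for (std::size_t i = 2; i < a.size(); ++i) {
        int key = a[i];
        std::size_t j = i;
        while (a[j - 1] > key) {  // no j > 0 test: a[0] <= key ends the loop
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}
```

The extra min_element pass costs n - 1 comparisons up front, and in exchange every inner-loop iteration has one branch instead of two.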
Beej : Socket programming guide. Link.
L5RDMA : Sockets suck. Paper, Slides.
- Key idea: Networking has become a performance bottleneck; solve with RDMA/RoCE/shared memory. The paper does a cool bootstrap to the best available network tech.
- Silo: 58k txn/s single-threaded (embedded), 1.5k (TCP), 2.7k (domain socket). Networking and IPC are huge bottlenecks!
- Corollary: when investigating claims of high performance we should check: fsync frequency, embedded vs over network.
sled : Great benchmarking overview and experiment checklist. Link.
- P state and turbo boost, see sysbench.
- Drop the pagecache.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
- Compact system-wide memory.
echo 1 | sudo tee /proc/sys/vm/compact_memory
speice : Great list of resources that emphasizes variance over mean. Link.
- This is one of the best “list of links” resources that I have read.
- “Focus first on reducing performance variance. Only look at average latency once variance is at an acceptable level.”
- The rest of this blog is pretty high quality too.
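To act on the variance-first advice, it helps to report spread next to the mean. A small sketch, assuming latency samples are already collected as doubles; LatencySummary, percentile, and summarize are hypothetical helpers, and the percentile uses the simple nearest-rank definition.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct LatencySummary {
    double mean;
    double stddev;  // population standard deviation
    double p50;
    double p99;
};

// Nearest-rank percentile over an already sorted, non-empty vector, q in [0, 1].
inline double percentile(const std::vector<double>& sorted, double q) {
    std::size_t rank = static_cast<std::size_t>(std::ceil(q * sorted.size()));
    if (rank == 0) rank = 1;
    if (rank > sorted.size()) rank = sorted.size();
    return sorted[rank - 1];
}

// Takes the samples by value so the caller's order is left untouched.
inline LatencySummary summarize(std::vector<double> samples) {
    LatencySummary s{0.0, 0.0, 0.0, 0.0};
    if (samples.empty()) return s;
    std::sort(samples.begin(), samples.end());

    double sum = 0.0;
    for (double x : samples) sum += x;
    s.mean = sum / samples.size();

    double squared = 0.0;
    for (double x : samples) squared += (x - s.mean) * (x - s.mean);
    s.stddev = std::sqrt(squared / samples.size());

    s.p50 = percentile(samples, 0.50);
    s.p99 = percentile(samples, 0.99);
    return s;
}
```

A single slow outlier barely moves p50 but dominates the mean, stddev, and p99, which is why the mean alone hides exactly the behavior worth fixing first.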
sysbench : Machine tuning tips from sysbench maintainer. Link.
- Key idea: “If you are not seeing stable results in your performance comparisons, you are wasting your time”.
- CPU frequency:
- Disable higher P states.
echo performance | sudo tee /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor
- Disable higher C states.
(echo 0; cat) > /dev/cpu_dma_latency &
- Disable turbo boost.
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
- Scheduler: it depends.
- Leave it at CFS.
- Disable autogroup.
- Raise minimal granularity.
- Relevant Postgres discussion, also suggests tuning
- ASLR: Disable. (security risk!)
- NUMA: Disable autobalance.
- Swap: Minimize.
- THP: Disable.
echo never > /sys/kernel/mm/transparent_hugepage/enabled and
echo never > /sys/kernel/mm/transparent_hugepage/defrag.
- Memory allocator: Keep consistent version between benchmarks.
- Spectre/meltdown: Keep mitigations similar between benchmarks.