Presentation: Designing for Performance

Scala provides a wide variety of productivity- and correctness-enhancing features, but some of them come at a performance cost. I will discuss how to design Scala applications to take maximal advantage of Scala's best features while still achieving Java-like performance, or better, since you can spend your time thinking and optimizing instead of writing boilerplate. Topics will include how and when to write microbenchmarks, the intrinsic speed of various common library routines, the pitfalls of the first-get-it-working-then-run-the-profiler mode of optimization when performance really matters (and how to let the profiler help you), and patterns to favor when you know, a priori or via profiling or benchmarking, that performance is critical.

Speakers

Rex Kerr (HHMI Janelia Farm Research Campus)

PDF: slides.pdf

Slides

Designing for Performance

Rex Kerr, HHMI Janelia Farm Research Campus. Scala Days 2013.

Designing for Performance

- Some conventional wisdom, and when to be unconventional
- Getting the big picture from profiling, and why you can't count on getting the small picture
- Getting the small picture from microbenchmarking, and why you can't count on getting the big picture
- Timings: the small picture from which performance is built
- Design guidelines for high-performance code

Outline

1. Conventional wisdom: Three things you may have heard that may not be entirely true
2. Profiling the Big Picture: What you should and should not expect from your profiler
3. Microbenchmarking the Small Picture: How to write an actionable microbenchmark
4. Timings: Building intuition about how long things take
5. Strategic Summary: Suggestions for performance-aware design

Wisdom #1

"Premature optimization is the root of all evil." (Donald Knuth)

What is premature and what is mature? "Premature optimization for speed is the root of all evil in Formula One racing." (?!)

"In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal, and I believe the same viewpoint should prevail in software engineering." (Donald Knuth)

Optimize wisely: know when and how; don't waste your time.

Wisdom #2

"Design first, code from the design, and then profile/benchmark the resulting code to see which parts should be optimized." (Wikipedia article on Program Optimization)

For this to be good advice, it assumes:
- Profiling will show you which parts are slow
- Code is modular: for any slow X, you can rewrite it as Xfast

But neither of these is consistently true. "Design your Formula One race car first, and then test the resulting vehicle to see which parts can be modified to make the car faster." (?!?!)

Anticipate performance problems and design to admit optimization, or build the performance-critical core first.

Wisdom #3

"The bottleneck isn't where you think it is." "Even experienced programmers are very poor at predicting (guessing) where a computation will bog down." (Various people on various blogs, etc.)

Predicting performance problems is a skill. Like most skills, it can be learned. You can't learn from nothing. You need data.

What is the bottleneck in this code?

    object ProfEx1 {
      val dict = Seq(
        "salmon", "cod", "grouper", "bass", "herring",
        "eel", "trout", "perch", "halibut", "dorado"
      )
      def permuted = dict.permutations.map(_.mkString).to[Vector]
      def scanAll(sought: Seq[String]) = {
        def scan(s: String) = sought.exists(s contains _)
        permuted.filter(scan)
      }
      def report(sought: Seq[String], scanned: Seq[String]) = sought map { word =>
        scanned find (_ contains word) match {
          case Some(s) => s"found $word in $s"
          case None    => s"could not find $word"
        }
      }
      def printOut(lines: Seq[String]) = lines.foreach(println)
      def main(args: Array[String]) {
        val answer = report(args, scanAll(args))
        printOut(answer)
      }
    }

Wait, maybe it's not even too slow.

    $ time scala -J-Xmx1G ProfEx1 snakes say sss
    could not find snakes
    could not find say
    found sss in codgrouperbasssalmonherringeeltroutperchhalibutdorado

    real    0m5.861s
    user    0m10.790s
    sys     0m1.080s

Okay, that's slow. We need a profiler?

The basics: what is a profiler?

- Provides information about time spent in various parts of the code; may also track memory usage, class loads, and other parameters
- Broken roughly into two categories: instrumenting and sampling
  - Instrumenting profilers rewrite the bytecode so that running a method will report information about it (e.g. number of times called, duration inside, etc.)
  - Sampling profilers take snapshots of the stacks for each thread to infer where the most time is spent
- The Oracle JVM has one built in (-Xrunhprof; sampling example below) and one external (VisualVM)
- IDEs may include one (e.g. Eclipse, NetBeans)
- Commercial profilers may have superior features (e.g. YourKit)

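For example, the built-in agent can sample the earlier program from the command line; the interval and depth values here are illustrative assumptions of mine, not the talk's exact flags (results are written to java.hprof.txt when the JVM exits):

    $ scala -J-Xmx1G -J-Xrunhprof:cpu=samples,interval=10,depth=10 ProfEx1 snakes say sss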

Instrumentation everywhere is terrible

Slow: extensive instrumentation greatly slows runtime.

    $ time scala -J-Xmx1G -J-Xrunhprof:cpu=times ProfEx1 snakes say sss
    could not find snakes
    could not find say
    found sss in codgrouperbasssalmonherringeeltroutperchhalibutdorado
    Dumping CPU usage by timing methods ... done.

    real    137m36.535s
    user    138m24.740s
    sys     0m4.170s

Rather inaccurate: the JVM makes all sorts of different decisions about inlining, etc., when the bytecode is radically changed.

Instrumenting profilers will not reliably tell you where your bottlenecks are, and may not be deployable in the relevant context.

Sampling is worse than you think

The JVM will not sample just anywhere! It selects safe locations for you.

See "Evaluating the Accuracy of Java Profilers" by Todd Mytkowicz, Amer Diwan (University of Colorado at Boulder), Matthias Hauswirth (University of Lugano), and Peter F. Sweeney (IBM Research).

[Figure 1 of that paper: the four popular profilers hprof, jprofile, xprof, and yourkit disagree about the hottest method (JavaParser.jj_scan_token, NodeIterator.getPositionFromParent, or DefaultNameStep.evaluate) for the benchmark pmd, as a percent of overall execution.]

See also http://jeremymanson.blogspot.com/2010/07/why-many-profilers-have-serious.html

Profiling our example

By method (hprof output; run took 6.5 s):

    rank  self    accum   count  trace   method
    1     21.74%  21.74%  75     300555  scala.collection.mutable.StringBuilder.append
    2     17.10%  38.84%  59     300582  java.lang.String.indexOf
    3     10.14%  48.99%  35     300560  scala.collection.mutable.ArrayBuffer.foreach
    4      4.06%  53.04%  14     300568  scala.collection.mutable.StringBuilder.append
    5      3.19%  56.23%  11     300551  scala.collection.immutable.VectorPointer$class.gotoNextBlockStartWritable
    6      3.19%  59.42%  11     300565  scala.collection.Iterator$$anon$11.next
    7      2.61%  62.03%   9     300562  scala.collection.mutable.ArrayBuffer.$plus$plus$eq
    8      2.61%  64.64%   9     300586  scala.collection.IndexedSeqOptimized$class.segmentLength
    9      2.32%  66.96%   8     300564  scala.collection.TraversableOnce$class.mkString
    10     2.03%  68.99%   7     300559  scala.collection.mutable.StringBuilder.append

By line (analysis of hprof output):

    64%  #6  def permuted = dict.permutations.map(_.mkString).to[Vector]
    22%  #8  permuted.filter(scan)
     7%  (all the rest put together)
     7%  ??  (startup, etc.)

Conclusion: making permuted strings is slow.

Checking profiler accuracy with direct timing

    object Ex1Time {
      val th = new ichi.bench.Thyme
      val dict = Seq(
        "salmon", "cod", "grouper", "bass", "herring",
        "eel", "trout", "perch", "halibut", "dorado"
      )
      def permuted = th.ptime{ dict.permutations.map(_.mkString).to[Vector] }
      def scanAll(sought: Seq[String]) = {
        def scan(s: String) = sought.exists(s contains _)
        val p = permuted
        th.ptime{ p.filter(scan) }
      }
      def report(sought: Seq[String], scanned: Seq[String]) = th.ptime{
        sought map { word =>
          scanned find (_ contains word) match {
            case Some(s) => s"found $word in $s"
            case None    => s"could not find $word"
          }
        }
      }
      def printOut(lines: Seq[String]) = th.ptime{ lines.foreach(println) }
      def main(args: Array[String]) {
        val answer = report(args, scanAll(args))
        printOut(answer)
      }
    }

Checking profiler accuracy, cont.

    $ time scala -cp /home/kerrr/code/scala/github/Thyme/Thyme.jar:. -J-Xmx1G Ex1Time snakes say sss
    // permuted
    Elapsed time: ~1.835 s (inaccurate)
    Garbage collection (36 sweeps) took: 2.628 s
    Total time: 4.463 s
    // p.filter(scan)
    Elapsed time: ~983. ms (inaccurate)
    Garbage collection (1 sweeps) took: 12. ms
    Total time: 995.0 ms
    // Everything else: < 100 ms

    real    0m6.070s
    user    0m12.270s
    sys     0m0.790s

Close... I guess... 75% / 16% (timed) vs. 64% / 22% (profiled).

Profiling bottom line: use it, don't trust it

Profiling is good for:
- Long-running processes
- Finding unexpected blocks in multithreaded applications
- Getting a general sense of which methods are expensive

Profiling is not good for:
- Identifying the hottest method
- Identifying anything inlined
- Quantitatively assessing modest speed improvements

If you need speed, design for speed. Use the profiler to catch surprises.

Microbenchmarking seems almost impossible

The JVM/JIT compiler decides:
- whether to compile your code (100x speed difference)
- how much to inline
- whether it can elide multiple dispatch, branching, bounds-checking, etc.

You can't measure anything fast due to poor timing utilities. The context of a microbenchmark is surely different from that of production code: different GC load, JIT decisions, and pattern of use.

The gold-standard tools (Google Caliper (Java), ScalaMeter (Scala), Criterium (Clojure), etc.) take a nontrivial investment of time to use:
- Not always easy to get working at all
- Require non-negligible infrastructure to run anything as a benchmark
- Do all sorts of things with class loaders and loading whole JVMs that take a while to complete (seconds to minutes)

Microbenchmarking usually works anyway

Most of the time:
- The hottest code is JITted anyway
- The hottest code is called a lot, so it's fair to batch calls in a loop (see the sketch after this list)
- If foo is faster than bar in some context, it is faster in most/all
- You can monitor or control for variability from GC, class loading, etc. by using JVM monitoring tools and robust statistics

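The batching idea can be sketched by hand with System.nanoTime (my sketch, not the talk's; it assumes the JIT has already been warmed, e.g. by running the whole thing once and discarding the result):

    // Batch `reps` calls of `f` in a loop and keep the last result alive
    // so the JIT cannot elide the work being measured.
    def crudeNsPerCall[A](reps: Int)(f: => A): Double = {
      var sink = f                       // initial call; also primes sink
      val t0 = System.nanoTime
      var i = 0
      while (i < reps) { sink = f; i += 1 }
      val t1 = System.nanoTime
      if (sink.## == t0) println(sink)   // consume sink: cheap, almost never true
      (t1 - t0).toDouble / reps
    }

    // e.g. crudeNsPerCall(1000000){ (1 to 100).sum }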

Avoid the common pitfalls of microbenchmarking

Read "So You Want to Write a Micro-Benchmark" by John Rose and the linked paper by Brian Goetz: https://wikis.oracle.com/display/HotSpotInternals/MicroBenchmarks

Be aware of the top reasons why apparently correct microbenchmarks fail, including:
- Real code requires multiple dispatch, test is single
- Real code runs with heavily impacted GC, test is not
- Real code uses results of computation, test does not
- Real code isn't even CPU bound, test is (ask a profiler!)

Use a benchmarking tool to get the details right. If you don't like the others, try Thyme; it's lightweight and fast: https://github.com/Ichoran/thyme (see the example below)

Just because a pattern is slow, it does not follow that this is why your code is slow: impact = time × calls.

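Using only the Thyme calls that appear elsewhere in this talk (Warm and pbenchOffWarm), a head-to-head comparison looks roughly like this; the two expressions being compared are placeholders of mine:

    val th = new ichi.bench.Thyme
    // Warm each candidate so JIT compilation happens before measurement
    val rangeSum = th.Warm{ (1 to 1000).sum }
    val whileSum = th.Warm{ var s = 0; var i = 1; while (i <= 1000) { s += i; i += 1 }; s }
    // Prints timings, confidence intervals, and a significance test
    th.pbenchOffWarm()(rangeSum, wtitle = "range sum")(whileSum, vtitle = "while loop")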

Can microbenchmarking speed up our example?

StringBuilder.append was hot. Can we do it faster with char arrays?

    object BenchEx1 {
      val dict = Seq(
        "salmon", "cod", "grouper", "bass", "herring",
        "eel", "trout", "perch", "halibut", "dorado"
      )
      val cdict = dict.map(_.toCharArray).toArray
      val n = cdict.map(_.length).sum
      def main(args: Array[String]) {
        val th = new ichi.bench.Thyme
        val a = th.Warm{ dict.mkString }
        val b = th.Warm{
          val c = new Array[Char](n)
          var i, j = 0
          while (i < cdict.length) {
            System.arraycopy(cdict(i), 0, c, j, cdict(i).length)
            j += cdict(i).length
            i += 1
          }
          new String(c)
        }
        th.pbenchOffWarm()(a, wtitle="mkString")(b, vtitle="charcat")
      }
    }

Microbenchmark + profiler was actionable

    $ scala -cp /jvm/Ichi.jar:. BenchEx1.scala
    Benchmark comparison (in 4.145 s) mkString vs charcat
    Significantly different (p ~= 0)
      Time ratio:  0.50403   95% CI 0.50107 - 0.50700   (n=20)
        mkString   250.5 ns   95% CI 249.5 ns - 251.6 ns
        charcat    126.3 ns   95% CI 125.7 ns - 126.8 ns

Char arrays are almost twice as fast in a microbenchmark. Don't believe it! Does it hold in real code?

Best of five runs:

    Original         0m6.400s
    With char array  0m5.537s

Timing of the permuted method (best of 5):

                     Runtime   GC
    Original         1.779 s   2.343 s
    With char array  1.315 s   1.875 s

A word about timing methodology

- All timings are from warmed Thyme microbenchmarks
- Timings may have been subtracted from each other
- Assumes the GC system is not heavily taxed

These are guidelines, not truths. If it's essential, measure in your context (architecture, JVM, etc.).

Boxing

[Bar chart, nanoseconds per operation (log scale, 1-10000 ns): array creation and array summation with ints vs. boxed ints; object method vs. boxed object method; object method vs. structural type; method vs. implicit-class and value-class enriched methods; turtles: mutable, copy, f: T => T, Shapeless lens.]
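To make the top bars concrete, a sketch (mine, not the talk's) of summing a primitive Array[Int] versus an Array[java.lang.Integer]; each boxed element is a separate heap object that must be dereferenced and unboxed on every read:

    val prim: Array[Int] = Array.tabulate(1000)(i => i)                              // ints stored inline
    val boxed: Array[java.lang.Integer] = Array.tabulate(1000)(i => Integer.valueOf(i))  // one object per element

    def sumPrim  = { var s = 0; var i = 0; while (i < prim.length)  { s += prim(i);           i += 1 }; s }
    def sumBoxed = { var s = 0; var i = 0; while (i < boxed.length) { s += boxed(i).intValue; i += 1 }; s }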

Control flow

[Bar chart, nanoseconds per operation (log scale): while loop, tail recursion, while loop with anon function, for (range), Iterator; branching with &, with indicator, with if-else, with match; simple loop vs. manually unrolled loop-in-a-loop; inner while loop, inner loop-with-return, inner tailrec, inner for; local stackless preallocated exception, new control-flow exception, new exception with stack.]
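A sketch (not from the talk) of three of the charted loop styles computing the same sum, suitable for feeding to a benchmarking tool:

    def sumWhile(n: Int): Int = {
      var s = 0; var i = 0
      while (i < n) { s += i; i += 1 }   // the chart's cheapest construct
      s
    }

    @annotation.tailrec
    def sumRec(i: Int, n: Int, s: Int): Int =   // tail recursion compiles to a loop
      if (i >= n) s else sumRec(i + 1, n, s + i)

    def sumIter(n: Int): Int = Iterator.range(0, n).sum   // pays iterator overhead per element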

Inheritance

[Bar chart, nanoseconds per operation (log scale): just code; method call typed as implementing subclass vs. typed as superclass; multimorphic call sites (2 of 2, 2 of 4, 2 of 8, 4 of 4, 4 of 8, and 8 of 8 classes) dispatched via inheritance vs. via pattern match.]

Mathematics

[Bar chart, nanoseconds per operation (log scale): int loop with +, &, *, with / 3, with / x; double loop with +, *, with / 3.0, with / x; log(x), sin(x), pow(x, 0.3); BigInt +, *, / at 10 digits, at 100 & 50 digits, and at 1000 & 500 digits.]

Collections

[Bar chart, nanoseconds per operation (log scale), for List, Array, ArrayBuffer, Map, Set, Vector, and Range: while loop by index; foreach ("Traversable"); iterator while loop ("Iterable"); fold; map & sum; view map & sum; head-tail pattern match; best-of-class object method.]

Parallelization

[Bar chart, nanoseconds per operation (log scale): plain loop vs. loop with @volatile, atomic int, or synchronized test; loop with @volatile, atomic int, or synchronized update; loop within class; loop with read-and-set @volatile (unsafe!); loop with atomic int addAndGet (safe); loop with self-synchronization (safe); boxing; java.util.concurrent.Exchanger; scala.concurrent.Future; map and sum over a 2-element list vs. a 2-element parallel list.]
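To illustrate the safe/unsafe distinction in the chart (a sketch of mine): incrementing a @volatile field is a read-then-write race, while AtomicInteger.addAndGet and synchronized blocks are safe at different costs:

    import java.util.concurrent.atomic.AtomicInteger

    class Counters {
      @volatile var v = 0
      val a = new AtomicInteger(0)

      def unsafeBump(): Unit = { v += 1 }             // read-and-set: concurrent updates can be lost
      def atomicBump(): Int = a.addAndGet(1)          // safe: single atomic operation
      def syncBump(): Unit = synchronized { v += 1 }  // safe, but pays for the monitor
    }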

Step one: understand/define requirements

- Is performance a concern at all?
- Is performance in your control at all? (Is the slow part an external service?)
- Where does speed matter?
  - The visual system is happy with ~10-20 ms worst case
  - Anything interactive seems instant with latency of ≤100 ms
- Do you need to optimize for latency or throughput?

Step two: identify the likely bottlenecks

- What do you need to do a lot? That's probably where the bottleneck will be.
- Understand what "a lot" is: adding a million ints is not a lot compared to a single ping across a typical network.
- Ask: are you using the right algorithms?
- Isolate performance-critical pieces in a modular way
- Use parallelism with the correct amount of work (see the sketch below)
  - Overhead is considerable
  - Deciding how to split is (or may be) serial
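A rough illustration (mine, not the talk's) of "the correct amount of work", using the parallel collections of the era: parallelism pays only when per-element work covers the splitting and scheduling overhead:

    // Enough work per element: .par is likely to win on multiple cores
    val big = Array.tabulate(4000000)(i => i.toDouble)
    val heavy = big.par.map(x => math.sqrt(math.sin(x) + 2)).sum

    // Trivial work on a tiny collection: overhead dominates; stay sequential
    val tiny = Array(1.0, 2.0)
    val light = tiny.map(_ + 1).sum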

Step three: measure performance early and often

- Set up so performance measurements are painless
- Only fix things immediately if performance is alarmingly bad and might require a complete redesign
  - A system that does not work has zero performance
- Use the REPL to microbenchmark bits of code (see the session sketch below)
  - (You are already testing/building bits of code in the REPL, right?)
- Don't waste time measuring things that clearly don't matter
  - (Your measurements will tell you what doesn't matter, right?)
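In the REPL this can be as light as the Thyme calls already used in this talk; the timed expression below is just a placeholder:

    scala> val th = new ichi.bench.Thyme
    scala> th.ptime{ (1 to 100000).map(_.toString).mkString(",").length }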

Step four: refine the working system

- Catch surprises with a profiler
- Get an idea of the big picture with a profiler
- Refine hotspots by
  - choosing a more efficient algorithm
  - choosing higher-performance language constructs
  - choosing a higher-performance library
  - microbenchmarking (possibly in place in running code!)
- Don't forget that code needs to be maintained: if you do something really clever/nonobvious, try to encapsulate it and explain why it's done that way
- Don't fall victim to "never X" rules. There are tradeoffs; make the compromises that serve you best.