The problem I've faced over the last six months has been trying to get a better understanding of the performance characteristics of the systems where I work. This isn't really a technical post - although I do link to some sample code in my GitHub repository, it's mostly a background perspective on what I've found helpful and why - and it should be read from the perspective of someone who doesn't code for a living. I'm mostly pushing out PowerPoints and Word docs, but when the opportunity allows I love the chance to get into a bit of the data and make something of it.
To get a bit more specific: my co-workers and I need to get a good handle on the performance characteristics of a large legacy technology system that's being replaced. It sits at the heart of a customer-facing system with upwards of a million customers, so performance is very important. The profiling is essential for defining non-functional requirements (a terrible term - "qualities of service" is so much better, but the industry can't seem to get away from it), and it's those NFRs that are so often misrepresented and misunderstood, yet crucial to successful delivery.
So what do we do, and why has F# been helpful? There are probably more steps than this, but for me it comes down to three things:
- We have to baseline existing performance which means lots of analysis.
- Then we set target requirements based on that baseline, and maybe introduce some tougher ones - because why not, it's a new and hopefully better thing you're developing/configuring/deploying.
- Then we need to verify through testing under load.
So why F# for this?
Well, let's look at profiling. What it really means is collecting lots of data. That data comes from log files and databases, and it has shape - it can all be described as being of some type - and F# is really, really good at dealing with data that has some form of shape, some constraints or rules.
I've previously used Perl, Python, R, Excel and even VBA; they're all popular choices, but the ability to type the data becomes really useful when you're trying to get to grips with the edge cases that trip you up later, or when you start moving into larger volumes of data. The dynamic typing you get with Perl, Python and R is great for getting started, but you keep hitting edge cases as you process larger volumes, and it can be really hard to figure out why, because it all happens at runtime. The strong, up-front typing in F# (and other statically typed languages) is very handy in these circumstances. So that's reason number one for me: static, inferred typing that tells you what's going on right up front.
OK, how about another reason: data manipulation. Let's say you've got the data into memory and you're working on it. The REPL, in combination with the inferred typing, the functional and imperative data structures on offer (arrays, seqs, Deedle data frames etc.) and functional approaches like maps, makes data transformation a whiz. Microsoft's SQL Server is pretty good - I use it a lot and it's a truly wonderful database - but Transact-SQL is just not the best approach for many parsing or analysis problems; you rapidly end up playing tricks like constructing XML to work around it. Nope, F# rocks for interactive data transformation - that's definitely reason number two.
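To make that concrete, here's a minimal sketch of the sort of interactive transformation I mean - the record shape and the sample values are invented purely for illustration:

```fsharp
// Illustrative only: the record shape and the sample values are invented.
type Sample = { Timestamp : System.DateTime; DurationMs : float }

let samples =
    [| { Timestamp = System.DateTime(2015, 12, 1, 9, 0, 5);  DurationMs = 120.0 }
       { Timestamp = System.DateTime(2015, 12, 1, 9, 0, 40); DurationMs = 340.0 }
       { Timestamp = System.DateTime(2015, 12, 1, 9, 1, 10); DurationMs = 95.0 } |]

// Group into one-minute buckets and take the mean duration per bucket.
let perMinute =
    samples
    |> Array.groupBy (fun s ->
        System.DateTime(s.Timestamp.Year, s.Timestamp.Month, s.Timestamp.Day,
                        s.Timestamp.Hour, s.Timestamp.Minute, 0))
    |> Array.map (fun (minute, xs) -> minute, xs |> Array.averageBy (fun s -> s.DurationMs))
    |> Array.sortBy fst
```

Send it to F# Interactive, inspect the result, tweak, repeat - that tight loop is the whole point.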
The third big reason for me is the broader .NET ecosystem. It just works well together: plugging in libraries isn't hard, things tend to work, and the last few years have seen the packages on NuGet expand dramatically. It's a wonderful thing to pull up the Accord libraries, or Deedle, or the many type providers, or FsLab, and just have it all work straight away.
The last reason has to be brought up: this is an awesome community. People help, and they're positive. That's a helluva plus when you're just trying to learn.
OK, so those are the reasons why I find F# works so well for me. Let's go into a little more detail and walk through a typical workflow for this kind of performance profiling.
Step 1 is reading log data. Most often that means File.ReadAllLines, or pulling SQL data via one of the SQL type providers. Most enterprise environments will also have a variety of monitoring systems to query - we happen to have Microsoft's System Center Operations Manager installed, with a data warehouse containing multiple years' worth of performance metrics, and the SQL type providers make reading this easy. Here's a question: has anyone considered a type provider for IntelliTrace files? I can't find any information on how to query them outside Visual Studio, but I'd love to know.
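Both entry points look roughly like this. Everything here is a placeholder - the file path, the connection string and the query - and the SQL half assumes the FSharp.Data.SqlClient package (SqlCommandProvider); the exact construction varies a little between provider versions, so treat it as a sketch rather than gospel:

```fsharp
// Placeholders throughout: file path, connection string, query.
// Assumes the FSharp.Data.SqlClient package is referenced for SqlCommandProvider.
open System.IO
open FSharp.Data

// 1. Raw application logs: pull everything into memory and parse afterwards.
let rawLines = File.ReadAllLines @"C:\logs\app-20151201.log"

// 2. The SCOM data warehouse (or any other SQL source) via a SQL type provider.
[<Literal>]
let ScomDw = "Data Source=.;Initial Catalog=OperationsManagerDW;Integrated Security=True"

type PerfCounters =
    SqlCommandProvider<"SELECT DateTime, SampleValue FROM Perf.vPerfRaw", ScomDw>

let counters =
    use cmd = new PerfCounters(ScomDw)
    cmd.Execute() |> Seq.toArray
```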
I like regular expressions and active patterns for parsing text data - they help tease out the edge cases - but if the structure is simple then basic string operations like Split can be fast and straightforward. In practice I haven't had much luck with the CSV type provider: there are too many oddities in the files I come up against - far too many of them effectively spread a record across multiple lines - and I very rarely see genuinely simple CSV data. In the example code I've got a typical multi-line file that needs to be pre-parsed before handing off to the ReadCsv method in Deedle.
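As a sketch of the regex/active-pattern approach (the log line format here is invented):

```fsharp
open System.Text.RegularExpressions

// Partial active pattern: returns the regex capture groups when the pattern matches.
let (|Regex|_|) (pattern : string) (input : string) =
    let m = Regex.Match(input, pattern)
    if m.Success then
        Some [ for i in 1 .. m.Groups.Count - 1 -> m.Groups.[i].Value ]
    else None

// e.g. "2015-12-01 09:00:01 INFO Request 42 completed in 245 ms" (made-up format)
let parseCompletion line =
    match line with
    | Regex @"Request (\d+) completed in (\d+) ms" [ reqId; durationMs ] ->
        Some (int reqId, int durationMs)
    | _ -> None
```

The nice thing is that a line which doesn't match simply falls through to None rather than blowing up halfway through a file.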
So the data gets into local memory and that's good enough for most of the problems we face, in fact you can make a pretty good argument that big RAM is beating big data. Representing it in memory is usually a case of an Array, or a Deedle data frame. Data frames are very handy but there's a bit of a learning curve due to the increased flexibility. (Fortunately I see a new book has just been published - I need to read it!)
The next step is usually a series of transformations. Using Array.mapi means that if something goes wrong I can get a line number. Array.Parallel.mapi is nice when an individual task is particularly slow, but when running interactively it often doesn't provide a simple performance boost. Using the Some/None option type makes missing values explicit, which turns out to be incredibly useful - missing values are everywhere in real-world data. It forces you to handle the missing-data case and then progressively unwrap.
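Something along these lines - a minimal sketch where the delimiter and field positions are invented, and rawLines is the array read in earlier:

```fsharp
// Invented format: "timestamp|operation|duration_ms"
let tryParseDuration (line : string) =
    let fields = line.Split('|')
    if fields.Length >= 3 then
        match System.Double.TryParse fields.[2] with
        | true, ms -> Some ms
        | _ -> None                       // present but not a number
    else None                             // missing entirely

let durations =
    rawLines
    |> Array.mapi (fun i line ->
        match tryParseDuration line with
        | Some ms -> Some ms
        | None ->
            printfn "line %d did not parse: %s" i line   // we know exactly where
            None)
    |> Array.choose id                    // unwrap: keep only the values that were there
```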
Timing a series of piped-forward transformations is useful for finding where the lengthy steps are, and it can be done easily with a custom operator. (By the way, I'd love to know if there's a way to reflect back something readable about the line of code being executed in the pipeline.)
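Here's the kind of thing I mean - a sketch of a timing operator that wraps each stage in a Stopwatch and takes a label (since, as far as I know, you can't easily recover the source text of a pipeline stage at runtime); the stage names and functions are just placeholders reusing the parsing sketch above:

```fsharp
// Pipe-forward with timing: x |>! ("label", f) behaves like x |> f but reports elapsed time.
let (|>!) x (label, f) =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    let result = f x
    printfn "%s took %dms" label sw.ElapsedMilliseconds
    result

// usage
let sortedDurations =
    rawLines
    |>! ("parse", Array.choose tryParseDuration)
    |>! ("sort",  Array.sort)
```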
So, let’s say we’ve transformed the data and we now have something like durations at certain timestamps. What now?
Time to use FsLab and Accord.NET. Graphing distributions of response times, or the number of requests per time interval, is a great way of understanding the system you're dealing with. The shapes can give some insight into the underlying processes that generate the distributions. Classic examples are random arrival events resulting in a Poisson distribution, measurement errors resulting in a normal (Gaussian) distribution, or failure events resulting in a Weibull distribution. The Accord.NET author, César de Souza, has produced a very useful tool and accompanying article for getting a handle on the different types of distribution, with documentation on the underlying nature of each and sample code for working with them interactively. In practice, with real system data, you're going to find lots of skew and probably multiple peaks due to competing processes - use your eyes to figure out what's happening.
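A small sketch of what the fitting side looks like with Accord.NET - the observed array is invented, and it's worth double-checking the member names against the current Accord documentation:

```fsharp
open Accord.Statistics.Distributions.Univariate

// Invented sample of response times in ms; in practice this comes out of the pipeline above.
let observed = [| 120.0; 95.0; 340.0; 180.0; 2150.0; 160.0; 210.0; 130.0 |]

let normal = NormalDistribution()
normal.Fit(observed)

let exponential = ExponentialDistribution()
exponential.Fit(observed)

printfn "Normal fit:      mean=%.1f sd=%.1f" normal.Mean normal.StandardDeviation
printfn "Exponential fit: rate=%.4f" exponential.Rate
// The same Fit pattern applies to WeibullDistribution, PoissonDistribution and friends.
```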
I used FsLab for that visualisation; it incorporates both FSharp.Charting and XPlot.GoogleCharts, and they both work well.
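For a quick look at the shape of the data, something like this does the job with FSharp.Charting - here bucketing the invented observed sample from the previous sketch into 100 ms bins:

```fsharp
open FSharp.Charting

let histogram =
    observed
    |> Seq.countBy (fun ms -> int (ms / 100.0) * 100)   // 100 ms wide buckets
    |> Seq.sortBy fst
    |> Chart.Column
    |> Chart.WithXAxis(Title = "duration (ms)")
    |> Chart.WithYAxis(Title = "count")
```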
The Accord Framework is very broad - I've used it here for modelling, fitting and sampling distributions but it goes far further - machine learning, image processing etc.
With a good understanding of the baseline performance characteristics you're in a position to set service level expectations for response times - basically, the QoS requirements - and once you have requirements you'll want to test against them. So you'll either be sampling from real historical data or from a fitted distribution, which lets you push into the long tails. That's important because unless you take a really long historical sample you'll probably never get to test the extreme values; they occur only rarely, but they're the ones that might bring your application to a grinding halt.
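As a sketch, carrying on from the exponential fit above (again, the member names are as I remember them from Accord, so treat this as indicative):

```fsharp
// Draw synthetic values from the fitted distribution, including the rare tail values
// that a short historical sample is unlikely to contain.
let syntheticDurations : float[] = exponential.Generate(10000)

// You can also ask about the tail directly, e.g. the fitted 99.9th percentile.
let p999 = exponential.InverseDistributionFunction(0.999)
```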
Calling into libraries, legacy systems and so on to execute your tests is a piece of cake with F#. If it's a web service endpoint, the WSDL type provider makes calling the service easy, and the async MailboxProcessor is a very simple way to run up parallel workers for stress testing.
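A minimal sketch of the MailboxProcessor idea - callSystemUnderTest is a stand-in for whatever service or library call is being exercised, and the synthetic durations come from the sampling step above:

```fsharp
// Stand-in for the real call being load tested.
let callSystemUnderTest (payload : float) = async {
    do! Async.Sleep (int payload)        // pretend work
    return payload }

// Each worker is a MailboxProcessor that pulls messages off its queue and makes the call.
let worker id =
    MailboxProcessor<float>.Start(fun inbox -> async {
        while true do
            let! payload = inbox.Receive()
            let sw = System.Diagnostics.Stopwatch.StartNew()
            let! _ = callSystemUnderTest payload
            printfn "worker %d finished a call in %dms" id sw.ElapsedMilliseconds })

// Spin up a handful of workers and spray the sampled values across them.
let workers = [| for i in 1 .. 8 -> worker i |]
syntheticDurations
|> Array.iteri (fun i d -> workers.[i % workers.Length].Post d)
```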
I've put a sample fsx script on GitHub which shows a typical scenario I might go through in the office. It starts with the retrieval and parsing of data using Deedle, then uses Accord.NET to find the best fit across common distribution types. There are a number of transformations involved, along with some sample questions, approaches to answering them, and graphing tasks. The data in this case doesn't have enough input variables to be useful for a machine learning demo, but it is very typical of what gets written to an older enterprise application log. (In retrospect it would have been interesting to include measures of the batch sizes and the relative complexity of the processing steps; then we could have looked at how the output processing times relate to multiple input variables.)
The last big part of the overall process is presenting the data back and incorporating it into documents and presentations. For me, Microsoft Office still rules the roost, and the entry point for data is Excel. My default way to get data into it now is Deedle's Frame.SaveCsv, which works well as the final step of a file processing pipeline:
File data -> import process -> transformation and analysis -> representation as an array of records -> Frame.ofRecords -> SaveCsv -> Excel and Word/PowerPoint drudgery.
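In code, the tail end of that pipeline looks something like this (the record shape and values are invented; it assumes the Deedle package is referenced):

```fsharp
open Deedle

type ResultRow = { Minute : System.DateTime; MeanMs : float; Count : int }

let rows =
    [| { Minute = System.DateTime(2015, 12, 1, 9, 0, 0); MeanMs = 182.5; Count = 120 }
       { Minute = System.DateTime(2015, 12, 1, 9, 1, 0); MeanMs = 240.1; Count = 98 } |]

let frame = Frame.ofRecords rows
frame.SaveCsv(@"C:\temp\results.csv")
// ...and from there it's Excel and the Word/PowerPoint drudgery as usual.
```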
So there you have it. An F# Advent posting simply because it deserves it. A very useful programming language, community and technology toolkit that's made many a problem go away this past year.
Bring on 2016.