Converting Graphs to CDF
One of the unfortunate side-effects of working in systems or networking (i.e. the parts of computer science that deal with operating systems, datacenters, or other things, it is hard to define, there are conferences) is that reviewers, readers, and other members of the “community” require that results be tied to reality. The problem with reality is that it is pretty hard for graduate students to get access to reality.
Put another way it is non-trivial for some of us to get access to workloads, but we do actually need workloads to evaluate research better. A semi-common way for graduate students to get data is to somehow obtain a CDF indicating how frequently something happens.
Just recently, I needed to get one of these, and I e-mailed someone. They had a graph, however not the data used to generate this graph. This is not uncommon since the data belongs to companies, while graphs, once published, are public. It is widely believed that given one or more graduate students, and one or more such graphs, one can in fact end up with a workload that is representative of what is seen in the real world. However I am a graduate student, and since handing this off would just lead to an infinite recursion, I needed to convert the graph below, which someone had fortunately provided me with, into a set of traffic patterns I could then simulate.
My plan was to take this graph, produce a machine readable CDF, then write some code that would do rejection sampling to pick points based on this distribution, and then use these points in some meaningful way.
I was dumb and I decided to generate a CDF from the graph by overlaying a grid and reading off numbers. This is extremely boring, and unscientific according to David Zats, and Shivaram. Since I have a few more years of this grad school thing, it seemed like a reasonable idea to automate this process somewhat, since this would make the numbers more accurate, and since writing Python code is more fun than trying to look at a graph with a grid on top. That said, this automated mechanism is only marginally more scientific, a single graph, from a single source, is not a very general source of what the world looks like.
Also I had only recently installed some software from Adobe, thoughtfully provided by Berkeley. While this software is slow, hard to use, and has an annoying habit of bringing up upgrade dialogs all the time, it is marginally cool to play with. It also shows that I can use artsy software, and might with enough practice and time pass off as a renaissance man.
Anyways, smarter (I mean more patient) people than me automate these things by converting images to EPS. I have a hard time opening EPS in Python (the Python Image Library does not like EPS files generated by Adobe software), so I used SVG instead. The first step in the process was isolating the curve (i.e. removing axis, and other distracting artefacts) to be processed. This is fairly simple to accomplish with Illustrator, or Inkscape which is free. This extraction process left something similar to what is shown below.
It turns out Illustrator produces pretty easily readable SVG. All the SVG generated for either of those graphs involves either lines or Cubic Bézier Cuves (see how did the accent correctly). Lines are easy to deal with, and don’t really carry much information needing interpolation (at least in the case of CDFs). Bézier Curves are also pretty easy, but allow one to add a few points along the curve. I have some code written that converts a curve to a set of points, the code can be seen here.
Given this set of points, it is reasonably easy to go from points to CDF, depending on precisely what the axis are, and whether they are log scaled or not. I have somewhat incomplete code available online.
I hope this helps someone (other than me). You can get all of the code, and perhaps some future improvements at this Github repository.