Book Review: Visualizing Data
I'll be honest: I am no expert in data visualisation. I had not heard of Edward Tufte [1] before looking at this book, and while I thought I had an idea about the topic, the book suggested to me that I did not. Perhaps this makes me unable to judge the value of its content, but I prefer to think it means I can come at the work as a member of the target audience:
"...the book is for web designers who want to build more complex visualizations than their tools will allow. It's also for software engineers who want to become adept at writing software that represents data... Fundamentally this book is for people who have a data set, a curiosity to explore it, and an idea of what they want to communicate about it." [2]
As a researcher I certainly have the curiosity to explore data and regularly have the need to communicate something about it. It is with this in mind that I began reviewing Visualizing Data: Exploring and Explaining Data with the Processing Environment.
The book certainly hits the ground running, with Chapter One outlining a seven-step programme for data visualisation: Acquire, Parse, Filter, Mine, Represent, Refine, Interact. These steps are illustrated with a worked example, and they are both useful and understandable to someone unfamiliar with the "process of visualisation". Perhaps it helps, however, that I'm a software developer and "The Process" is reminiscent of iterative software development. What is good is the way the author focuses on the solid foundations of a data visualisation before jumping in with things like "look at this snazzy reflection effect". That is to say, this book is clearly concerned with getting the right data, in the right format, before considering the graphics. I tend to approve of such ground-up approaches. To be clear, though: if you are looking for a book about whizzy graphics and pretty pictures, this is not the book you want.
Chapter Two introduces the Processing software [3] used throughout the rest of the book, and in many ways Processing (not visualising data) is the star of this work, to the extent that one Amazon reviewer suggests reversing the title and the subtitle. I tend to agree.
The Processing software describes itself as:
"...an open source programming language and environment for people who want to program images, animation, and interactions. It is used by students, artists, designers, researchers, and hobbyists for learning, prototyping, and production. It is created to teach fundamentals of computer programming within a visual context and to serve as a software sketchbook and professional production tool. Processing is an alternative to proprietary software tools in the same domain."[4]
which makes it sound like a competitor to the likes of Flash, though Processing's FAQ would say otherwise. I'm not familiar with Flash so cannot compare, but I am impressed by Processing.
Essentially, Processing provides a self-contained development environment (the PDE) and a Java-like programming language with which to create, as it states, images, animation and interactions. It is clearly suited to data visualisation problems and has enough power to address each of the steps outlined in Chapter One, from reading and parsing data files (Acquire, Parse) to handling mouse clicks on the visualisation itself (Interact). Perhaps one of its nicest features is the ability to export any work as a Java application and/or applet so it can be shared.
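To give a flavour of what this looks like in practice, here is a minimal sketch, in plain Java rather than Processing itself, of the kind of acquire-and-parse step described above: reading tab-delimited lines and splitting them into fields. The class name and data values are my own invention for illustration; Processing wraps this sort of work in convenience functions such as loadStrings() and split().

```java
import java.util.ArrayList;
import java.util.List;

public class AcquireParse {
    // Split each tab-delimited line into a row of fields: roughly
    // what Processing's loadStrings() and split() give you for free.
    static List<String[]> parseTsv(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank()) continue; // skip blank lines
            rows.add(line.split("\t"));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical two-line dataset, inlined for self-containment.
        List<String> lines = List.of("team\twins", "Red Sox\t96");
        List<String[]> rows = parseTsv(lines);
        System.out.println(rows.size() + " rows, first field of row 1: " + rows.get(1)[0]);
    }
}
```

In Processing itself this would be a few lines inside a sketch; the point is only that the Acquire and Parse stages are ordinary, unglamorous string handling.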
If Processing is the focus of the book, then you might now be discounting Visualizing Data in the same way you might a book on Photoshop, as being about using an expensive software package. However, Processing is released under the GNU General Public License [5], ensuring that Processing itself remains firmly open source, while any products created with it can be licensed as the user wishes (including for commercial use). Worried instead that it is open source? Processing, it seems, has an active community around it and has been developed and used for seven years, which is reassuring for an open source software project.
Back to the book: having very briefly familiarised the reader with Processing, the next six chapters present worked examples addressing different data types: Mapping (Chapter 3), Time Series (Chapter 4), Connections and Correlations (Chapter 5), Scatterplot Maps (Chapter 6), Trees, Hierarchies and Recursion (Chapter 7) and Networks and Graphs (Chapter 8). Each of these chapters addresses specific problems that arise from the type of data: for example, using recursion to traverse a directory tree in Chapter 7, or choosing appropriate axes, ranges and point styles in Chapter 4.
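As an aside on the axes-and-ranges problem, much of it reduces to linearly remapping a data value from its natural range onto a range of pixels. Processing provides a built-in map() function for exactly this; the plain-Java sketch below (class name my own) shows the underlying arithmetic.

```java
public class Remap {
    // Linear remap of v from [inLo, inHi] to [outLo, outHi],
    // the arithmetic behind Processing's built-in map() function.
    static float map(float v, float inLo, float inHi, float outLo, float outHi) {
        return outLo + (v - inLo) * (outHi - outLo) / (inHi - inLo);
    }

    public static void main(String[] args) {
        // A data value of 50 on a 0..100 axis lands halfway across
        // a 600-pixel-wide plot area.
        System.out.println(map(50, 0, 100, 0, 600)); // prints 300.0
    }
}
```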
Further, each chapter also outlines broader data visualisation principles. For example, Chapter 5, an exploration of the correlation between baseball team pay and team standing, illustrates a couple of methods of obtaining data from the Web, including manual methods such as using a spreadsheet program to capture data from tables (useful for data that does not change very often). Such techniques are useful for non-programmers who might not be able to script the data-acquisition step, and they illustrate the author's pragmatic bent: why script when a manual process will suffice? The same chapter also touches on typography, and in Chapter 3 we are presented with useful functions for handling mouse input. A nice touch is the way each heading within the chapters is linked back to the seven stages of data visualisation outlined in the first chapter, simply by including the name of the stage in the heading.
The next two chapters (Acquiring Data and Parsing Data, Chapters 9 and 10 respectively) are, to use the author's words, "cookbook-style" and, as you might guess, designed to give Processing users the ability to discover and organise datasets from the Web as well as those created and/or stored locally. Chapter 9 looks at reading large data files, integrating Processing with MySQL [6] databases, and, of course, collecting data from the Web.
Chapter 10 is all about parsing, briefly discussing the different formats data might arrive in and how they might be processed for use in a visualisation. The chapter touches on screen scraping (including, handily, embedding Tidy into the Processing program), parsing text, HTML, XML (including SVG [7]), JSON [8] and so on, before moving on to more specialised formats such as DXF (AutoCAD) [9], PDF and Shapefile [10]. It then delves into the darkness of binary formats before getting its hands dirty pulling apart Web pages by packet sniffing, to find out where those pages get their data. Discussion in these chapters is necessarily brief, but it certainly provides the reader with plenty of ideas.
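As a toy illustration of the screen-scraping idea, and not code from the book, the following plain-Java snippet pulls the text of table cells out of an HTML fragment with a regular expression. Real pages are far messier, which is why the book runs them through Tidy first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Scrape {
    // Collect the text of every <td>...</td> cell in an HTML fragment.
    // Non-greedy (.*?) stops each match at the first closing tag.
    static List<String> tableCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = Pattern.compile("<td>(.*?)</td>").matcher(html);
        while (m.find()) cells.add(m.group(1));
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical table row, echoing the book's baseball example.
        String html = "<tr><td>Red Sox</td><td>96</td></tr>";
        System.out.println(tableCells(html)); // prints [Red Sox, 96]
    }
}
```

This kind of regex scraping is brittle against attributes, nesting and malformed markup, which is precisely the gap Tidy is there to close.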
The book concludes with its most specialised chapter, "Integrating Processing with Java", which highlights the differences between Processing and the Java programming language (the book describes Processing as a "dialect" of Java) and shows how Processing might be used within a broader programming application. While only a few of the book's readers will need this chapter, it is a welcome addition nonetheless, extending the power and range of the Processing environment.
It is important to note that this book ramps up the required technical ability and understanding quite quickly. The book itself admits that, without a prior background in programming, Chapter Five onwards will "quickly become more difficult". That said, the author makes some attempt to accommodate varying degrees of programming knowledge throughout the text. However, these attempts sometimes lead to frustration or confusion: frustration because there might be an explanation of a programming concept that seems obvious to a programmer (objects, for example), and confusion because the explanation does not seem detailed enough for the beginner. The same can be said of the design and visualisation aspects of the text, which sometimes left me, a beginner in the field, confused; I wondered if the same explanations would seem obvious to a designer. It is difficult ground for an author, but I think deciding firmly on a level and audience for the text would clarify the writing, even if it limited the readership.
To conclude, Visualizing Data is a useful publication. Yes, it can be quite dry at times, but good data is needed for good visualisation, and good data is a dry subject. Yes, it can be quite wordy at times, though that may be my prejudice against wordy textbooks. Yes, it sometimes tries to walk a middle ground of technical experience, which will leave beginners bemused and old lags bored; but, I suspect, outside computer science and design there are large numbers of people who really are in this middle ground. Finally, do not be misled by the title: Visualizing Data focuses mainly on Processing and how to use it to visualise data. In this it is successful, and some general principles can be picked up along the way. But I imagine there are better books (indeed, a couple are listed in the work's bibliography) that explore data visualisation in general. So if you want an introduction to the subject, you would probably be best starting elsewhere, before returning to this book for a neat way to express your creative ideas.
References
1. The Work of Edward Tufte and Graphics Press http://www.edwardtufte.com/tufte/
2. Visualizing Data, Ben Fry, O'Reilly, 2008, Preface, 2.1
3. Processing 1.0 (Beta) http://www.processing.org/
4. Ibid.
5. GNU General Public License http://www.gnu.org/licenses/gpl.html
6. MySQL http://www.mysql.com/
7. Scalable Vector Graphics (SVG) http://www.w3.org/Graphics/SVG/
8. JSON (JavaScript Object Notation) http://www.json.org/
9. AutoCAD 2000 DXF Reference http://www.autodesk.com/techpubs/autocad/acad2000/dxf/
10. ESRI Shapefile Technical Description http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf