
Recognizing and Avoiding Pitfalls (TC18 Director's Cut #1)



I’ll be giving a presentation entitled “You Are an Artist: How and Why to Get Started Making Public Visualizations” at Tableau Conference this October. As is always the case, I found myself with lots more to say on the topic than will fit into the time allotted.

So, in a move that’s half “director’s cut” and half “bonus materials,” I’ve decided to blog about a few of the topics that I really wanted to include as part of the presentation—some of my favorite parts, actually—but that just don’t fit with the overall flow and tone.

I hope this will give you an idea of what I will be talking about in October (I would very much like to see everyone who reads this in the audience in New Orleans) and will also serve as an object lesson for one of my earlier posts, “Kill Your Darlings.”

Hope to see you at TC18!

 

As you get comfortable working on public visualizations, you’ll find yourself gaining a lot more technical ability than you had before. Tableau, the software tool, will become more comfortable and familiar; it may begin to feel more like an extension of your body than something you struggle to operate. You’ll also get used to the idea of having your work seen, liked, and commented upon by your peers in the world at large. This kind of regular practice and collaborative learning environment will certainly help you build your skills.

But in the process, you’re likely to encounter a few pitfalls: hurdles and challenges that most of us run into at one time or another. If you’re aware that they’re out there, maybe you can avoid falling into these traps in the first place.

Over-Interpreting the Data

Sometimes you’ll be presented with a data set and told, “Analyze this and see what you find.” You’ll work on it for a while and discover some notable outliers or patterns. You’ll create a visualization that reports your findings and put it together in a nice finished package. It’s got a great title, shows your results clearly, and makes a strong statement about your analysis.

Hypothetically: you get a data set of 10,000 rows. It has two columns: the first name of a male person, and the number of children that individual reports having. You do some analysis and find some interesting anomalies. So you make your charts, and then you write your title:

Why Don't Men Named Jayden and Liam Have Children?

And you go on to show how men named Michael, David, and Daniel average more than 2 children apiece, while men overall average around 2, but men named Liam and Jayden average close to 0.

Based on the data you have in front of you, this is factually true. But you’re missing so many different things!

  • Jayden and Liam are names that have only become popular in the 2000s. Men with these names haven’t had time to have lots of children. Most of them are children themselves. On the other hand, Michael, Daniel, and David have been popular names for decades.

  • You don’t report how many of each name there are in your data set. You can’t draw conclusions about all Jaydens if there is only one row with that name among the 10,000.

  • You don’t know when this data was collected. If it had been collected in 2001, you’d expect 0 children for Jayden and Liam, and maybe for Noah, Landon, Brayden, and Bryce as well.

  • You don’t know how this data was collected. Was it from the Census Bureau? Was it from phone calls? Was it 10,000 people in Beverly Hills, or 10,000 people in Detroit, or 10,000 people in San Antonio? Was it even in the United States? Would you get different answers?

  • And finally, the way you’re phrasing your title implies that there’s causality in the name. But a person’s name doesn’t have any effect on whether or not they have children (in most common cases).

We have all fallen into the trap of making assumptions about our data, especially when it is handed to us directly. But it’s important to question its provenance, to learn how complete the data actually is, and to seek out additional, contextualizing data when the first-level analysis turns up curious results.

And finally, don’t feel like you need to see relationships in the randomness. Any data set of any significant size is going to show some seemingly interesting patterns and oddities. But use some common sense: would it really mean anything if, for instance, this particular data set showed that people named Martin had 5.3 kids, on average?

A number like that could easily show up! About one in every 1,500 babies born in the last 50 years was named Martin; a data set of 10,000 rows could easily contain as few as 3 Martins; and if those three people had 11, 3, and 2 kids, a 5.3 average is exactly what you’d see. But all that would mean is that your sample happened to catch one Martin with an abnormally large number of children and two with about an average number.
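To make the “report the counts” point concrete, here is a minimal sketch in Python with pandas. The column names and the rows are hypothetical, invented just for this illustration; the habit it demonstrates is simply computing the sample size alongside any group average before you build a headline around it.

import pandas as pd

# Hypothetical stand-in for the name/children data set described above;
# the rows below are invented purely for illustration.
df = pd.DataFrame({
    "first_name":   ["Martin", "Martin", "Martin", "Michael", "Michael", "Liam"],
    "num_children": [11, 3, 2, 2, 3, 0],
})

# Report the count alongside the average: an eye-popping mean backed by
# only a handful of rows is probably noise, not a headline.
summary = (
    df.groupby("first_name")["num_children"]
      .agg(avg_children="mean", sample_size="count")
      .reset_index()
      .sort_values("avg_children", ascending=False)
)
print(summary)
# Martin's 5.33 average comes from just 3 rows, one of them an outlier;
# Liam's 0.0 average comes from a single row. Neither is a finding.

The same habit translates directly to Tableau: put a record count on the view, or at least in the tooltip, next to any average you’re tempted to turn into a title.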

Your data set doesn’t define the entire ground truth, even when it’s all factually accurate. Put on your deerstalker hat and be a detective. And be very careful about the conclusions you draw from your data—you don’t want to make assertions that could be scandalous, or incendiary, or even libelous, simply from misinterpreting the limited data in front of you.

Assuming the Audience Is…

I once heard a story, possibly fictional, about a professor leading a writing seminar for graduate students. One of his pupils came to office hours for advice: he was a fine writer but was not doing well in the seminar. He had always been praised for his ability and his vocabulary, and so couldn’t understand why he was struggling. The professor said to this student,

The most important thing to remember is that nobody wants to read a single word that you write.

This sounds incredibly harsh, doesn’t it? But the message wasn’t, “Your writing is terrible.” The message was, “People don’t have time. They do have lots of other options. If your writing isn’t CONSTANTLY convincing them to read further, they will lose interest and move on.”

We have this same struggle in our dataviz design.

By the time we get to the point where we are presenting a finished product, we are VERY familiar with our data. We know what our message is, we know the seven other messages we discarded in favor of the one we chose to focus our viz on, and we know the implications of our findings. We understand the subject matter, the charts, and the flow of our product.

Our audience knows none of these things.

A very common pitfall is to assume that the audience fits any or all of these characterizations:

  • Knows as much as we do about the subject

  • WANTS to know about the subject

  • Understands what we are implying about the subject

  • Will take the time to engage with our design

  • Has all the time in the world

Some of this might be true for business dashboards, where your users DO know the data very well, and DON’T have the option not to engage with your dashboard, and DO understand the implications of the metrics displayed onscreen.

In public visualizations, none of this is necessarily true. Your audience is a random person on the street (where the Internet is the street).

 

A better approach, when deciding how much detail to include, or how much explanation a viz requires, is to assume that your average audience member:

  • Is intelligent

  • Has no prior knowledge of the topic

  • Is naturally curious

  • Has a five-second attention span

Do not make your audience work to understand anything in the viz. YOU do the work for them.

Don’t “imply” connections or ramifications of your findings when you could state them outright on the page. Don’t overwhelm them with all kinds of information at once: give them something to get their attention (a title! a bold color! a nice design!); then teach them a bit more about the topic. Then show them something interesting about the data. Then explain why this is relevant. And so on.

Keep your audience engaged, five seconds at a time, and know that you will always care about the viz WAY more than they do… unless you make it interesting for them from start to finish.

But I Found It on Google!

Ah, the old “I found it on Google so it must be fine” problem. This is most often the case with images, when we try to spice up the visual punch of our dashboards. It’s SO EASY to find just about any image you want on Google Image search.

But the problem is, lots of those images are copyrighted. And as a creative yourself, you should probably not be on the “finders keepers” side of the intellectual property argument.

People make a living from their creative work. Illustrators, photographers, and multimedia artists are all trying to survive on the fruits of their labor. Don’t be that person who takes their hard work and repurposes it without their knowledge, permission, and/or compensation.

Besides, there are so many other ways to get imagery for free, legally.

  • Unsplash.com and Pixabay.com are great resources for free, fully usable stock photography. Usually you can just credit the photographer in a line of text in your finished product; in some cases you don’t even need to do that.

  • Flickr also lets you search for photos released under a Creative Commons license; again, you usually just have to credit the original photographer.

  • Thestocks.im is another site that I have occasionally used to find free imagery.

  • Thenounproject.com is another great one that requires you only to credit the original designer.

Look, a line of text crediting other creators is a small price to pay for using high-quality visual designs in your own data visualizations. Even Austin Kleon, in “Steal Like an Artist,” makes the distinction between honoring/remixing work and plagiarizing it.

Please, as a design professional, make sure you have the rights to use the images you include.

P.S.: This also applies to datasets. Not all data is open (whether you believe it should be or not). Some datasets can be “found” only because people have scraped them in violation of a website’s terms of service. I don’t know whether this is a moral issue or an ethical one (I defer to Bridget Cogley on those distinctions), but I do know that not everything you find through a web search is freely available for reuse. Be careful out there.
