Data: asking the right questions

As publishers, we are not short of data. The problem many of us have is that much of it has been collected without enough thought being given to the questions or the structure, resulting in messy, hard-to-use data. Chris Matthews has some data collection tips.

By Chris Matthews

Several years ago, I was given a shoebox full of business cards that had been collected at a trade fair and was told to send them off to a data entry bureau so the contents could be imported into our new business database. That seemed simple enough.

When the resulting spreadsheet came back I noticed three problems that caused me a lot of extra work.

All the “department” information had been entered into the first address line, the name parts were all in one field (eg “Mr Chris Matthews”) and there were five address lines. Our database had a separate field for “department”, separate fields for “title”, “firstname” and “surname” and it only had four address lines.

Years later, I’m still seeing the same lack of foresight when it comes to planning data collection.

I pass this experience on to you, dear data collector, along with some other hints, tips and tricks that have helped make my working life a bit easier.

1. Think long and hard about the structure

Is the data you are collecting going to be added to, deduped against or compared to any other database? If so, you’ll make life much easier if you align the structures (fields, field types, field lengths and so on) wherever possible. Just think about my box of business cards.
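As a rough illustration of aligning an incoming file with the target structure, the sketch below splits a single name field into “title”, “firstname” and “surname” and folds five address lines into four. The list of titles, the function names and the rule for merging address lines are assumptions made for this example, not a description of any particular database.

```python
# Assumed list of titles; a real import would need a fuller one.
KNOWN_TITLES = {"Mr", "Mrs", "Ms", "Miss", "Dr", "Prof"}


def split_name(full_name):
    """Split 'Mr Chris Matthews' into title, firstname and surname."""
    parts = full_name.split()
    title = ""
    if parts and parts[0].rstrip(".") in KNOWN_TITLES:
        title = parts.pop(0).rstrip(".")
    if len(parts) > 1:
        firstname, surname = parts[0], " ".join(parts[1:])
    else:
        # Only one word left: treat it as the surname.
        firstname, surname = "", " ".join(parts)
    return {"title": title, "firstname": firstname, "surname": surname}


def fit_address(lines, max_lines=4):
    """Squeeze an address into max_lines fields, merging any overflow."""
    lines = [line.strip() for line in lines if line.strip()]
    if len(lines) <= max_lines:
        return lines + [""] * (max_lines - len(lines))
    head = lines[:max_lines - 1]
    tail = ", ".join(lines[max_lines - 1:])
    return head + [tail]


name_fields = split_name("Mr Chris Matthews")
address_fields = fit_address(["Unit 5", "Mill Lane", "Old Town", "Anyshire", "AB1 2CD"])
```

Merging overflow lines can still exceed the target field length, so a check against the field lengths of the receiving database would be a sensible next step.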

2. Bracketing – when and when not to

It’s very common to see a questionnaire asking if you’re aged “25 to 44” or “45 to 60”, and I can only think the designers of such forms assume they’ll get more replies if the respondent only has to tick a box rather than write the number “32”. But suppose you want to find out how many respondents are aged between thirty and fifty? With those brackets you can’t; with actual ages you can count any range you like.

On the other hand, some pieces of data are better captured in brackets - job functions, for example. Ask a hundred working people what their job function is and you’ll get a hundred different answers, including a lot of job titles, which are not the same thing at all.
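The difference between exact ages and brackets can be shown with a small sketch. The ages below are invented for the example, and the function names are hypothetical.

```python
from collections import Counter

# Invented sample of exact ages as written in by respondents.
ages = [23, 32, 38, 45, 51, 58]


def count_between(ages, low, high):
    """Count respondents whose age falls in the inclusive range low..high."""
    return sum(1 for age in ages if low <= age <= high)


def to_bracket(age, brackets=((25, 44), (45, 60))):
    """Reduce an exact age to the tick-box bracket it would have fallen into."""
    for low, high in brackets:
        if low <= age <= high:
            return f"{low} to {high}"
    return "other"


# With exact ages, any range can be counted after the event.
aged_30_to_50 = count_between(ages, 30, 50)

# With brackets, only the bracket totals survive.
bracket_counts = Counter(to_bracket(age) for age in ages)
```

Exact ages can always be grouped into brackets later, but bracket totals cannot be turned back into ages, so the “thirty to fifty” question has no answer from `bracket_counts` alone.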

3. Unique identifiers

This may seem blindingly obvious, but I still see it neglected. Make sure your request for information always carries a unique identifier from the source file and that the identifier comes back with the response. This way you can find out if someone has moved or their job has been taken over by someone else and, if the source is your own database, it’s an easy way of getting a free update.
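A minimal sketch of how a returned identifier might be used is given below. The field name `source_id`, the record layout and the sample records are all assumptions for illustration.

```python
def compare_responses(source, responses, id_field="source_id"):
    """Match responses to source records by identifier and report changes.

    Returns (updates, unmatched): updates maps each identifier to the
    fields that differ as (old, new) pairs; unmatched lists responses
    whose identifier is missing or unknown.
    """
    updates = {}
    unmatched = []
    for response in responses:
        record = source.get(response.get(id_field))
        if record is None:
            unmatched.append(response)
            continue
        changed = {
            field: (record[field], value)
            for field, value in response.items()
            if field in record and value != record[field]
        }
        if changed:
            updates[response[id_field]] = changed
    return updates, unmatched


# Invented source file keyed by unique identifier.
source = {
    "C001": {"name": "Pat Example", "job_title": "Editor"},
    "C002": {"name": "Sam Sample", "job_title": "Publisher"},
}
responses = [
    {"source_id": "C001", "name": "Pat Example", "job_title": "Editor"},
    {"source_id": "C002", "name": "Jo Newcomer", "job_title": "Publisher"},
    {"name": "No Identifier", "job_title": "Buyer"},
]
updates, unmatched = compare_responses(source, responses)
```

Here the second response shows the job has passed to someone else, while the third cannot be tied back to any record because it came back without its identifier.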

4. Pointers to help data integrity

Is it “UK” or “United Kingdom” or “Great Britain”? Use a drop-down list if you’re collecting via a web form. If 90% of your audience is in the UK, put it at the top of the list, even if “United Kingdom”, “Afghanistan”... looks a little odd.
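For data that has already been collected as free text, the variants can be mapped onto one spelling, and the drop-down order can be generated with the most common country first. The alias table below is a small assumed sample, and treating “Great Britain” as “United Kingdom” is a simplification that may not suit every dataset.

```python
# Assumed alias table; extend it with whatever variants appear in your data.
COUNTRY_ALIASES = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "gb": "United Kingdom",
    "great britain": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def normalise_country(raw):
    """Map known variants to one spelling; leave unknown values unchanged."""
    key = " ".join(raw.strip().lower().split())
    return COUNTRY_ALIASES.get(key, raw.strip())


def dropdown_options(countries, most_common="United Kingdom"):
    """Put the most common country first, then the rest alphabetically."""
    rest = sorted(c for c in countries if c != most_common)
    return [most_common] + rest


options = dropdown_options(["France", "United Kingdom", "Afghanistan"])
```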

Time-related data (“how old are you?”) is going to be of questionable value unless you can assign an actual date to it, because the answer goes out of date as soon as it is given. Include a calculated field (“date of birth”) and use that for the analysis instead.
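One way to sketch the calculation, assuming the date each answer was given has been recorded: an age in whole years only narrows the birth year down to two candidates, whereas a stored date of birth lets the age be recalculated for any later analysis date. The function names and dates are hypothetical.

```python
from datetime import date


def birth_year_range(age, answered_on):
    """Possible birth years (inclusive) for someone aged `age` on `answered_on`."""
    latest = answered_on.year - age
    return (latest - 1, latest)


def age_on(date_of_birth, on_date):
    """Age in whole years on a given date, from a stored date of birth."""
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)


# Invented example: the respondent said "32" on 15 June 2010.
years = birth_year_range(32, date(2010, 6, 15))

# With a date of birth on file, the age can be worked out for any date.
age_then = age_on(date(1978, 3, 1), date(2010, 6, 15))
age_later = age_on(date(1978, 3, 1), date(2020, 1, 1))
```

Asking for the date of birth directly, where respondents are willing to give it, avoids the two-year uncertainty altogether.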

It’s worth using the Royal Mail’s “Postcode Address File” (PAF) to clean and/or collect your addresses. This will help proper delivery, make deduping more effective and lessen problems caused by bad handwriting as you only need to input the postcode and house number. It’s available from Royal Mail or as a service from many data bureaux and mailing houses.

5. Analysing the “comments”?

Free text comments can be a can of worms and I advise thinking long and hard before giving your respondent an empty box to play in. If you only expect to have to read through a handful, it’s not too bad. But if you get thousands and you try to analyse free text by pulling out keywords, you’ll pull “terrible” out of “despite some terrible reviews, I thought this product was first class” and call it a negative comment.
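The pitfall is easy to reproduce. The sketch below is a deliberately naive keyword matcher with an assumed keyword list; it is shown as an example of what goes wrong, not as a recommended method.

```python
# Assumed list of "negative" keywords for the demonstration.
NEGATIVE_KEYWORDS = {"terrible", "awful", "poor"}


def naive_label(comment):
    """Flag a comment as negative if it contains any negative keyword."""
    words = {word.strip(".,!?;:\"'").lower() for word in comment.split()}
    return "negative" if words & NEGATIVE_KEYWORDS else "unflagged"


comment = "despite some terrible reviews, I thought this product was first class"
label = naive_label(comment)  # "negative", although the comment is full of praise
```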

6. A different approach to surveys

It’s not uncommon for me to have to tell a client that I can’t answer the questions they’re asking about their customers or readers because the data they’ve collected won’t produce the answer. If I know how many children each respondent has and how many people live in each household, can I calculate how many households contain children? No, because I haven’t asked how many children live at home.
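To see why the two collected figures are not enough, consider a respondent with two children living in a household of two. The sketch below, using invented figures and hypothetical names, checks two quite different households against the same pair of answers.

```python
# What the survey actually collected for one invented respondent.
answers = {"children": 2, "household_size": 2}

# Two possible households behind those answers.
couple_children_left_home = {"adults": 2, "children_at_home": 0}
single_parent_one_at_home = {"adults": 1, "children_at_home": 1}


def consistent(answers, scenario):
    """True if the scenario fits the collected answers."""
    total = scenario["adults"] + scenario["children_at_home"]
    return (
        total == answers["household_size"]
        and scenario["children_at_home"] <= answers["children"]
    )


both_fit = (
    consistent(answers, couple_children_left_home)
    and consistent(answers, single_parent_one_at_home)
)
```

Both scenarios fit the answers, yet only one household contains children, so the question cannot be settled from this data.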

I have found a back-to-front approach to surveys can produce more useable results. The marketers should start by devising the presentation they’re going to give at the end of the process, including their best guess at the actual numbers. It’s helpful to have set your expectations before seeing the real results because this can help highlight and explain surprises in the figures.

That presentation should then be passed to the data analyst who will design a dataset that can reliably produce those figures. The marketers and designers can then work on the survey with the analyst, making sure that every question is structured so that it feeds the answer into that dataset correctly. Just for fun, give a prize to the marketing person who comes closest to the real results.
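One way an analyst might check a draft survey against the planned presentation is to list the fields each figure needs and compare them with the fields the questions will capture. The figure names and field names below are invented for the example.

```python
# Fields each planned figure in the presentation depends on (assumed).
required_fields = {
    "households with children": {"children_at_home"},
    "readers aged 30 to 50": {"age"},
}

# Fields the draft survey would actually capture (assumed).
survey_fields = {"age", "total_children", "household_size"}


def unanswerable(required_fields, survey_fields):
    """Return each figure that cannot be produced, with its missing fields."""
    gaps = {}
    for figure, needs in required_fields.items():
        missing = needs - survey_fields
        if missing:
            gaps[figure] = sorted(missing)
    return gaps


gaps = unanswerable(required_fields, survey_fields)
```

Any figure that appears in `gaps` points to a question that should be added or reworded before the survey goes out.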

I hope you find these tips as useful as I have in my quest to turn marketing data into useful marketing information.