Click here for a printer-friendly version of this article.
The Investigator #1: "Data Sets: A Research Primer"

Download The Investigator #1: "Data Sets on Volunteerism: A Research Primer" (Adobe PDF format)

What is a data set?

A data set is a collection of information, typically about a certain topic. Most of the data sets used in the Investigator series have several thousand observations. Data sets are also referred to as a sample of the population, or of the truth. For example, in order to know the true volunteering rate in the United States we must interview every person. However, a sample of the US population, asks a few people that are hopefully representative of the US population. A data set would not be representative if it only contains individuals that live in big cities, for example. From this sample of the population, we calculate the average number of people that volunteer. This is an estimate of the true number of people that volunteer. Depending on the size of the data set, we can say how certain we are that this estimate is close to the truth. The more observations that are available, or the larger the data set, the more accurate are estimates about the population.

Why do data sets matter?

As mentioned in the example above, if a data set is not representative of the population of interest, the US in our case, then the estimates from the data set will not be close to the truth. Often, characteristics of the observations, such as the level of education, number of children, and gender, for given geographic regions, are compared with the US Census. If the characteristics are close then the data set is probably representative of the US. (However, you may recognize that the US Census is also a data set. But, in most cases, it is much larger than any other data set and so is much closer to the truth.)

To collect data, survey designers typically create a questionnaire that is distributed to individuals for example, via phone-calls or person-to-person interviews. The wording of the questionnaire may illicit different responses. For example, a questionnaire that asks individuals about volunteering three times during the interview is likely to report a higher volunteering rate than a questionnaire that asks only once. Since the question "Did you volunteer last year?" requires individuals to recall information, asking the same question multiple times increases the chances of remembering.

Other concerns over data sets abound.

What are the characteristics of a high quality data set?

Some characteristics are:

  • The sample (or data set) is representative of the population of interest. This includes when the data set was conducted. Recent data sets are likely to be a better sample of the current population than a data set conducted 20 years ago.
  • The more observations, the better the estimates.
  • Good questions.

Alphabet Soup for Understanding Research

Every field has jargon and acronyms. Researchers are no exception. This glossary provides a quick reference for some of the abbreviations you may encounter with the Investigator series.

Please let us hear from you. You may email the director of the Investigator series with your questions and comments at investigator@rgkcenter.org