What is data?

Data are informational values (numbers, text, images, ...) that are used in research, business, policy, and other areas, usually with additional context.

A dataset is a defined, intentional collection of data points (informational values) with at least a minimal description. For example, if information were collected about how many times students used computers around campus, the resulting dataset might be a spreadsheet with column labels such as start time, end time, duration, day of the week, date, and location, along with the values recorded for each student use.
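To make the spreadsheet example concrete, here is a minimal sketch in Python of what such a dataset might look like and how its column labels give meaning to the values. The three rows, the location names, and the function names are hypothetical, invented only for illustration; they are not real usage data.

```python
import csv
import io

# Hypothetical sample values, invented for illustration only.
# The first line holds the column labels; each later line is one student use.
SAMPLE_CSV = """start time,end time,duration,day of the week,date,location
09:00,09:45,45,Monday,2024-01-08,Library
13:30,14:00,30,Monday,2024-01-08,Science Hall
10:15,11:15,60,Tuesday,2024-01-09,Library
"""


def load_dataset(text):
    """Parse CSV text into a list of rows, each a dict keyed by column label."""
    return list(csv.DictReader(io.StringIO(text)))


def total_minutes_by_location(rows):
    """Add up the 'duration' column (assumed to be minutes) for each location."""
    totals = {}
    for row in rows:
        location = row["location"]
        totals[location] = totals.get(location, 0) + int(row["duration"])
    return totals


rows = load_dataset(SAMPLE_CSV)
totals = total_minutes_by_location(rows)
print(totals)
```

Without the column labels, a value such as 45 would be ambiguous. The labels are the "minimal description" that turns a pile of values into a dataset.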

A curated dataset includes further information, such as a README or description document that explains the purpose for collecting the information, why and how it was collected in this way, and how the data were analyzed or used after collection.

A data repository houses datasets. Data repositories may be openly accessible or restricted; they may have an open submission policy, or they may be focused on a particular topic, purpose, or community; they may preserve datasets for the long term; and they may provide additional tools or resources. Two common types of repositories exist:

  • Institutional: Hosted by a particular college or university, for example, DigitalCommons@CSB/SJU.
  • Disciplinary: Created and maintained by federal funding agencies or professional organizations.

Open data, increasingly provided by researchers and organizations, refers to datasets that are shared in some way with the general public. At a minimum, a detailed description of a dataset is provided, and some datasets may be viewable, downloadable, and/or shared in a way that allows for their re-use by others. Sometimes ethical, legal, or other restrictions prevent sharing of data, but open data initiatives are making more and more information available for all of us to learn from or use. 

Data citations reference a dataset that has been published or described and made publicly available, and they credit the source of the dataset.