In IT, I’ve always been taught that data is unprocessed information or raw facts. Even in my accounting studies, I collect and record data from various financial transactions, which I then analyze and compute to generate information. For example, an amount of $1,000,000 in the sales ledger by itself has little meaning to me, managers, or company owners. However, through computation and analysis, I can utilize this number and other elements of data in the accounting cycle to determine figures like profit and loss, which may aid forecasting and decision-making.

In the humanities, however, “data” is not a discrete, fungible unit on which one can perform operations to generate an outcome. This raises the question: what exactly is data in the humanities?

Digital data, at its fundamental level, is binary (1s and 0s) and can be represented in linear, hierarchical, or multi-relational structures such as arrays, XML files, and graph-based databases.
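To make these three shapes concrete for myself, I put together a minimal sketch in Python. The variable names, the `neighbours` helper, and the sample values are my own illustrative assumptions, not taken from any real dataset or library.

```python
# Illustrative sample data only; the names and values are assumptions.

# Linear: an array (a Python list) holds items in a single sequence.
lines = ["Shall I compare thee", "to a summer's day?"]

# Hierarchical: nested dictionaries form a tree, much like an XML document.
book = {
    "title": "Sonnets",
    "chapters": [
        {"heading": "Sonnet 18", "paragraphs": lines},
    ],
}

# Multi-relational: a graph stores entities as nodes and
# named relationships between them as edges.
edges = [
    ("Shakespeare", "wrote", "Sonnets"),
    ("Sonnets", "published_in", "London"),
    ("Shakespeare", "lived_in", "London"),
]


def neighbours(node, edge_list):
    """Return every (relation, target) pair that starts at the given node."""
    return [(rel, dst) for src, rel, dst in edge_list if src == node]


first_line = lines[0]                                 # access by position
first_heading = book["chapters"][0]["heading"]        # access by path
shakespeare_links = neighbours("Shakespeare", edges)  # access by relationship
```

The point of the sketch is that the same material can be reached by position in a list, by a path through a hierarchy, or by following relationships in a graph.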

Structured, semi-structured, and unstructured data can all be found in the humanities. In a database, structured data consists of key-value pairs in which unique primary keys identify and manage the other values in a record, making retrieval easier. Data in XML files is considered semi-structured. Although I am unfamiliar with XML, my prior knowledge of HTML has helped me understand how tags are used in XML, as in HTML, to mark up a document and identify logical structures such as chapters, headers, and paragraphs. Through my studies of Digital Humanities, I have come to understand that the markup used in XML and HTML is comparable to annotating a book with a pen or using the ‘track changes’ tool in Microsoft Word.
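As a rough sketch of the difference, the Python below retrieves a structured record by its primary key and then reads a small marked-up document with the standard library's `xml.etree.ElementTree` module. The record IDs, the tag names (`book`, `chapter`, `head`, `p`), and the sample text are assumptions I chose for illustration and do not follow any particular schema.

```python
import xml.etree.ElementTree as ET

# Structured: each record sits under a unique primary key (hypothetical IDs).
records = {
    101: {"title": "Pride and Prejudice", "year": 1813},
    102: {"title": "Frankenstein", "year": 1818},
}
record = records[102]  # retrieval by primary key

# Semi-structured: tags mark up the logical parts of a document.
xml_text = """
<book>
  <chapter n="1">
    <head>Chapter One</head>
    <p>First paragraph of the chapter.</p>
    <p>Second paragraph of the chapter.</p>
  </chapter>
</book>
"""

root = ET.fromstring(xml_text)

# Collect the text of every header and paragraph anywhere in the tree.
heads = [h.text for h in root.iter("head")]
paragraphs = [p.text for p in root.iter("p")]

# Attributes such as n="1" hold extra information about an element.
chapter_numbers = [c.get("n") for c in root.findall("chapter")]
```

Here the tags play the role of the pen annotations: they do not change the words of the text, but they tell a program which words form a header and which form a paragraph.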

Structured and semi-structured data can both be categorized as “smart” data, provided they are clean and limited in volume. In digital data, the term “clean” refers to data with as few flaws from the collection process as possible. Smart data includes raw data together with markup, annotations, and metadata. Because these elements are created individually, producing smart data is time-intensive, which limits its volume. Furthermore, smart data in text form is often encoded according to the guidelines of the Text Encoding Initiative (TEI), which I find to be one of the most confusing and difficult elements of the broad classifications of data in the humanities, and which I believe might be one of the reasons for the limited amount of smart data.

Finally, unstructured data, as the name indicates, does not follow a predefined schema or structure; the body of an email is one example. Unstructured data is often associated with big data, which is vast in volume, diverse, and produced continually by sensors or as a by-product of people’s actions.

These classifications have given me a better understanding of what data is in the humanities. So, to answer the question: data in the humanities can be digital, but it must reflect some component or object of humanistic research such as history, art, philosophy, or literature. It is selectively created and developed using features such as markup, annotations, and metadata. It includes humanists using cultural artefacts like books to comprehend another culture or time period, or literary academics using digital technologies and their knowledge of other eras and cultures to construct the meaning of a text.

The growing digitization of data in the humanities has increased the need for data management tools such as Zotero, a bibliographic and citation management application. While using Zotero, I tried its collaborative function by joining a group for a digital humanities class to share resources. Shared group libraries like this can also support open access, as people can view other humanities research findings. Lastly, Zotero stores metadata for each item, which supports the digital preservation of historical knowledge and information.
