Levels of Representation, cont.

(Complications & Counter Measures)

INLS 525: Managing Electronic Records

Week 4 (2/5)

Presenter Notes

Mini-Assignment 3:

Files on Your Computer

Observations?

What were the main things you discovered from this exercise?

  • Things that reinforced your expectations.
  • Things that surprised you.

Mini-Assignment 3:

Files on Your Computer

Recordkeeping Implications

RIM should "ensure:

  • that adequate records are created to document business functions and meet administrative, legal, and other operational needs;
  • that recordkeeping requirements are analyzed and included when information systems are first developed;
  • that professionally sanctioned techniques are applied throughout the records life cycle;
  • that records are retained and disposed of based on analysis of their functions and value; and
  • that records of continuing value are preserved and accessible"

Stephens, David O. "Introduction: Status and Trends." Records Management: Making the Transition from Paper to Electronic. Lenexa, KS: ARMA, 2007. p.1 (emphasis in original)

Mini-Assignment 3:

Files on Your Computer

Recordkeeping Implications

Pretend someone whose records are important to you has just allowed you to use WinDirStat / Disk Inventory X to analyze the contents of a computer that he/she uses.

  • Which of the objectives on the previous slide could be supported by the kind of information you've just discussed? How?
  • What further information would you want, which is missing from the WinDirStat / Disk Inventory X views?

Representation Information

  • Representation and interpretation are complementary — interpretation depends on representation
  • Can have multiple interpretations of same representation, but...
  • Some representation schemes make certain transformations and uses easier than others

Every digital object is concurrently:

  • Physical object — "inscription of signs on some physical medium"
  • Logical object — "recognized and processed by software"
  • Conceptual object — "recognized and understood by a person, or in some cases recognized and processed by a computer application capable of executing business transactions"

Thibodeau, Kenneth. "Overview of Technological Approaches to Digital Preservation and Challenges in Coming Years." In The State of Digital Preservation: An International Perspective, 4-31: Council on Library and Information Resources, 2002.

Bitstreams mean nothing w/o context

A bitstream can represent any type of information.

Rothenberg, Jeff. "Ensuring the Longevity of Digital Information." Washington, DC: Council on Library and Information Resources, 1999.
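
As a minimal sketch of this point (not part of the original slides), the Python snippet below reads the same four bytes under several different interpretations. The byte values and variable names are chosen only for illustration; nothing in the bits themselves says which reading is the intended one.

```python
import struct

# Four arbitrary bytes chosen for illustration.
bits = bytes([0x42, 0x49, 0x54, 0x53])

as_text = bits.decode("ascii")            # read as ASCII characters
as_uint = int.from_bytes(bits, "big")     # read as one big-endian unsigned integer
as_float = struct.unpack(">f", bits)[0]   # read as one big-endian 32-bit float
as_pixels = list(bits)                    # read as four 8-bit grey levels

print(as_text, as_uint, as_float, as_pixels)
```

Each result is a valid reading of the same bitstream; choosing among them requires information that lives outside these four bytes.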

Representation Information (RI)

  • "Information that maps a Data Object into more meaningful concepts" (OAIS) — makes humanly-perceptible properties happen
  • Examples: file format, encoding scheme, data type

Data object -Interpreted using-> Representation Information -Yields-> Information Object

Reference Model for an Open Archival Information System (OAIS). Consultative Committee for Space Data Systems, 2002. Figure 2-2.

RI can Reside in Many Places

  • Within digital object itself
  • Stored separately as metadata
  • Encoded within software required to read & parse the digital object (to discuss later)

Let's look inside some files...

A Web Page

Hex/ASCII view of a webpage

A PDF

Hex/ASCII view of a PDF

Identifying File Types

  • Magic numbers & file signatures
  • File extensions
  • Metadata stored in file system (e.g. Resource Fork)
  • MIME types (reported by webservers)

Magic Numbers & File Signatures

  • Distinct string or pattern that is found within files of a given type (most often in the header)
  • The most effective searches for magic numbers often use regular expressions (e.g. with grep) so that one search can match multiple variations of a pattern
  • Utilities that use this: file (Unix), TrID, DROID

Examples of file signatures: Word, PDF, JPEG, & ZIP
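
A signature check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `SIGNATURES` and the function `sniff` are hypothetical names, and the table holds just four widely documented leading byte patterns. Real identification tools consult much larger signature databases and also look beyond the first bytes.

```python
# A few well-known leading byte patterns (illustrative, far from complete).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container (also docx, xlsx, odt, ...)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy Word .doc)"),
]

def sniff(header: bytes) -> str:
    """Return a label for the first signature that the header starts with."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"

print(sniff(b"%PDF-1.4\n"))                     # PDF
print(sniff(b"\xff\xd8\xff\xe0\x00\x10JFIF"))   # JPEG
print(sniff(b"plain text"))                     # unknown
```

To try it on a real file, pass in the first few bytes, for example `sniff(open(path, "rb").read(16))`, where `path` is whatever file you want to inspect.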

File Extensions

  • Changing file extension usually changes default application that OS uses to open (i.e. associates with) the file
  • The "8.3" (eight characters, followed by three-character extension) limit in the past — based on FAT — resulted in many creative uses of the extension portion of file name (e.g. reports1.994, april-94.rpt)
  • Convention is often still to use only three letters
  • No authority for standardizing use, so three-letter extensions are often shared by many formats
  • Security risks associated with trusting the file extension to be accurate — malicious code masquerading as another type of file (e.g. viruses sent as email attachments)
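
The security point in the last bullet can be made concrete by comparing what the extension claims with what the first bytes say. The sketch below is an assumption-laden illustration: `EXPECTED_MAGIC` and `extension_matches_content` are hypothetical names, and the table covers only three extensions.

```python
import os

# Hypothetical, deliberately tiny table: extension -> expected leading bytes.
EXPECTED_MAGIC = {
    ".pdf": b"%PDF-",
    ".jpg": b"\xff\xd8\xff",
    ".zip": b"PK\x03\x04",
}

def extension_matches_content(filename: str, header: bytes):
    """Return True/False for known extensions, None when we cannot tell."""
    ext = os.path.splitext(filename)[1].lower()
    magic = EXPECTED_MAGIC.get(ext)
    if magic is None:
        return None
    return header.startswith(magic)

# A Windows executable starts with "MZ", whatever its file name claims.
print(extension_matches_content("invoice.pdf", b"MZ\x90\x00"))   # False
print(extension_matches_content("invoice.pdf", b"%PDF-1.5\n"))   # True
print(extension_matches_content("april-94.rpt", b"anything"))    # None
```

The `None` case matters for recordkeeping: an unrecognized extension such as `.rpt` is not evidence of a problem, only a sign that more representation information is needed.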

Layered Formats

Change a docx extension to zip; open in a zip viewer; open "document.xml".
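
The same exercise can be approximated in code. The sketch below does not open a real Word file; it builds a stand-in package in memory (an assumption made so the example is self-contained) and then peels back the layers with Python's standard `zipfile` module. In real .docx files the main document part is typically stored as `word/document.xml`, which is the path used here.

```python
import io
import zipfile

# Build a stand-in package in memory. This is NOT a valid Word file; it only
# mimics the layering (ZIP container -> XML part -> text).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("word/document.xml",
                     "<document><p>Hello, records!</p></document>")

raw = buffer.getvalue()
print(raw[:4])                       # the ZIP signature, b'PK\x03\x04'

# Reopen the bytes as a ZIP archive and read the XML layer inside it.
with zipfile.ZipFile(io.BytesIO(raw)) as package:
    print(package.namelist())
    xml_text = package.read("word/document.xml").decode("utf-8")
print(xml_text)
```

To inspect an actual document, replace `io.BytesIO(raw)` with the path to a .docx file; no renaming is needed because `zipfile` looks at the content rather than the extension.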

Finding RI Outside the File

Registries of representation information types (file formats):

  • PRONOM
  • Digital Formats Site — Library of Congress
  • Global Digital Format Registry (GDFR) — Harvard, with Mellon funding
  • Unified Digital Format Registry (UDFR) — NDIIPP, combines PRONOM & GDFR
  • Representation Information Registry — Digital Curation Centre (UK)

Tools & frameworks...

...for monitoring, identifying & addressing obsolescence of representation information:

  • JHOVE (JSTOR/Harvard Object Validation Environment) — identify, validate, characterize
  • DROID (Digital Record Object Identification; uses PRONOM) — identify
  • FFIdent — identify
  • File Utility (Windows) — identify
  • FITS (File Information Tool Set) — wraps other tools
  • PANIC (Preservation web service Architecture for New media & Interactive Collections)
  • AONS (Automated Obsolescence Notification System)
  • kopal Library for Retrieval and Ingest (koLibRI)

Complications:

  • Proprietary or insufficiently documented formats & obsolescence
  • Compression
  • Encryption

Obsolescence

Those who forget the past are condemned to reload it.
— Nick Montfort, July 2000

All layers undergo change over time, at varying rates.

"Long-Term"

"A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository." (OAIS, emphasis added)

Risks Associated with Obsolescence

  • Vendor Lock-In
  • Legacy Data
  • Need for "Digital Archeology"

Compression (e.g. Run Length Encoding)

Run-length compression example.

Rothenberg, Jeff. "Ensuring the Longevity of Digital Information." Washington, DC: Council on Library and Information Resources, 1999.
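
The figure from Rothenberg is not reproduced in this text, so the following is a generic sketch of run-length encoding rather than his exact example. The function names and the sample row of "pixels" are hypothetical.

```python
from itertools import groupby

def rle_encode(text: str) -> list:
    """Collapse each run of identical characters into a (count, char) pair."""
    return [(len(list(run)), char) for char, run in groupby(text)]

def rle_decode(pairs: list) -> str:
    """Expand (count, char) pairs back into the original string."""
    return "".join(char * count for count, char in pairs)

row = "WWWWWWBBBWWWW"          # hypothetical row of white/black pixels
encoded = rle_encode(row)
print(encoded)                 # [(6, 'W'), (3, 'B'), (4, 'W')]
assert rle_decode(encoded) == row
```

The encoded form is only useful to someone who knows the scheme: the pair `(6, 'W')` means "six white pixels" here, and that convention is itself representation information.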

3 Levels of Compression

  • Format of file implements compression internally — e.g. body of JPEG file is compressed but not file header
  • Application creates completely new, compressed copy of file(s) — e.g. WinZip, gzip
  • File system compresses data units — e.g. not writing data to series of sectors that are all filled with zeros

    Carrier, Brian. File System Forensic Analysis. Boston, MA: Addison-Wesley, 2005.
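
The second level, where an application writes a completely new compressed copy, can be sketched with Python's standard `gzip` module. The sample data is made up and chosen to be highly repetitive so that the size difference is obvious.

```python
import gzip

original = b"0" * 10_000                  # highly repetitive sample data
compressed = gzip.compress(original)      # a new, separate byte stream

print(len(original), len(compressed))     # sizes before and after
print(compressed[:2].hex())               # gzip's own magic number: 1f8b
assert gzip.decompress(compressed) == original
```

Note that the compressed stream begins with gzip's signature rather than the signature of whatever it contains, so the format of the original content cannot be identified until the outer layer is removed.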

Encryption

  • Special data ("keys") and algorithms used to transform data into a form that is purposely less easy to read
  • Used for:
    • Confidentiality
    • Integrity
    • Non-repudiation
    • Authentication
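
Two of these uses, integrity and authentication, can be illustrated without encrypting anything. The sketch below uses a keyed hash (HMAC) from Python's standard library, which is a related technique and not encryption itself: it provides no confidentiality, and because the key is shared it does not provide non-repudiation either. The key and record text are invented for the example, and a hard-coded key is acceptable only in a demonstration.

```python
import hashlib
import hmac

key = b"demo-key-do-not-reuse"             # hypothetical shared secret
record = b"Minutes of the 2/5 meeting"     # hypothetical record content

# The tag depends on both the key and every byte of the record.
tag = hmac.new(key, record, hashlib.sha256).hexdigest()

def verify(key: bytes, data: bytes, expected_tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    actual = hmac.new(key, data, hashlib.sha256).hexdigest()
    return hmac.compare_digest(actual, expected_tag)

print(verify(key, record, tag))                  # True
print(verify(key, record + b" (edited)", tag))   # False
```

Anyone who later holds the key can detect that the record was altered, but only for as long as the key itself is preserved and understood.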

Encryption at Various Levels

  • Application that creates the file
  • Application that reads an unencrypted file & creates an encrypted file
  • Operating System — "Before a file is written to disk, the OS encrypts the file and saves the cipher text to the data units. The non-content data, such as the file name and last access time, are typically not encrypted. The application that wrote the data does not know the file is encrypted on the disk."
  • Encrypt an entire volume — implemented in storage system below file system level

    Carrier, Brian. File System Forensic Analysis. Boston, MA: Addison-Wesley, 2005.

LOC's 7 Sustainability Factors

  1. Disclosure. Degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content.
  2. Adoption. Degree to which the format is already used by the primary creators, disseminators, or users of information resources.
  3. Transparency. Degree to which the digital representation is open to direct analysis with basic tools, such as human readability using a text-only editor.
  4. Self-documentation. Self-documenting digital objects contain basic descriptive, technical, and other administrative metadata.
  5. External Dependencies. Degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments.
  6. Impact of Patents. Degree to which the ability of archival institutions to sustain content in a format will be inhibited by patents.
  7. Technical Protection Mechanisms. Implementation of mechanisms such as encryption that prevent the preservation of content by a trusted repository.

Characterize

"To describe or delineate the character or peculiar qualities of (a person or thing)"

Oxford English Dictionary, Second Edition, 1989

Characterizations = Surrogate Representations

  • Assertions about specific aspects of the content or character of a digital object
  • Can be:
    • Very specific and machine-readable (e.g. file = 25 bytes; character-encoding = UTF-8, image is 48 pixels wide)
    • Relatively vague and human-readable (e.g. "This is a video from YouTube"; "I was able to view this on a Macintosh")
    • Or somewhere in between
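
A characterization at the "very specific and machine-readable" end of this range might look like the sketch below. The function `characterize` and the properties it reports are illustrative choices, not the output format of JHOVE, DROID, or any other tool named in these slides.

```python
import hashlib
import json

def characterize(data: bytes) -> dict:
    """Return a few machine-readable assertions about a byte string."""
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "valid_utf8": valid_utf8,
    }

sample = "Levels of Representation".encode("utf-8")   # stand-in for file content
print(json.dumps(characterize(sample), indent=2))
```

Even these assertions are surrogates: passing the UTF-8 check shows that the bytes can be read as UTF-8, not that the creator intended them to be.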

What to Characterize?

Significant Properties

Whoever takes the decision that a particular digital object should be preserved will have to decide what properties are to be regarded as significant. The submission agreement could usefully specify a list of significant properties. (CEDARS)

Holdsworth, David, and Derek M. Sergeant. "A Blueprint for Representation Information in the OAIS Model." Paper presented at the IEEE Symposium on Mass Storage Systems, College Park, Maryland, USA, March 27-30, 2000.

"properties of digital objects that affect their quality, usability, rendering, and behaviour" (CAMiLEON)

Hedstrom, Margaret, and Christopher A. Lee. "Significant Properties of Digital Objects: Definitions, Applications, Implications." In Proceedings of the DLM-Forum 2002, Barcelona, 6-8 May 2002: @ccess and Preservation of Electronic Information: Best Practices and Solutions, 218-27. Luxembourg: Office for Official Publications of the European Communities, 2002.

Essence = "characteristics that must be preserved for the record to maintain its meaning over time"

Heslop, Helen, Simon Davis, and Andrew Wilson. "An Approach to the Preservation of Digital Records." National Archives of Australia, 2002.

Defining Significant Properties

  • Writing specific provisions into submission agreements
  • Developing criteria & empirical tools for evaluating preservation approaches
  • Documentation of preservation decisions in terms of specific properties
    • allowing professionals to revisit previous decisions
    • indicating to researchers what properties have not been retained

Emulation vs. Migration (Traditionally)

  • Emulation — Use of software to imitate obsolete computer equipment on new computer equipment, i.e. trick files and applications into thinking they're still running in their original environment
  • Transformation/Migration — Digital object that depends on obsolete computer equipment is changed in order to run directly on new equipment

Emulation

"To reproduce the action of or behave like (a different type of computer) with the aid of hardware or software designed to effect this; to run (a program, etc., written for another type of computer) by this means."

Oxford English Dictionary, Second Edition

Migration

  • Periodic transformation of the bits/bytes to run directly on newer platforms.
  • Used widely as an approach to actively managing legacy systems.
  • Work can be expensive and introduce errors of translation.
  • Since the resulting objects can run directly on newer platforms, layers of technology can be minimized.
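
A very small migration can be sketched as a change of character encoding. This is a toy example under stated assumptions: the legacy text is invented, Latin-1 stands in for the "old" representation, and the character sequence is the one property treated as significant.

```python
import hashlib

# Hypothetical legacy object: text stored in the Latin-1 encoding.
# (\u00e9 is the letter e with an acute accent.)
original_bits = "R\u00e9sum\u00e9 for the caf\u00e9 file".encode("latin-1")

# Migration step: decode with the old representation information,
# then re-encode with the new one.
text = original_bits.decode("latin-1")
migrated_bits = text.encode("utf-8")

# The bits change, so checksums of the two versions differ...
same_digest = (hashlib.sha256(original_bits).digest()
               == hashlib.sha256(migrated_bits).digest())
print(same_digest)                              # False

# ...while the property chosen as significant is preserved.
print(migrated_bits.decode("utf-8") == text)    # True
print(len(original_bits), len(migrated_bits))   # 24 27
```

Decoding with the wrong source encoding would still produce output, just the wrong characters, which is one way the "errors of translation" mentioned above can slip in unnoticed.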

Not Just "Emulation vs. Migration"

  • All strategies use standards in some way
  • General consensus to keep original bits
  • Transformation can be minor or extensive
  • Transformation/Emulation can take place
    • in the Producer environment,
    • upon Ingest,
    • as part of preservation activities within a repository,
    • or at time of access
