Generative AI & Archival Preservation: The Hybrid Microfilm Case

1. Executive Summary

Generative AI is triggering a structural crisis in archival preservation theory that the global national archives community has not yet fully confronted. Three converging forces define this crisis: (1) the model collapse problem — where AI trained recursively on AI-generated data produces degraded, homogeneous outputs, threatening the integrity of the digital record corpus that future AI systems and historians will draw upon; (2) the authenticity and provenance collapse — where deepfakes, synthetic documents, and AI-mediated workflows erode the foundational trust that archives are obligated to maintain; and (3) the born-digital volume paradox — where the exponential growth of AI-generated records has made appraisal, ingestion, and preservation at scale impossible without AI, yet AI itself introduces authenticity risks that undermine the very records it helps process.

The community response, as evidenced by the Library of Congress's February 2026 Call-to-Action white paper, the InterPARES Trust AI project, the revised OAIS Reference Model (ISO 14721:2025), and NARA's evolving Digital Preservation Framework, is overwhelmingly digital-centric and insufficient for century-scale archival survival. The evidence strongly supports re-evaluating the document information lifecycle model to identify the stage at which analogue preservation, specifically archival microfilm, should intervene as the permanent, AI-tamper-resistant preservation master.

2. Research Question & Scope

Primary question: How is generative AI destabilising archival document preservation globally, and at what specific stage of the document information lifecycle should a hybrid non-digital medium (microfilm) be deployed to guarantee permanent, authentic, and AI-resilient preservation?

Scope: Peer-reviewed research, major institutional reports (NARA, Library of Congress, DPC, InterPARES), ISO standards revisions, and archival science theory published 2020–2026. Excludes Micrographics Data publications.

3. Key Findings

3.1 The Model Collapse Problem: Corruption of the Digital Record Corpus

Research published in Nature by Shumailov et al. (2024) demonstrates that training generative AI indiscriminately on both real and AI-generated content — typically achieved by scraping data from the internet — can lead to a collapse in the model's ability to generate diverse, high-quality output. This is not merely a machine learning problem. For archivists, it represents an existential threat to the integrity of the future documentary record: as AI-generated text, images, and documents proliferate across the open web and institutional repositories, the authentic original sources against which historical records are verified are being progressively contaminated.

The paper attributes this degradation to a feedback loop: each generation of models relies more heavily on lower-quality synthetic data, so errors compound over time. The reliability of AI systems therefore declines as synthetic data becomes more prevalent in training datasets.
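The feedback loop can be illustrated with a toy sketch. It is not the experiment from the Nature paper: it assumes a one-dimensional Gaussian "model", and it replaces random sampling with evenly spaced quantiles so the result is reproducible. The function names and parameters are hypothetical. Each generation is fitted only to data produced by the previous generation, and because a finite sample under-represents the tails, the fitted spread shrinks every time.

```python
from statistics import NormalDist, fmean, pstdev


def next_generation(mu, sigma, n_samples):
    """Fit a Gaussian to data produced by the previous Gaussian model.

    Samples are taken at evenly spaced mid-point quantiles. This is a
    deterministic stand-in (an assumption of this sketch) for a finite
    random sample: like any finite sample, it misses the extreme tails.
    """
    model = NormalDist(mu, sigma)
    samples = [model.inv_cdf((i + 0.5) / n_samples) for i in range(n_samples)]
    return fmean(samples), pstdev(samples)


def simulate(generations=10, n_samples=50):
    """Return the fitted standard deviation after each generation."""
    mu, sigma = 0.0, 1.0  # generation 0: the original "human" distribution
    history = [sigma]
    for _ in range(generations):
        mu, sigma = next_generation(mu, sigma, n_samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    for generation, spread in enumerate(simulate()):
        print(f"generation {generation}: std = {spread:.4f}")
```

The spread falls with every generation, a simplified picture of how rare but authentic material drops out of a recursively trained corpus.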

The finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data of clearly human-generated origin. For archives, this means the window in which AI can be reliably used to process and appraise historical records — without contaminating those records with synthetic interpretations — is closing. The implication for national archives is that the born-digital corpus being ingested today is increasingly of uncertain provenance, and the tools being used to process it are themselves susceptible to model collapse effects.

3.2 The Authenticity and Provenance Crisis in Archives

The Libraries, Archives and Museums (LAMs) community's main currency is trust — specifically the trust that the collections preserved and made accessible are what they say they are. AI technologies have captured the imagination of archivists and scholars in processing and accessing collections at scale. However, at the same time, essential Content Authenticity and Provenance (CAP) principles are at risk of being overlooked. The erosion of public trust in digital content distributed through news and social media is being observed in real time as more examples of actual or suspected deepfakes appear daily.

Released in February 2026 as a product of the C2PA for G+LAM Community of Practice, the white paper co-authored by Kate Murray (Library of Congress) and Joshua Sternfeld advocates for libraries, archives, and museums to take proactive and pragmatic steps to ensure that digital collections content — especially content impacted by AI at any point in its lifecycle — remains authentic, transparent, and verifiable from creation through access.

The authors argue that no single standard, tool, or institution can resolve AI's challenge to content authenticity and provenance. Instead, safeguarding robust CAP practices will require creative strategies grounded in human judgment and field-wide collaboration. This is a remarkable admission from the Library of Congress: the digital preservation community's own flagship institution is publicly acknowledging that its existing frameworks are insufficient.

A peer-reviewed paper in Archival Science (Springer, 2023) draws attention to how the transition from analogue to digital has scrambled core notions of the archival field, including the principle of provenance. In the digital environment, integrity is not a stable concept but a property of records verified by preservation of a change log over time, since it is inevitable that a digital document is constantly changed and reinterpreted. This is the crux of the problem: digital "integrity" is a procedural construct maintained by continuous active management — it has no physical analogue to the inherent tamper-evidence of silver-gelatin microfilm.
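To make the "change log" notion of integrity concrete, the following minimal sketch chains each log entry to the previous one with a SHA-256 digest. It is an illustration only, not a method described in the cited paper; the entry layout and function names are assumptions.

```python
import hashlib
import json


def entry_digest(prev_digest, event):
    """Digest that binds an event to the digest of the entry before it."""
    payload = json.dumps({"prev": prev_digest, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event to the log, chained to the last recorded digest."""
    prev = log[-1]["digest"] if log else ""
    log.append({"event": event, "digest": entry_digest(prev, event)})


def verify_log(log):
    """Recompute every digest in order; any edited entry breaks the chain."""
    prev = ""
    for entry in log:
        if entry["digest"] != entry_digest(prev, entry["event"]):
            return False
        prev = entry["digest"]
    return True


if __name__ == "__main__":
    log = []
    append_event(log, "ingested TIFF master")
    append_event(log, "migrated TIFF to JPEG 2000")
    print(verify_log(log))  # chain intact
    log[0]["event"] = "ingested JPEG master"
    print(verify_log(log))  # chain broken by the edit
```

The check is procedural: it only holds while someone keeps running it, and anyone able to rewrite the whole log can recompute the digests unless the latest digest is also recorded somewhere outside the system.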

The deepfake and synthetic document threat has accelerated dramatically: reported cases of digital document forgery rose by 244 percent in a single year, and a deepfake attack occurred every five minutes in 2024. The erosion of the "seeing is believing" standard, now morphing into the "liar's dividend", threatens official records as well as public media.

According to Inscribe's 2026 Document Fraud Report, one in every 16 documents analysed is fraudulent, and AI-generated fraud has grown five-fold in eight months. Deloitte estimates that GenAI fraud losses in the United States will climb from USD 12.3 billion in 2023 to USD 40 billion by 2027.
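As a rough arithmetic check on these figures, the sketch below derives the annual growth rate implied by the two Deloitte endpoints and converts the one-in-16 ratio to a percentage. It assumes simple compound growth between 2023 and 2027, which is a simplification introduced here and not necessarily how the estimate was built; the function name is hypothetical.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints quoted in the text (USD billions); compounding is an assumption.
rate = implied_annual_growth(12.3, 40.0, 2027 - 2023)
fraud_share = 1 / 16  # "one in every 16 documents"

if __name__ == "__main__":
    print(f"implied annual growth: {rate:.1%}")
    print(f"share of documents flagged as fraudulent: {fraud_share:.2%}")
```

Under that assumption the endpoints imply growth of roughly 34 percent a year, and one in 16 corresponds to 6.25 percent of documents analysed.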

3.3 The Born-Digital Volume Paradox: AI as Both Threat and Necessity

The volume and complexity of digital information, and the rate at which it is generated, mean that computational appraisal using tools such as AI is no longer a choice but a necessity. Born-digital government records must be reviewed to select historically significant documents for preservation and to delete ephemeral information, a process that cannot be done manually.

Yet the same AI tools being deployed to manage this volume introduce the authenticity threats documented above. The InterPARES Trust AI project notes that AI technologies pose risks to authenticity and privacy in archival practices, necessitating robust governance frameworks, and that generative AI offers new methodologies for archival work but risks undermining trust in records if not handled responsibly.

NARA's October 2024 Strategic Framework announcement explicitly describes the agency's use of AI for metadata capture in microfilm digitisation, safeguarding personally identifiable information, and natural-language search queries in digitised records, demonstrating that AI is now embedded throughout the archival processing workflow. The paradox is that NARA has mandated a fully digital records environment (as of June 2024) while deploying AI that itself introduces provenance risks to those records.

3.4 The OAIS Framework Revision: Acknowledging the New Reality

ISO 14721:2025 (OAIS Reference Model, third edition) was published in December 2024, cancelling and replacing the 2012 second edition. The revision includes additions to and clarifications of concepts and terminology, with changes described as too numerous to permit meaningful markup — a substantial technical revision reflecting the transformation of the preservation environment.

The standard defines long-term preservation as "long enough to be concerned with the impacts of changing technologies, as well as support for new media and data formats" — and explicitly notes that "long term may extend indefinitely." The model accommodates information that is inherently non-digital (such as a physical sample), and its modelling and preservation of such information is explicitly within scope. This is significant: the OAIS Reference Model — the foundational standard of all digital preservation practice globally — explicitly accommodates non-digital preservation objects. The preservation community has systematically ignored this provision.

The December 2024 OAIS update emphasises the importance of clearly defined preservation objectives, a substantive change aimed at enhancing the model's applicability and effectiveness.

3.5 The Digital Preservation Community's Own Acknowledgment of Risk

The Digital Preservation Coalition (DPC) points to the fragility of digital data, noting that metadata is essential for understanding the age and content of digital collections but is easily damaged by careless replatforming of systems, by data transfer, or by corruption, and is therefore often absent or unreliable. Digital data is also not immutable: bit decay, format corruption, and obsolescence can all place digital records at risk of loss.
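Bit decay is normally detected with routine fixity checks. The sketch below is a minimal illustration using SHA-256 from Python's standard library. The holdings, manifest layout, and function names are hypothetical and not drawn from DPC guidance.

```python
import hashlib


def fixity(data: bytes) -> str:
    """SHA-256 digest recorded at ingest and recomputed at each audit."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate bit decay by inverting a single bit."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def audit(holdings, manifest):
    """Return the names whose current digest differs from the manifest.

    A file with no manifest entry is also reported, since its integrity
    cannot be confirmed.
    """
    return sorted(
        name for name, data in holdings.items()
        if fixity(data) != manifest.get(name)
    )


if __name__ == "__main__":
    holdings = {"minutes.txt": b"Council minutes", "deed.txt": b"Land deed"}
    manifest = {name: fixity(data) for name, data in holdings.items()}
    holdings["deed.txt"] = flip_bit(holdings["deed.txt"], 3)
    print(audit(holdings, manifest))
```

A check like this reports that a file has changed but cannot restore it, and it depends on the manifest itself surviving intact, which is the metadata fragility described above.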

The DPC has warned explicitly that if AI-generated content is not carefully managed, examples of AI being used to simulate text and images — which appear convincing but are fictitious — will find their way into official and legal processes and form a completely erroneous historical record. The volume of AI-generated content coming into archives represents a challenge with "profound implications."

NARA's own 2022–2026 Digital Preservation Strategy acknowledges the need to "analyze file formats and media formats that are received and determine potential obsolescence" on an ongoing basis, and to "migrate holdings onto new preservation storage media over time to mitigate media obsolescence risks." This perpetual migration cycle — the fundamental structural weakness of all-digital preservation — is candidly described by NARA itself as an ongoing operational requirement with no foreseeable end. By contrast, paper materials and printed media migrated to microform can be accessible for centuries if created and maintained under ideal conditions, compared to mere decades of physical stability offered by magnetic tape and disk or optical formats.

3.6 The Records Lifecycle Model and Where Non-Digital Preservation Must Intervene

The 'Records Continuum' theory, the dominant modern archival framework, integrates records management and archival practices to ensure the preservation of context and authenticity. Unlike the lifecycle model, which draws clear boundaries between the stages of a record's activity, use, and management, the continuum makes no distinction between a record's value at creation and its historic value in the archive. Its simultaneous focus on use and preservation aligns with the need to act quickly to preserve digital information.

The custodial records lifecycle model views records as physical tangible objects whose life is equated to that of a living organism with segmented stages — from birth and use through to rebirth as archives. In this outlook, the medium which carries the record is prominent. This is precisely the theoretical framework that supports the case for microfilm intervention: in the lifecycle model, the medium matters and must be selected deliberately at the point of archival transfer.

Peer-reviewed research published in AI & Society (Canning & Jaillant, 2025) argues that government professionals must work closely with archival institutions to ensure that important born-digital records are identified and preserved in the short term, before being transferred for long-term archiving and access. The article's keywords explicitly include "records lifecycle" and "records disposal." This framing implies that the intervention point for permanent preservation occurs at or before the point of archival transfer: the moment when active management gives way to century-scale custodianship.

3.7 The Hybrid Preservation Model: Convergent Evidence

The Council on Library and Information Resources (CLIR) established the theoretical foundation of the hybrid approach: the vision is to enable institutions to leverage the investment already made in preservation microfilming by making collections available digitally, while maintaining the preservation master in analogue form. The problems of preserving digital files over time are described as formidable, and the CLIR explicitly states that no responsible custodian would assert that digitisation is preferable to microfilming as a preservation medium.

The InterPARES Trust AI project raises the key question that all archives must now confront: even if the data are preserved, will archives have the hardware and know-how needed to access, use, and see them in the future? The project aims to preserve today's digital artifacts for centuries to come. This is a question that microfilm, requiring only a light source and magnification, answers definitively.

4. Standards & Regulatory Landscape

| Standard | Issuing Body | Current Version | Relevance to This Research |
|---|---|---|---|
| ISO 14721 (OAIS) | ISO/CCSDS | 2025 (3rd ed.) | Foundational digital preservation reference model; explicitly accommodates non-digital objects |
| ISO 18906 | ISO | 2023 | Archival microfilm specification; basis for 500-year lifespan claims |
| ANSI/AIIM MS23 | AIIM | Current | Operational procedures for microfilm production to archival standard |
| ANSI/AIIM TR31 | AIIM | Current | Legal acceptance of microfilm in court; relevant as AI document forgery rises |
| PREMIS | Library of Congress | 2025 | Preservation metadata standard; now being re-evaluated in context of AI |
| NDSA Levels | NDSA | 2026 updated | Levels of digital preservation framework; updated as AI challenges emerge |

Key regulatory development: As of June 30, 2024, NARA no longer accepts paper records from Federal agencies, with limited exceptions. NARA holds over one billion files representing more than 700 file format versions. This all-digital mandate creates an enormous at-risk corpus requiring permanent hybrid archiving intervention to avoid catastrophic loss.

5. The Document Lifecycle: Recommended Hybrid Intervention Points

Based on the evidence above, the following framework maps generative AI risks to lifecycle stages and identifies where microfilm must intervene:

Stage 1: Creation / Generation
AI-generated or AI-assisted documents enter the record corpus.
Risk: Synthetic provenance, deepfake content, model-collapse-contaminated data.
Intervention: Metadata provenance tagging (C2PA), human-in-the-loop authentication at creation stage.

Stage 2: Active Use / Management (0–5 years)
Records are in active operational use. Digital systems appropriate.
Risk: Format change, metadata corruption, ransomware.
Intervention: Standard digital records management + backup.

Stage 3: Semi-Current / Appraisal (5–20 years)
Records are appraised for permanent retention. This is the critical intervention point. Records with permanent archival value (legal, governmental, heritage) should be committed to microfilm at this stage. Digital continues as the access medium.
Rationale: Government professionals must work to ensure important born-digital records are identified and preserved in the short term before being transferred for long-term archiving. The transfer point is where the permanent medium must be selected.

Stage 4: Archival Transfer / Permanent Custody (20+ years)
Records enter the permanent archive.
Intervention: Microfilm becomes the preservation master. Digital copy maintained for access, search, and AI processing. This is exactly the hybrid model articulated by CLIR and implicitly endorsed by OAIS's accommodation of non-digital objects.

Stage 5: Century-Scale Preservation (100–500+ years)
Digital formats will have been migrated multiple times; formats in use today will not exist. Microfilm requires no migration. Medium independence: a light source and magnification will always be available.
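The staged framework above can be expressed as a small lookup, sketched here for illustration. Treating each age boundary as half-open (a record aged exactly 5 years counts as Stage 3) is an assumption, since the stated ranges overlap at their edges. The class and function names are hypothetical, and Stage 1 is omitted because it describes the creation event, not an age range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    name: str
    min_age_years: int  # lower bound, inclusive (an assumption of this sketch)


# Age thresholds taken from the stage descriptions above.
STAGES = (
    Stage(2, "Active Use / Management", 0),
    Stage(3, "Semi-Current / Appraisal", 5),
    Stage(4, "Archival Transfer / Permanent Custody", 20),
    Stage(5, "Century-Scale Preservation", 100),
)


def stage_for_age(age_years):
    """Return the latest stage whose lower age bound has been reached."""
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    current = STAGES[0]
    for stage in STAGES:
        if age_years >= stage.min_age_years:
            current = stage
    return current


def preservation_master(age_years, permanent_value):
    """Microfilm from Stage 3 onward for records of permanent value."""
    if permanent_value and stage_for_age(age_years).number >= 3:
        return "microfilm"
    return "digital"


if __name__ == "__main__":
    for age in (1, 5, 25, 150):
        stage = stage_for_age(age)
        print(age, stage.number, stage.name, preservation_master(age, True))
```

In this sketch, records without permanent value stay digital at every age, on the assumption that they are handled under normal disposal schedules and never reach permanent custody.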

6. Full Reference List

  1. Canning, D. & Jaillant, L. (2025). "AI to review government records: new work to unlock historically significant digital records." AI & Society. DOI: 10.1007/s00146-025-02221-0. Published 22 February 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12442487/
  2. Council on Library and Information Resources (CLIR). "Digital Imaging and Preservation Microfilm: the Future of the Hybrid Approach for the Preservation of Brittle Books." CLIR Publications. https://www.clir.org/pubs/archives/hybridintro/
  3. Digital Preservation Coalition (DPC). (2024). Global 'Bit List' of Endangered Digital Species — 2024 Interim Report. Retrieved from https://www.dpconline.org
  4. Digital Preservation Coalition (DPC). (2024). "DP and Artificial Intelligence — A Four Point Plan." Blog. Retrieved from https://www.dpconline.org/blog/dp-and-artificial-intelligence-a-4-point-plan
  5. Inscribe. (2026). Document Fraud Report 2026. Cited in TrueScreen analysis, March 2026. https://truescreen.io/articles/document-forgery-ai/
  6. InterPARES Trust AI (ITrustAI). (2024). Artificial Intelligence and Documentary Heritage. SCEaR Newsletter Special Issue 2024. UNESCO Memory of the World Sub-Committee. Retrieved from https://interparestrustai.org
  7. ISO. (2025). ISO 14721:2025 — Space Data System Practices: Reference Model for an Open Archival Information System (OAIS), 3rd edition. International Organization for Standardization, Geneva. https://www.iso.org/standard/87471.html (replaces ISO 14721:2012)
  8. ISO. (2023). ISO 18906:2023 — Imaging materials — Photographic films — Specifications for safety photographic film. International Organization for Standardization, Geneva.
  9. Mosweu, O. & Bwalya, K.J. (2024). "The challenges of post custodial management of digital records in Botswana." Information Development, SAGE Journals. DOI: 10.1177/02666669221114867
  10. Murray, K. & Sternfeld, J. (2026). Content Authenticity and Provenance in the Age of Artificial Intelligence: A Call-to-Action for the LAMs Community. C2PA for G+LAM Community of Practice, Library of Congress. Published February 2026. https://blogs.loc.gov/thesignal/files/2026/04/Call-to-Action-CAP-for-LAMs.pdf
  11. National Archives and Records Administration (NARA). (2024). Digital Preservation Strategy 2022–2026. U.S. National Archives. https://www.archives.gov/preservation/digital-preservation/strategy
  12. National Archives and Records Administration (NARA). (2024). National Archives Updates Digital Preservation Framework. September 30, 2024. https://www.archives.gov/news/articles/digital-preservation-framework-update-2024
  13. National Archives and Records Administration (NARA). (2024). "New Strategic Framework Emphasises Building Capacity Through Responsible Use of Artificial Intelligence." October 17, 2024. https://www.archives.gov/news/articles/new-strategic-framework-artificial-intelligence
  14. Observer Research Foundation (ORF). (2026). "Algorithms of Falsehood: The Challenges of Governing AI-Generated Disinformation." April 2026. https://www.orfonline.org/expert-speak/algorithms-of-falsehood-the-challenges-of-governing-ai-generated-disinformation
  15. Rockembach, M. (2024). "AI Literacy: a must for records management and archival professionals." In: Duranti L, Rogers C (Eds.) Artificial Intelligence and Documentary Heritage. SCEaR Newsletter, UNESCO, Special Issue 2024.
  16. Shumailov, I., Shumaylov, Z., Zhao, Y. et al. (2024). "AI models collapse when trained on recursively generated data." Nature, 631, 755–759. DOI: 10.1038/s41586-024-07566-y
  17. Tate. "Archives and Record Management: Records Continuum Theory." Research methodology note. https://www.tate.org.uk/research/reshaping-the-collectible/research-approach-archives-record-management
  18. University at Buffalo / The Conversation. (2025). "2026 will be the year you get fooled by a deepfake, researcher says." Fortune, republished December 27, 2025. https://fortune.com/2025/12/27/2026-deepfakes-outlook-forecast/
