Spanish Web Archive
The Spanish Web Archive is the collection of websites (including blogs, forums, documents, images and videos) harvested in order to preserve Spain's documentary heritage on the Internet and to ensure access to it.
Given the sheer size of the Internet and the technological means currently available, exhaustive web archiving is not feasible. Therefore, in order to save as much information as possible, the National Library of Spain has opted for a mixed model that combines large-scale and selective harvesting, as other national libraries around the world do.
These collections are available both at the National Library of Spain and at the conservation centres of the Autonomous Communities.
Inspired by UNESCO’s Guidelines for the Preservation of Digital Heritage (2003) and the Commission Recommendation of 24 August 2006 on the digitisation and online accessibility of cultural material and digital preservation, the National Library of Spain began to capture Spanish websites hosted in the .es domain, as well as in other generic domains and subdomains (.com, .edu, .gob, .org, .net, etc.).
From the start of the project in 2009 until the end of 2013, eight broad crawls of the .es domain and two selective harvests were carried out. The first selective harvest provided monographic coverage of the General Elections of 20 November 2011, and the second compiled Spanish resources in the field of the Humanities. The results of these crawls, carried out by the Internet Archive for the National Library of Spain, were transferred to the Library’s servers at the end of 2014 through a cooperation agreement signed with Red.es. Red.es collaborates actively with the Library in technological and infrastructure development for the management of the legal deposit of online publications.
In 2014 the Library installed, in a test environment, NetarchiveSuite, an open source suite of tools for web crawling and archiving. With this in-house installation the Library has since conducted several selective crawls on events relevant to Spanish history and culture, such as the death of Adolfo Suárez, the abdication of Juan Carlos I, the proclamation of Felipe VI, the European elections of 2014, the local and regional elections of 2015, and the general elections of 2015 and 2016.
In 2015, Royal Decree 635/2015 of 10 July, which regulates the legal deposit of online publications, was published; it entered into force on 26 October of that year. The decree supports the work that conservation centres have carried out in recent years to preserve online publications, particularly through web archiving projects.
In 2016 the Library carried out its first broad crawl of the .es domain using its own resources; the crawl lasted three months.
That same year, cooperation between the conservation centres of the Autonomous Communities and the National Library of Spain was strengthened in order to manage and build a collaborative online legal deposit. A growing number of centres now manage their own web collections, using the tools that the BNE has made available to all of them.
A web archive is the set of resources collected from the Web over time.
These resources form collections of websites grouped by subject, by event or by risk of disappearance. Harvesting is automated: crawlers scan the websites, copying and saving all the information they can reach. This information is stored, preserved and disseminated through the Spanish Web Archive.
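As a rough illustration of what "scanning a website and saving its pages" involves, the following minimal sketch follows links breadth-first within a single host. It is not how Heritrix or NetarchiveSuite is implemented; the `harvest` function, the `fetch` callable and the in-memory `SITE` dictionary are hypothetical names used so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed, fetch):
    """Breadth-first harvest of the pages reachable from `seed` on the same host.

    `fetch` is a hypothetical callable mapping a URL to its HTML,
    or to None when the page does not exist.
    Returns a dict of URL -> saved HTML.
    """
    host = urlparse(seed).netloc
    saved = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in saved:
            continue  # already copied during this harvest
        html = fetch(url)
        if html is None:
            continue
        saved[url] = html  # keep the page exactly as it was served
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in saved:
                queue.append(target)
    return saved


# Toy in-memory "website" standing in for real HTTP requests (assumption).
SITE = {
    "http://example.es/": '<a href="/a.html">A</a> <a href="http://other.org/">out</a>',
    "http://example.es/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.es/b.html": "<p>leaf page</p>",
}

archive = harvest("http://example.es/", SITE.get)
print(sorted(archive))
```

The link to `other.org` is skipped because it leaves the seed's host, which is one simple way of keeping a harvest within a chosen scope.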
The collections seek to reproduce in detail the appearance of the website and the functionality available at the time of the harvest, so that the replica is as navigable as the “live” version. Once the crawl is complete, the archived websites are displayed through OpenWayback, an application that enables the user to select and consult a specific version of a particular website.
The websites are selected in advance by library staff specialised in digital heritage preservation. The selection criteria are defined in the Collection Development Policy document.
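Choosing "a specific version" of a site amounts to picking one capture among several taken at different times. The sketch below assumes captures are identified by 14-digit timestamps (YYYYMMDDhhmmss), a convention common to Wayback-style replay tools, and returns the capture closest to a requested date. The function name and the capture list are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit form, e.g. 20140619120000


def closest_capture(captures, requested):
    """Return the capture timestamp nearest to `requested`.

    `captures` is a sorted list of 14-digit timestamp strings for one URL.
    """
    if not captures:
        raise ValueError("no captures for this URL")
    times = [datetime.strptime(c, TIMESTAMP_FORMAT) for c in captures]
    target = datetime.strptime(requested, TIMESTAMP_FORMAT)
    i = bisect_left(times, target)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = times[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda t: abs(t - target))
    return best.strftime(TIMESTAMP_FORMAT)


# Hypothetical captures of a single website.
captures = ["20140323100000", "20140619120000", "20151220090000"]
print(closest_capture(captures, "20140701000000"))  # prints 20140619120000
```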
All the information is stored in a standard file format called WARC (Web ARChive file format, ISO 28500), which aggregates the content of the collected websites, together with metadata about each capture, into container files that are usually compressed.
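To give an idea of what a WARC file contains, the following sketch writes and re-reads a single record using only the Python standard library. It follows the general record layout of the standard (a version line, named header fields, a blank line, the content block, and two trailing line breaks), but it is a simplified illustration, not a complete or validated implementation; `build_warc_record` and `parse_warc_record` are hypothetical helper names.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type="text/html"):
    """Serialise one WARC 'resource' record as bytes."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A blank line separates headers from the block; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


def parse_warc_record(record):
    """Split one serialised record back into (version, headers, payload)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    headers = dict(line.split(": ", 1) for line in lines[1:])
    length = int(headers["Content-Length"])
    return lines[0], headers, rest[:length]


record = build_warc_record("http://example.es/", b"<html>hola</html>")
version, headers, payload = parse_warc_record(record)
print(version, headers["WARC-Target-URI"], payload)
```

A real WARC file is a sequence of such records, commonly compressed with gzip; production systems rely on dedicated tooling rather than hand-written serialisation like this.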
As with any other bibliographic material, the Library archives websites for a variety of reasons that justify their necessity and usefulness for future generations:
- Content not stored in a web archive will be lost permanently and irretrievably.
- Archived websites bear witness to the history of the Internet and of website creation.
- Study of society and the evolution of customs and ideas.
- Preservation of a country’s online cultural and documentary heritage.
- Storage of ephemeral content with great potential to disappear in the short term.
- Tool for the study and research of events with high representation on the Internet.
- Recovery of the content of deleted or missing websites.
Due to the enormous size of the Internet and the technological means currently available, completeness in web archiving is at present out of reach. Therefore, to save as much information as possible, the National Library of Spain has opted for a mixed model which combines full-domain and selective harvesting, as other national libraries around the world do.
The tool used by the National Library of Spain for web archiving is NetarchiveSuite (NAS). This open source application was designed in 2004 by the Royal Library of Denmark and is now also used for this purpose by other national libraries. For crawling it uses the Heritrix robot, created by the Internet Archive, which has been crawling and archiving the Web since 1996 and was the first organisation to do so. For viewing the archive it uses OpenWayback, an application created by the International Internet Preservation Consortium (IIPC) that lets the user consult a website as captured on a given date.
The general selection criteria are based on article 3 of Royal Decree 635/2015, of 10 July, regulating the legal deposit of online publications, according to which websites subject to legal deposit are those that:
- contain bibliographic, sound, visual, audiovisual or digital heritage of the cultures of Spain;
- are under the .es domain and associated subdomains, as well as other domains in the national territory;
- are hosted in other domains (.com, .net, .org, .edu, etc.), but contain Spanish documentary heritage;
- are in any of the official languages of the State;
- are in any format, including the publications contained therein;
- have both free and restricted access.
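Of the criteria above, only the domain test lends itself to automation; the others call for a curator's judgement. The sketch below illustrates that split under an assumed workflow: URLs whose host falls under .es are routed to the broad crawl, and everything else is flagged for manual review. The function name and the two route labels are hypothetical and do not describe the Library's actual tooling.

```python
from urllib.parse import urlparse


def harvest_route(url):
    """Return 'broad' for hosts under .es, otherwise 'selective-review'.

    Assumed workflow: non-.es sites may still hold Spanish documentary
    heritage, so they are passed to a curator rather than rejected.
    """
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    if host == "es" or host.endswith(".es"):
        return "broad"
    return "selective-review"


for url in ("http://www.example.es/noticias", "https://example.com/es/"):
    print(url, "->", harvest_route(url))
```

Note that the check looks at the host name only, so a path such as `/es/` on a .com site does not count as belonging to the .es domain.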
There are several categories of web resources that it would be advisable to include in website selections in order to make the documentary sample as representative as possible:
- Mass media: newspapers, news agencies, radio and television stations.
- Administrative bodies: Ministries, Autonomous Communities, City Councils.
- Political institutions: political parties.
- Cultural institutions: museums, archives, libraries, schools, universities, research centres.
- Scientific institutions.
- Health institutions.
- Sports institutions.
- Websites focusing on natural and artistic heritage.
- Cultural events, congresses, assemblies, conferences…
- Websites of private companies.
- Associations: professionals, NGOs.
- Blogs and websites of relevant people related to the subject of the collection.
- Social networks: Twitter (currently X), Facebook.
- Wikis: Wikipedia.
- Video recordings: YouTube.
There are certain limitations, both legal and technical, that affect the collection of online publications.
On the legal side, according to Royal Decree 635/2015, the following are excluded from collections (art. 4):
- Mail and private correspondence.
- Content that is hosted only on a private network.
- Personal data to which only a restricted group of people has access.
In accordance with the provisions of articles 6 and 7 of Royal Decree 635/2015, of 10 July, the National Library of Spain, O.A., exercises its function of capturing and depositing online publications that have been the object of public communication, as well as websites accessible through communications networks. This capture and deposit is carried out without altering the contents, in order to guarantee their integrity and historical traceability. Consequently, the BNE is not responsible for any content that, while forming part of the capture and deposit, is contrary to the law, morality or public order; responsibility for such content lies with the owners of those communications.
From a technical point of view, some content, despite being freely accessible on the Internet, cannot be collected under current technological conditions:
- Databases, repositories, catalogues.
- Interactive reading viewers.
- Streaming content.
- Files in the cloud.
- Content behind filters, drop-down lists or check boxes.