Spanish Web Archive
The Spanish Web Archive is a collection made up of websites (including blogs, forums, documents, images, videos, etc.) that are captured to preserve and ensure access to online Spanish documentary heritage.
Given the sheer size of the Internet and the technological means currently available, it is not possible at present to archive all existing web content comprehensively. For this reason, and with a view to saving as much online information as possible, the Spanish National Library (BNE), like other national libraries around the world, uses a mixed model that combines bulk and selective crawls.
Websites are captured by crawler robots that visit previously selected URLs and save all the linked information they find, at a set frequency and within set depth and size limits. These website crawls produce web archive files, where all the information collected is saved and can be consulted.
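The depth and size limits described above can be illustrated with a minimal sketch. It is not how Heritrix is implemented: the names `CrawlLimits`, `crawl` and `fetch` are hypothetical, and a small in-memory dictionary stands in for real web requests. Frequency is not modelled here; it would simply be how often such a crawl is repeated.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CrawlLimits:
    max_depth: int   # how many links away from a seed URL the crawler may go
    max_bytes: int   # total size budget for one crawl


def crawl(seeds, fetch, limits):
    """Breadth-first crawl from the seed URLs, bounded by depth and total size.

    fetch(url) is a hypothetical callable returning (content_bytes, links).
    """
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    archived = {}
    used = 0
    while queue:
        url, depth = queue.popleft()
        content, links = fetch(url)
        if used + len(content) > limits.max_bytes:
            break  # size budget exhausted, stop the crawl
        archived[url] = content
        used += len(content)
        if depth < limits.max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return archived


# A tiny in-memory "web" standing in for real HTTP requests (illustrative only).
SITE = {
    "http://example.es/": (b"home", ["http://example.es/a", "http://example.es/b"]),
    "http://example.es/a": (b"page a", ["http://example.es/c"]),
    "http://example.es/b": (b"page b", []),
    "http://example.es/c": (b"page c", []),
}

result = crawl(["http://example.es/"], SITE.__getitem__,
               CrawlLimits(max_depth=1, max_bytes=1000))
print(sorted(result))
```

With a depth limit of 1, the seed page and the two pages it links to are saved, while the page two links away is left out.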
To archive websites, the Spanish National Library uses NetarchiveSuite (NAS), an open-source tool designed by the National Library of Denmark and currently used for the same purpose by other national libraries, such as those of France and Austria. To crawl websites, the tool uses the Heritrix robot, created by the Internet Archive, the first organisation to start crawling and archiving websites, in 1996.
Crawls aim to reproduce the layout of the website in detail, along with the functionalities available at the time of capture, so that the archived copy can be browsed as if it were the “live” version. Once a crawl has been completed, the archived websites are viewed through OpenWayback, an application that lets users select which version of a specific website they wish to consult.
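Choosing a version of an archived website amounts to picking one capture among several taken at different times. The sketch below shows one simple way to do this, by returning the capture nearest to a requested date. It is an illustration only, not OpenWayback's code: the function name `closest_capture` and the sample capture dates are hypothetical, and the 14-digit `YYYYMMDDhhmmss` timestamp format is assumed here.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # assumed 14-digit timestamp, e.g. 20140602120000


def closest_capture(capture_timestamps, requested):
    """Return the capture timestamp nearest in time to the requested one."""
    target = datetime.strptime(requested, FMT)
    return min(
        capture_timestamps,
        key=lambda ts: abs(datetime.strptime(ts, FMT) - target),
    )


# Hypothetical capture dates for a single archived website.
captures = ["20111120090000", "20140602120000", "20160315080000"]
print(closest_capture(captures, "20140619000000"))
```

For a request dated 19 June 2014, the function selects the capture of 2 June 2014, the nearest of the three.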
Based on the UNESCO Charter on the Preservation of Digital Heritage (2003) and the European Commission's Recommendation on the digitisation and online accessibility of cultural material and digital preservation, the BNE started to capture Spanish websites and pages under the .es top-level domain, as well as other top-level domains and subdomains (.com, .edu, .gob, .org, .net, etc.).
Between the launch of the BNE project in 2009 and the end of 2013, eight bulk crawls of the .es domain and two selective crawls were performed. The first selective crawl gave monographic coverage of the General Election of 20 November 2011, and the second gathered Spanish resources in the field of the Humanities. The results of these crawls, carried out by the Internet Archive for the BNE, were transferred to the Library's servers at the end of 2014, thanks to a cooperation agreement signed with Red.es. Red.es cooperates actively with the National Library to develop technology and infrastructure to manage the legal deposit of online publications.
In 2014, the Library installed the open-source NetarchiveSuite tool in a test environment to crawl and archive the web. With this system, the Library has since carried out various selective crawls on events relevant to Spanish history and culture, such as the death of Adolfo Suárez, the abdication of Juan Carlos I, the proclamation of Felipe VI, the 2014 European elections, the 2015 local and regional elections and the 2015-2016 General Elections.
In 2015, following a long period of preparation, Royal Decree 635/2015, of 10 July, was published, regulating the legal deposit of online publications. This decree came into force on 26 October of the same year. This Royal Decree supports the activity relating to the preservation of online publications that the conservation centres have carried out in recent years, and in particular, activities relating to website archiving projects.
In 2016, the first bulk crawl of the .es domain was carried out with internal resources. It lasted three months. That year also saw strengthened cooperation between the conservation centres in the Autonomous Communities and the BNE in the management and development of a collaborative legal deposit of online publications. A growing number of centres are managing their own collections, using the tools that the BNE has made available to them.
The bulk crawls track an entire domain and provide a static snapshot of the web at a specific point in time.
The Autonomous Communities have appointed conservation centres to manage the legal deposit of online publications. These centres prepare thematic collections with the resources that they consider necessary to preserve as part of the legal deposit within their area of competence.
On developments of special relevance to Spanish society.
Of particular relevance, for their social and political value, to Spanish society today and in the future.
Emergency crawls are carried out when websites are at risk of disappearing.