Full domain crawls


Broad crawls aim to harvest an entire domain, without exclusions or selection. The harvesting robot, provided with the list of registered domains, crawls them all and archives the content, according to a pre-determined configuration.
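
As a rough illustration of how a registry's domain list might become input for the harvesting robot, the minimal Python sketch below normalises a list of domain names into seed URLs. The function name, the use of plain http:// seeds and the sample domains are all assumptions made for the example; the source does not describe the archive's actual tooling or configuration.

```python
def seeds_from_domains(lines):
    """Turn raw domain names into a de-duplicated list of seed URLs.

    Hypothetical helper: the normalisation rules and the http:// scheme
    are assumptions, not the archive's documented procedure.
    """
    seen = set()
    seeds = []
    for raw in lines:
        # Normalise case, surrounding whitespace and a trailing dot.
        domain = raw.strip().lower().rstrip(".")
        if not domain or domain.startswith("#"):
            continue  # skip blank lines and comment lines
        if domain in seen:
            continue  # keep only the first occurrence of each domain
        seen.add(domain)
        seeds.append(f"http://{domain}/")
    return seeds


# Placeholder domains, used only to show the shape of the input and output.
sample = ["Example.es", "", "example.es", "ejemplo.gal."]
print(seeds_from_domains(sample))
```

Because there is no selection step, every domain on the list yields exactly one seed; only blank lines, comments and duplicates are dropped.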

In broad crawls there is no selection by librarians, which eliminates any filtering, bias or subjectivity.

The Spanish Web Archive currently harvests the four national domains: .es, .gal, .cat and .eus. It does this once a year in collaboration with the regional conservation centres. It also carries out an annual bulk crawl of open-access serial publications on the Internet.

The contents saved in the full domain crawls can only be consulted via URL.


Archived websites


Crawl of the .es domain

The .es domain has been crawled annually since 2009, with the aim of capturing an overview of the country's web each year.

Between 2009 and 2013, eight bulk crawls of the .es domain were carried out using the infrastructure of the Internet Archive. These crawls contain the oldest websites preserved in the Spanish Web Archive.

In 2014 the National Library of Spain acquired its own crawling infrastructure and, after a trial period, in 2016 ran the bulk crawl of the .es domain itself for the first time. For this crawl, the registrar Red.es prepares and supplies the domain list in advance, based on the complete listing of all domains registered with ESNIC. This first crawl saved 800,000 domains, with a size limit of 100 MB per domain, for a total of 28 TB.

Currently the bulk crawl of the .es domain is run once a year. It covers around 2,000,000 domains, with a maximum size of 150 MB per domain, stores around 70 TB, and manages to save information from more than 80 per cent of the domains.
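
To put these figures in proportion, the short Python sketch below computes the average stored volume per domain and compares it with the per-domain size limit. It uses only the numbers quoted for the .es crawls; the decimal conversion (1 TB = 1,000,000 MB) and the choice to divide by the number of domains covered are assumptions made for the example.

```python
# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def average_mb_per_domain(total_tb, domains):
    """Average stored volume per domain, in MB."""
    return total_tb * MB_PER_TB / domains


# Figures as quoted in the text for the .es bulk crawls.
crawls = {
    "first own crawl": {"domains": 800_000, "total_tb": 28, "cap_mb": 100},
    "current crawl": {"domains": 2_000_000, "total_tb": 70, "cap_mb": 150},
}

for name, c in crawls.items():
    avg = average_mb_per_domain(c["total_tb"], c["domains"])
    share = avg / c["cap_mb"]
    print(f"{name}: about {avg:.0f} MB per domain, "
          f"{share:.0%} of the {c['cap_mb']} MB limit")
```

On these rounded figures both crawls average about 35 MB per domain, which suggests that most domains stay well below the size limit.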

Crawl of the .gal domain

The .gal domain has been crawled annually since 2017 in collaboration with the Library of Galicia, which provides the list of Galician domains obtained from the PuntoGal organisation.

The first crawl covered more than 4,000 domains, with a limit of 150 MB per domain and a total of 140 GB stored. Currently it covers more than 6,000 domains and 280 GB of information.

The Spanish Web Archive holds captures of the first website to use this domain: http://www.dominio.gal

Crawl of the .cat domain

In 2002 the National Library of Spain began harvesting the .cat domain in collaboration with the Library of Catalonia, which provides a list of Catalan domains obtained from the puntCat Foundation.

The first crawl captured 44,000 .cat domains, with a size limit of 150 MB per domain. It downloaded 77 per cent of them completely and reached 2.5 TB of stored information.

Crawl of the .eus domain

In 2023 the National Library of Spain began harvesting the .eus domain in collaboration with the Basque Country Digital Library, which provides a list of Basque domains to collect, obtained from the PuntuEUS association.

The first bulk crawl covered more than 13,000 domains and 750 GB of information, with a maximum size of 150 MB per domain.

Bulk crawls of serials in open access

Electronic serials, especially magazines, are without doubt among the most short-lived content on the Internet. Every year serials are created and disappear, many of them with no printed equivalent, so their disappearance is final and their recovery impossible.

The Spanish Web Archive aims to collect, in bulk and systematically, the websites of these open-access electronic serials, including the digital articles they contain, so that they are preserved and remain accessible in the future even if they disappear from the live web.

As the national ISSN centre, the BNE records in its catalogue all Spanish electronic serials that have this international number, and it is from this catalogue that it draws the URLs and domains used to launch this bulk crawl.

The first serials crawl was conducted in 2020, with more than 8,000 serial URLs covering more than 3,700 domains and a maximum size of 1 GB per domain. Currently it covers more than 10,000 serial websites belonging to 7,000 domains, exceeding 5 TB of stored information.