scraper module

scraper.add_repository(name=None, url=None, print_result=True)

Stores the metadata of a repository into the MongoDB database.

Parameters
  • name – (opt) The screen name of the repository.

  • url – (opt) The URL to the repository.

  • print_result – (opt) Whether to print the result of the add operation.

Returns

The UUID of the newly added repository.

scraper.delete_repository(uuid)

Deletes a repository.

Parameters

uuid – The UUID of the repository to be deleted.

Returns

Whether or not the deletion was successful (boolean).

scraper.download_files(uuid)

Downloads every file from the given repository.

Parameters

uuid – The UUID of the repository to download files from.

Returns

The duration in seconds (float).

scraper.download_repository(uuid=None)

Starts downloading data of a specific repository.

Parameters

uuid – The UUID of the repository. If no UUID is given, you are asked to choose one of the available repositories.

Returns

The time that the process took in seconds (float).

scraper.get_package_count(url)

Gets the number of packages available in a CKAN repository.

Parameters

url – The URL of the CKAN repository.

Returns

The number of packages.
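
How this function obtains the count is not documented here. As a minimal sketch, assuming the repository exposes the standard CKAN action API, the count can be read from a package_search response requested with zero rows. The helper names build_count_url and extract_package_count and the host ckan.example.org are hypothetical and not part of the scraper module.

```python
from urllib.parse import urlencode


def build_count_url(base_url):
    """Build a package_search URL that requests zero rows (count only).

    Hypothetical helper; assumes the standard CKAN action API path.
    """
    query = urlencode({"rows": 0})
    return base_url.rstrip("/") + "/api/3/action/package_search?" + query


def extract_package_count(payload):
    """Read the total package count from a decoded package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN request was not successful")
    return int(payload["result"]["count"])


# Example with a hand-written response instead of a real HTTP call.
sample_response = {"success": True, "result": {"count": 42, "results": []}}
count_url = build_count_url("https://ckan.example.org/")
package_count = extract_package_count(sample_response)
```

Keeping the URL construction separate from the parsing makes both parts checkable without network access.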

scraper.scrape_chunk(repository_uuid, rows, start)

Fetches a whole chunk of documents.

Parameters
  • repository_uuid – The UUID of the CKAN repository to fetch from.

  • rows – The chunk size.

  • start – The offset.

Returns

The fetched links, as a list.
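
The rows and start parameters match the paging parameters of CKAN's package_search action. As a rough sketch, assuming the links are the url fields of each package's resources in such a response, one chunk could be turned into a list of links as follows. The helper names chunk_params and extract_links are hypothetical, and the module's actual implementation may differ.

```python
def chunk_params(rows, start):
    """Query parameters for one page of package_search results."""
    return {"rows": rows, "start": start}


def extract_links(payload):
    """Collect the resource URLs from a decoded package_search response.

    Hypothetical helper; packages without resources and resources
    without a URL are skipped.
    """
    links = []
    for package in payload["result"]["results"]:
        for resource in package.get("resources", []):
            url = resource.get("url")
            if url:
                links.append(url)
    return links


# Example with a hand-written response instead of a real HTTP call.
sample_chunk = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"name": "a", "resources": [{"url": "https://example.org/a.csv"}]},
            {"name": "b", "resources": [{"url": ""}, {"url": "https://example.org/b.json"}]},
        ],
    },
}
links = extract_links(sample_chunk)
```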

scraper.scrape_urls(repository_uuid)

Fetches the resource links of a repository.

Parameters

repository_uuid – The UUID of the repository to scrape.

Returns

The number of downloadable resources.
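
One plausible way to scrape a whole repository is to walk it in fixed-size chunks and sum the links each chunk yields. The sketch below illustrates only that paging arithmetic. The names chunk_offsets and count_resources and the fetch_chunk callback are hypothetical stand-ins, and the module's own control flow is an assumption.

```python
def chunk_offsets(total, rows):
    """Return the start offsets needed to cover `total` items in chunks of `rows`."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    return list(range(0, total, rows))


def count_resources(total, rows, fetch_chunk):
    """Sum the number of links returned by each chunk.

    `fetch_chunk(rows, start)` is a hypothetical callback that returns
    a list of links for one chunk.
    """
    return sum(len(fetch_chunk(rows, start)) for start in chunk_offsets(total, rows))


# Example with a fake fetcher that yields one link per package.
def fake_fetch(rows, start, total=5):
    return ["link-%d" % i for i in range(start, min(start + rows, total))]


offsets = chunk_offsets(5, 2)
resource_count = count_resources(5, 2, fake_fetch)
```

The final chunk may be shorter than rows, which is why the example counts returned links rather than multiplying the chunk size by the number of chunks.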