Title: | Simple Scraping and Tidy Webpage Summaries |
---|---|
Description: | Simple tools for scraping webpages, extracting common html tags and parsing contents to a tidy, tabular format. Tools help with extraction of page titles, links, images, rss feeds, social media handles and page metadata. |
Authors: | Alastair Rushworth |
Maintainer: | Alastair Rushworth <[email protected]> |
License: | GPL-2 |
Version: | 0.6.0 |
Built: | 2024-12-25 03:10:56 UTC |
Source: | https://github.com/alastairrushworth/htmldf |
Given a vector of URLs, html_df() attempts to fetch the HTML of each page. From the HTML, it looks for a page title, RSS feeds, images, embedded social media profile handles and other page metadata. Page language is inferred using the cld3 package, which wraps Google's Compact Language Detector 3.
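To illustrate the language-inference step, the sketch below calls cld3 directly on two made-up strings. How html_df() prepares page text before handing it to cld3 is not described here, so this shows only the underlying detector rather than the package's internal pipeline; the sample strings are hypothetical.

```r
library(cld3)

# Hypothetical snippets standing in for text extracted from a page
page_text <- c(
  "This package scrapes webpages and summarises them in a tidy table.",
  "Ce paquet extrait des pages web et les résume dans un tableau."
)

# detect_language() returns one language code per input string,
# or NA when the detector cannot make a reliable guess
langs <- detect_language(page_text)
langs
```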
html_df( urlx, max_size = 5e+06, wait = 0, retry_times = 0, time_out = 30, show_progress = TRUE, keep_source = TRUE, chrome_bin = NULL, chrome_args = NULL, ... )
urlx |
A character vector containing urls. Local files must be prepended with |
max_size |
Maximum size in bytes of pages to attempt to parse, defaults to 5e+06 |
wait |
Time in seconds to wait between successive requests. Defaults to 0. |
retry_times |
Number of times to retry a URL after failure. |
time_out |
Time in seconds to wait for |
show_progress |
Logical, defaults to TRUE |
keep_source |
Logical argument - whether or not to retain the contents of the page |
chrome_bin |
(Optional) Path to a Chromium install to use Chrome in headless mode for scraping |
chrome_args |
(Optional) Vector of additional command-line arguments to pass to chrome |
... |
Additional arguments to 'httr::GET()'. |
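The arguments above can be combined into a more cautious fetch when requesting several pages from the same site. The wrapper below is a minimal sketch: polite_html_df() and its parameter names are hypothetical and not part of htmldf, and the chosen values are arbitrary.

```r
# Hypothetical helper (not part of htmldf): wraps html_df() with
# conservative settings for fetching several pages from the same site.
polite_html_df <- function(urls, pause = 1, retries = 2, timeout = 10, ...) {
  htmldf::html_df(
    urls,
    wait = pause,           # seconds to wait between successive requests
    retry_times = retries,  # number of retries after a failed request
    time_out = timeout,     # seconds to wait on each request
    show_progress = FALSE,
    keep_source = FALSE,    # do not retain the page contents
    ...                     # passed on to html_df(), which forwards to httr::GET()
  )
}
```

Calling polite_html_df("https://alastairrushworth.github.io/") requires an internet connection and the htmldf package; defining the function does not.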
A tibble with columns
url
the original vector of urls provided
title
the page title, if found
lang
inferred page language
url2
the fetched url, this may be different to the original, for example if redirected
links
a list of tibbles of hyperlinks found in <a> tags
rss
a list of embedded RSS feeds found on the page
tables
a list of tables found on the page in descending order of size, coerced to tibble wherever possible.
images
list of tibbles containing image links found on the page
social
list of tibbles containing twitter, linkedin and github user info found on page
code_lang
numeric score indicating the inferred code language. Values near -1 indicate a high likelihood that the language is Python; values near 1 indicate R. If no code tags are detected, or the language could not be inferred, the value is NA.
size
the size of the downloaded page in bytes
server
the page server
accessed
datetime when the page was accessed
published
page publication or last updated date, if detected
generator
the page generator, if found
status
HTTP status code
source
character string of xml documents. These can each be coerced to xml_document using xml2::read_html() for further processing with rvest.
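The source column can be processed further along these lines. The sketch uses a small hard-coded HTML string as a stand-in for a single element of the source column, so it runs without an internet connection; the page content and variable names are hypothetical.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for one element of the `source` column
page_source <- paste0(
  "<html><head><title>Example page</title></head><body>",
  "<h1>Hello</h1>",
  "<a href='https://example.com/a'>A</a>",
  "<a href='https://example.com/b'>B</a>",
  "</body></html>"
)

# Coerce the character string to an xml_document
doc <- read_html(page_source)

# Query the parsed document with rvest
headings <- html_text(html_elements(doc, "h1"))
hrefs <- html_attr(html_elements(doc, "a"), "href")
```

With a real result, the same steps would start from one element of the source column of the tibble returned by html_df() instead of page_source.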
Alastair Rushworth
# Examples require an internet connection...
urlx <- c("https://github.com/alastairrushworth/htmldf",
          "https://alastairrushworth.github.io/")
dl <- html_df(urlx)

# preview the dataframe
head(dl)

# social tags
dl$social

# page titles
dl$title

# page language
dl$lang

# rss feeds
dl$rss

# inferred code language
dl$code_lang

# print the page source
dl$source