We can use get_links to parse an HTML file for different types of links. For example, these are the contents of _example/broken_links/test.html:
example = Path('_example/broken_links/test.html')
print(example.read_text())
Calling get_links with the above file path will return a list of links:
links = get_links(example)
test_eq(set(links), {'test.js',
'//somecdn.com/doesntexist.html',
'http://www.bing.com','http://fastlinkcheck.com/test.html',
'/test'})
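To make the idea of link extraction concrete, here is a minimal sketch that uses only the standard library. It is not how get_links is implemented; the LinkCollector class and the extract_links helper are hypothetical names introduced for illustration, and the sketch assumes that only href and src attributes count as links.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href and src attribute values from every tag (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def extract_links(html_text):
    """Hypothetical helper: return href/src values in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links

sample = ('<a href="/test">t</a>'
          '<script src="test.js"></script>'
          '<a href="http://www.bing.com">b</a>')
print(extract_links(sample))  # ['/test', 'test.js', 'http://www.bing.com']
```

The sketch returns links exactly as written in the HTML, which is why relative, root-relative, and absolute forms all appear side by side.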
The keys of the dict returned by local_urls are links found in HTML files, and the values are lists of the paths in which those links are found.
Furthermore, local links are returned as Path objects, whereas external URLs are strings. For example, notice how the link http://fastlinkcheck.com/test.html is resolved to a local path, because the host parameter supplied to local_urls, fastlinkcheck.com, matches the host in the link:
path = Path('./_example/broken_links/')
links = local_urls(path, host='fastlinkcheck.com')
links
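The host matching described above can be sketched with urllib.parse. This is an assumption-laden illustration rather than the code behind local_urls: resolve_link is a hypothetical helper, it resolves every local link against a single root directory (a real checker would also account for the directory of the file containing the link), and it assumes scheme-relative links such as //somecdn.com/... should be given an http: scheme.

```python
from pathlib import Path
from urllib.parse import urlparse

def resolve_link(link, root, host=None):
    """Hypothetical helper: map a link to a local Path or an external URL string."""
    parsed = urlparse(link)
    if parsed.netloc and parsed.netloc != host:
        # External host: keep a string, adding a scheme to scheme-relative links.
        return link if parsed.scheme else "http:" + link
    # Same host, or no host at all: treat the path as relative to the site root.
    return Path(root) / parsed.path.lstrip("/")

root = "_example/broken_links"
print(resolve_link("http://fastlinkcheck.com/test.html", root, host="fastlinkcheck.com"))
print(resolve_link("//somecdn.com/doesntexist.html", root, host="fastlinkcheck.com"))
print(resolve_link("test.js", root, host="fastlinkcheck.com"))
```

The first and third calls yield Path objects under the root directory, while the second yields the string http://somecdn.com/doesntexist.html.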
The path _example/broken_links/test doesn't exist, but _example/broken_links/test.html does:
p = Path("_example/broken_links/test")
assert not p.exists()
assert html_exists(p)
The path _example/broken_links/really_doesnt_exist doesn't exist, and neither does _example/broken_links/really_doesnt_exist.html:
p = Path("_example/broken_links/really_doesnt_exist")
assert not p.exists()
assert not html_exists(p)
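The behaviour shown in these two checks could be reproduced with a helper along the following lines. This is a guess at the logic based only on the examples above, not the library's source; html_exists_sketch is a hypothetical name, and the temporary directory stands in for the example folder.

```python
from pathlib import Path
import tempfile

def html_exists_sketch(p):
    """Hypothetical stand-in: True if p exists or p with an .html suffix exists."""
    p = Path(p)
    return p.exists() or p.with_suffix(".html").exists()

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "test.html").write_text("<html></html>")
    print(html_exists_sketch(Path(d) / "test"))                 # True
    print(html_exists_sketch(Path(d) / "really_doesnt_exist"))  # False
```

Treating a missing extension as an implicit .html mirrors how many static site hosts serve /test from test.html.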
Since test.js does not exist in the _example/broken_links/ directory, broken_local returns this path:
broken_local(links)
assert not all([x.exists() for x in broken_local(links)])
Similarly, the URL http://somecdn.com/doesntexist.html doesn't exist, which is why it is returned by broken_urls:
assert broken_urls(links) == ['http://somecdn.com/doesntexist.html']
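One way to decide whether a single external URL is broken is to send a HEAD request and treat any failure as broken. The sketch below is an illustration under that assumption and says nothing about how broken_urls works internally; url_is_broken and its opener parameter are hypothetical, the latter added so the function can be exercised without network access.

```python
import urllib.request

def url_is_broken(url, timeout=5, opener=urllib.request.urlopen):
    """Hypothetical helper: True if a HEAD request to url fails."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout):
            return False
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return True
```

Some servers reject HEAD requests, so a more careful checker might retry with GET before reporting a URL as broken.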
link_check(path='_example/broken_links/', host='fastlinkcheck.com')
Similarly, if there are no broken links, link_check will not return any data. In this case, there are no broken links in the directory _example/no_broken_links/:
assert not link_check(path='_example/no_broken_links/')
You can choose to ignore links by supplying a plain-text file containing a list of URLs to ignore. For example, the file linkcheck.rc contains a list of URLs I want to ignore:
print((path/'linkcheck.rc').read_text())
In this case _example/broken_links/test.js will be filtered out from the list:
link_check(path='_example/broken_links/', host='fastlinkcheck.com', config_file='_example/broken_links/linkcheck.rc')
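The filtering step can be pictured as reading one entry per line and dropping any match. The sketch below assumes exact string matching and one entry per line, which may differ from what link_check actually does; read_ignore_list and drop_ignored are hypothetical helpers, and the file contents are made up for the demonstration.

```python
from pathlib import Path
import tempfile

def read_ignore_list(config_file):
    """Hypothetical helper: non-empty, stripped lines of the ignore file."""
    text = Path(config_file).read_text()
    return {line.strip() for line in text.splitlines() if line.strip()}

def drop_ignored(found, ignored):
    """Keep only the entries whose string form is not in the ignore set."""
    return [item for item in found if str(item) not in ignored]

with tempfile.TemporaryDirectory() as d:
    rc = Path(d) / "linkcheck.rc"
    rc.write_text("test.js\n\nhttp://example.com/skip\n")  # illustrative contents
    ignored = read_ignore_list(rc)
    found = ["test.js", "http://somecdn.com/doesntexist.html"]
    print(drop_ignored(found, ignored))  # ['http://somecdn.com/doesntexist.html']
```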
link_check can also be called from the command line.
Note: the ! command in Jupyter allows you to run shell commands.
The -h or --help flag will allow you to see the command line docs:
!link_check -h