API for fast local and online link checking

get_links(fn)

List of all links in file fn

We can use get_links to parse an HTML file for different types of links. For example, these are the contents of ./_example/broken_links/test.html:

example = Path('_example/broken_links/test.html')
print(example.read_text())
<a href="//somecdn.com/doesntexist.html"></a>
<a href="http://www.bing.com"></a>
<script src="test.js"></script>
<img src="http://fastlinkcheck.com/test.html" />
<script src="/test"></script>

Calling get_links with the above file path will return a list of links:

links = get_links(example)
test_eq(set(links), {'test.js',
                     '//somecdn.com/doesntexist.html',
                     'http://www.bing.com','http://fastlinkcheck.com/test.html',
                     '/test'})
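To make the extraction step concrete, here is a minimal sketch that collects href and src attribute values with Python's standard html.parser module. It is not fastlinkcheck's implementation, which may work differently; LinkCollector and collect_links are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: gather every href/src attribute value it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name in ('href', 'src') and value:
                self.links.append(value)

def collect_links(html):
    """Return all href/src values found in an HTML string (illustrative only)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Same markup as the test.html example above
html = '''<a href="//somecdn.com/doesntexist.html"></a>
<a href="http://www.bing.com"></a>
<script src="test.js"></script>
<img src="http://fastlinkcheck.com/test.html" />
<script src="/test"></script>'''
found = collect_links(html)
```

The sketch returns raw attribute values without resolving them, which mirrors the unresolved strings in the get_links output shown above.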

local_urls[source]

local_urls(path:Path, host:str)

returns a dict mapping all HTML files in path to a list of locally-resolved links in that file

The keys of the dict returned by local_urls are the links found in the HTML files, and each value is the list of files in which that link was found.

Furthermore, local links are returned as Path objects, whereas external URLs are strings. For example, notice how the link:

http://fastlinkcheck.com/test.html

is resolved to a local path because the host parameter supplied to local_urls, fastlinkcheck.com, matches the host in the link:

path = Path('./_example/broken_links/')
links = local_urls(path, host='fastlinkcheck.com')
links
  • Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js') was found in the following pages:

    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
  • Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html') was found in the following pages:

    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
  • Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test') was found in the following pages:

    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
  • 'http://www.bing.com' was found in the following pages:

    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
  • 'http://somecdn.com/doesntexist.html' was found in the following pages:

    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
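The resolution rules visible in this output can be sketched with urllib.parse. This is an assumption-laden illustration, not the library's code: resolve_link is a hypothetical helper, the site root /site is made up, and host is treated as a bare hostname even though the real parameter may also include a path.

```python
from pathlib import Path
from urllib.parse import urlparse

def resolve_link(link, page, root, host):
    """Hypothetical helper: return a Path for local links and a str for external URLs."""
    parsed = urlparse(link)
    if parsed.netloc:
        if parsed.netloc == host:
            # The link points at our own host, so map its path onto the site root
            return root / parsed.path.lstrip('/')
        # External link; give protocol-relative links an explicit scheme
        return link if parsed.scheme else 'http:' + link
    if link.startswith('/'):
        # Root-relative link
        return root / parsed.path.lstrip('/')
    # Link relative to the page that contains it
    return page.parent / parsed.path

root = Path('/site')            # hypothetical site root
page = root / 'test.html'       # page in which the links were found
raw = ['test.js', '//somecdn.com/doesntexist.html', 'http://www.bing.com',
       'http://fastlinkcheck.com/test.html', '/test']
resolved = {l: resolve_link(l, page, root, 'fastlinkcheck.com') for l in raw}
```

Under these assumptions, three of the five links become Path objects and two stay as strings, matching the split between local paths and external URLs in the output above.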

html_exists[source]

html_exists(o)

If a path without a suffix is provided, see if the same path with a .html suffix exists

The path _example/broken_links/test doesn't exist, but _example/broken_links/test.html does:

p = Path("_example/broken_links/test")

assert not p.exists()
assert html_exists(p)

The path _example/broken_links/really_doesnt_exist doesn't exist, and neither does _example/broken_links/really_doesnt_exist.html:

p = Path("_example/broken_links/really_doesnt_exist")
assert not p.exists()
assert not html_exists(p)
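The behaviour described above can be approximated in a few lines. The following is a sketch under the assumption that the check is "the path exists, or it has no suffix and the same path with .html exists"; html_exists_sketch is a hypothetical stand-in and may differ from the real html_exists. A temporary directory is used so the example does not depend on the _example folder.

```python
import tempfile
from pathlib import Path

def html_exists_sketch(p):
    """Hypothetical stand-in: True if p exists, or if p has no suffix and p + '.html' exists."""
    if p.exists():
        return True
    return p.suffix == '' and p.with_suffix('.html').exists()

with tempfile.TemporaryDirectory() as d:
    d = Path(d)
    (d / 'test.html').write_text('<html></html>')
    # 'test' is missing, but 'test.html' is present
    found = html_exists_sketch(d / 'test')
    # neither 'really_doesnt_exist' nor 'really_doesnt_exist.html' is present
    missing = html_exists_sketch(d / 'really_doesnt_exist')
```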

broken_local[source]

broken_local(links, ignore_paths=None)

List of items in keys of links that are Paths that do not exist

Since test.js does not exist in the _example/broken_links/ directory, broken_local returns this path:

broken_local(links)
(#1) [Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js')]
assert not all([x.exists() for x in broken_local(links)])

broken_urls[source]

broken_urls(links, ignore_urls=None)

List of items in keys of links that are URLs that return a failure status code

Similarly, the URL http://somecdn.com/doesntexist.html doesn't exist, which is why it is returned by broken_urls:

assert broken_urls(links) == ['http://somecdn.com/doesntexist.html']
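As a rough illustration of what "returns a failure status code" can mean, the sketch below issues a HEAD request with the standard urllib module and treats any status of 400 or above, or a request that cannot be completed, as broken. These are assumptions, not the library's documented rules; status_code and broken_urls_sketch are hypothetical names, and the get_status parameter exists only so the check can be exercised without network access.

```python
import urllib.error
import urllib.request

def status_code(url, timeout=5):
    """Return the HTTP status for url, or None if the request could not be completed."""
    req = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code   # the server answered with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None     # DNS failure, refused connection, timeout, or malformed URL

def broken_urls_sketch(urls, ignore_urls=None, get_status=status_code):
    """Hypothetical stand-in: list the URLs whose status indicates failure."""
    ignore = set(ignore_urls or [])
    broken = []
    for url in urls:
        if url in ignore:
            continue
        code = get_status(url)
        if code is None or code >= 400:
            broken.append(url)
    return broken

# Offline demonstration with made-up statuses instead of real requests
fake = {'http://www.bing.com': 200, 'http://somecdn.com/doesntexist.html': None}
result = broken_urls_sketch(fake, get_status=fake.get)
```

Passing the default get_status would perform real requests one at a time; a production checker would likely want to run them concurrently.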

link_check(path:"Root directory searched recursively for HTML files", host:"Host and path (without protocol) of web server"=None, config_file:"Location of file with urls to ignore"=None)

Check for broken links recursively in path.

link_check(path='_example/broken_links/', host='fastlinkcheck.com')
  • 'http://somecdn.com/doesntexist.html' was found in the following pages:

    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
  • Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js') was found in the following pages:

    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html

If there are no broken links, link_check returns no results. For example, the directory _example/no_broken_links/ contains no broken links:

assert not link_check(path='_example/no_broken_links/')

You can ignore specific links by supplying a plain-text file that lists them, one per line. For example, the file linkcheck.rc contains the links I want to ignore:

print((path/'linkcheck.rc').read_text())
test.js
https://www.google.com

In this case _example/broken_links/test.js is filtered out of the results:

link_check(path='_example/broken_links/', host='fastlinkcheck.com', config_file='_example/broken_links/linkcheck.rc')
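One plausible way to interpret such an ignore file is sketched below: entries that start with http:// or https:// are treated as URLs, and everything else as a path relative to the directory being checked. This split is an assumption made for illustration, and split_ignore_list is a hypothetical helper, not part of fastlinkcheck.

```python
from pathlib import Path

def split_ignore_list(text, base):
    """Hypothetical helper: split ignore-file lines into (paths, urls)."""
    paths, urls = set(), set()
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue                    # skip blank lines
        if entry.startswith(('http://', 'https://')):
            urls.add(entry)
        else:
            paths.add(base / entry)     # assume paths are relative to the checked directory
    return paths, urls

base = Path('_example/broken_links')
ignore_paths, ignore_urls = split_ignore_list('test.js\nhttps://www.google.com\n', base)

# Filter a list of broken links the way an ignore file might
broken = [base / 'test.js', 'http://somecdn.com/doesntexist.html']
remaining = [o for o in broken if o not in ignore_paths and o not in ignore_urls]
```

With the two-line file shown above, only the somecdn.com URL would remain in this sketch, since test.js is on the ignore list.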

link_check can also be called from the command line. The -h or --help flag shows the command-line docs:

!link_check -h
usage: link_check [-h] [--host HOST] [--config_file CONFIG_FILE] [--pdb]
                  [--xtra XTRA]
                  path

Check for broken links recursively in `path`.

positional arguments:
  path                  Root directory searched recursively for HTML files

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           Host and path (without protocol) of web server
  --config_file CONFIG_FILE
                        Location of file with urls to ignore
  --pdb                 Run in pdb debugger (default: False)
  --xtra XTRA           Parse for additional args (default: '')