API for fast local and online link checking

get_links(fn)

List of all links in file fn

We can use get_links to parse an HTML file for different types of links. For example, these are the contents of _example/broken_links/test.html:

example = Path('_example/broken_links/test.html')

<a href="//somecdn.com/doesntexist.html"></a>
<a href="http://www.bing.com"></a>
<script src="test.js"></script>
<script src="/test"></script>


Calling get_links with the above file path will return a list of links:

links = get_links(example)
# links includes, among others: '//somecdn.com/doesntexist.html', '/test'
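
To make the idea concrete, here is a minimal sketch of how href and src attributes could be collected using only Python's standard html.parser module. This is not fastlinkcheck's implementation; LinkCollector and collect_links are hypothetical names used only for this illustration, and the HTML string is the sample file shown above.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href/src attribute values from start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def collect_links(html: str) -> list:
    """Return every href/src value found in the HTML text, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

html = '''
<a href="//somecdn.com/doesntexist.html"></a>
<a href="http://www.bing.com"></a>
<script src="test.js"></script>
<script src="/test"></script>
'''
found = collect_links(html)
```

The sketch returns raw attribute values only; deciding which of them are local and which are external is a separate step.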


local_urls

local_urls(path:Path, host:str)

returns a dict mapping all HTML files in path to a list of locally-resolved links in that file

Note that the keys of the dict returned by local_urls are the links found in HTML files, and each value is a list of the paths of the files in which that link was found.

Furthermore, local links are returned as Path objects, whereas external URLs are returned as strings. For example, notice how the link

http://fastlinkcheck.com/test.html

is resolved to a local path, because the host parameter supplied to local_urls, fastlinkcheck.com, matches the host in the link:

path = Path('./_example/broken_links/')
links = local_urls(path, host='fastlinkcheck.com')
links


• /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html

• /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html

• /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
• 'http://www.bing.com' was found in the following pages:
    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
• 'http://somecdn.com/doesntexist.html' was found in the following pages:
    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html
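
The Path-versus-string distinction described above can be sketched with urllib.parse. The function below is a hypothetical resolver written for illustration, not the library's code. It assumes host is a bare hostname (no path component), ignores query strings and fragments, and defaults protocol-relative links to http.

```python
from pathlib import Path
from urllib.parse import urlparse

def resolve_link(link: str, root: Path, host: str, fname: Path):
    """Hypothetical resolver: return a Path for relative or same-host links,
    and a URL string for links pointing at any other host."""
    parsed = urlparse(link)
    if parsed.netloc and parsed.netloc != host:
        # External link: keep it as a string, defaulting the scheme to http
        scheme = parsed.scheme or "http"
        return f"{scheme}://{parsed.netloc}{parsed.path}"
    if parsed.netloc == host or parsed.path.startswith("/"):
        # Same-host or root-relative link: resolve against the site root
        return root / parsed.path.lstrip("/")
    # Relative link: resolve against the directory of the containing file
    return fname.parent / parsed.path

root = Path("_example/broken_links")
page = root / "test.html"
same_host = resolve_link("http://fastlinkcheck.com/test.html", root, "fastlinkcheck.com", page)
external = resolve_link("//somecdn.com/doesntexist.html", root, "fastlinkcheck.com", page)
```

Here same_host is a Path under root because its host matches, while external stays a string.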

html_exists

html_exists(o)

If a path without a suffix is provided, see if the same path with a .html suffix exists

The path _example/broken_links/test doesn't exist, but _example/broken_links/test.html does:

p = Path("_example/broken_links/test")

assert not p.exists()
assert html_exists(p)


The path _example/broken_links/really_doesnt_exist doesn't exist, and neither does _example/broken_links/really_doesnt_exist.html:

p = Path("_example/broken_links/really_doesnt_exist")
assert not p.exists()
assert not html_exists(p)
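
A check of this kind can be sketched with pathlib alone. The function html_exists_sketch below is a hypothetical stand-in, not the library's source. It assumes that a path which already exists should also count as existing, and it uses a temporary directory so the example does not depend on the _example/ files.

```python
from pathlib import Path
import tempfile

def html_exists_sketch(p: Path) -> bool:
    """Hypothetical stand-in: True if `p` exists, or if `p` has no suffix
    and the same path with a .html suffix exists."""
    p = Path(p)
    if p.exists():
        return True
    return p.suffix == "" and p.with_suffix(".html").exists()

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "test.html").write_text("<html></html>")
    has_html = html_exists_sketch(Path(d) / "test")     # only test.html is on disk
    missing = html_exists_sketch(Path(d) / "nothing")   # neither file is on disk
```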


broken_local

broken_local(links, ignore_paths=None)

List of items in keys of links that are Paths that do not exist

Since test.js does not exist in the _example/broken_links/ directory, broken_local returns this path:

broken_local(links)

(#1) [Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js')]
assert not all([x.exists() for x in broken_local(links)])
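
The filtering step can be sketched as a single list comprehension over the dict keys. The function missing_local and the sample dict are hypothetical and only illustrate the idea. The sketch assumes ignore_paths is an iterable of paths to skip and does not apply the .html fallback described for html_exists.

```python
from pathlib import Path

def missing_local(links: dict, ignore_paths=None) -> list:
    """Hypothetical sketch: keys of `links` that are Paths, are not ignored,
    and do not exist on disk."""
    ignore = {Path(p) for p in (ignore_paths or [])}
    return [k for k in links
            if isinstance(k, Path) and k not in ignore and not k.exists()]

links_example = {
    Path("definitely_missing_file.js"): [Path("page.html")],  # local link (Path key)
    "http://www.bing.com": [Path("page.html")],               # external link (str key), skipped
}
result = missing_local(links_example)
```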


broken_urls

broken_urls(links, ignore_urls=None)

List of items in keys of links that are URLs that return a failure status code

Similarly, the URL http://somecdn.com/doesntexist.html doesn't exist, which is why it is returned by broken_urls:

assert broken_urls(links) == ['http://somecdn.com/doesntexist.html']


link_check(path:"Root directory searched recursively for HTML files", host:"Host and path (without protocol) of web server"=None, config_file:"Location of file with urls to ignore"=None)

Check for broken links recursively in path.

link_check(path='_example/broken_links/', host='fastlinkcheck.com')

• 'http://somecdn.com/doesntexist.html' was found in the following pages:
    • /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html

• /Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.html

If there are no broken links, link_check does not return any data. For example, there are no broken links in the directory _example/no_broken_links/:

assert not link_check(path='_example/no_broken_links/')


You can choose to ignore certain links by supplying a plain-text file containing a list of URLs to ignore. For example, the file linkcheck.rc contains a list of URLs I want to ignore:

print((path/'linkcheck.rc').read_text())

test.js



In this case, test.js will be filtered out from the list of broken links:

link_check(path='_example/broken_links/', host='fastlinkcheck.com', config_file='_example/broken_links/linkcheck.rc')
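
The ignore file described above could be read along the following lines. This is a sketch under assumptions, not the library's parser. It assumes one entry per line, that entries starting with http:// or https:// are URLs, and that every other entry is a path relative to the directory being checked. The name read_ignore_file is hypothetical.

```python
from pathlib import Path
import tempfile

def read_ignore_file(config_file: Path, root: Path):
    """Hypothetical parser: split non-empty lines into local paths to skip
    (resolved relative to `root`) and URLs to skip."""
    text = Path(config_file).read_text()
    entries = [ln.strip() for ln in text.splitlines() if ln.strip()]
    urls = {e for e in entries if e.startswith(("http://", "https://"))}
    paths = {(Path(root) / e).resolve() for e in entries if e not in urls}
    return paths, urls

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    rc = root / "linkcheck.rc"
    rc.write_text("test.js\n\nhttp://somecdn.com/doesntexist.html\n")
    ignore_paths, ignore_urls = read_ignore_file(rc, root)
    expected_path = (root / "test.js").resolve()
```

The two sets returned here correspond to the ignore_paths and ignore_urls parameters of broken_local and broken_urls.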


link_check can also be called from the command line. The -h or --help flag shows the command-line docs:

!link_check -h

usage: link_check [-h] [--host HOST] [--config_file CONFIG_FILE] [--pdb]
                  [--xtra XTRA]
                  path

Check for broken links recursively in path.

positional arguments:
  path                  Root directory searched recursively for HTML files

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           Host and path (without protocol) of web server
  --config_file CONFIG_FILE
                        Location of file with urls to ignore
  --pdb                 Run in pdb debugger (default: False)
  --xtra XTRA           Parse for additional args (default: '')