Find links in an HTML file

We can use get_links to parse an HTML file for different types of links. For example, these are the contents of _example/broken_links/test.html:

example = Path('_example/broken_links/test.html')
print(example.read_text())

<a href="//somecdn.com/doesntexist.html"></a>
<a href="http://www.bing.com"></a>
<script src="test.js"></script>
<img src="http://fastlinkcheck.com/test.html" />
<script src="/test"></script>

Calling get_links with the above file path returns a list of the links found in that file:

links = get_links(example)
test_eq(set(links), {'test.js',
                     '//somecdn.com/doesntexist.html',
                     'http://www.bing.com','http://fastlinkcheck.com/test.html',
                     '/test'})
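
To make the idea concrete, here is a minimal sketch of how href and src values could be pulled out of HTML using only the standard library. This is an illustration, not the implementation of get_links; the names LinkCollector and collect_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href and src attribute values from every tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def collect_links(html):
    """Return all href/src values in `html`, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Same markup as _example/broken_links/test.html shown above
sample = """<a href="//somecdn.com/doesntexist.html"></a>
<a href="http://www.bing.com"></a>
<script src="test.js"></script>
<img src="http://fastlinkcheck.com/test.html" />
<script src="/test"></script>"""

found = collect_links(sample)
```

Self-closing tags such as the img element above are also routed through handle_starttag by HTMLParser, so they need no special handling.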

The keys of the dict returned by local_urls are the links found in HTML files, and each value is a list of the paths of the files where that link was found.

Furthermore, local links are returned as Path objects, whereas external URLs are strings. For example, notice how the link:

http://fastlinkcheck.com/test.html

is resolved to a local path, because the host parameter supplied to local_urls, fastlinkcheck.com, matches the host in the link:

path = Path('./_example/broken_links/')
links = local_urls(path, host='fastlinkcheck.com')
links
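
The host matching described above can be sketched with urllib.parse. This is a simplified illustration, not the code behind local_urls: resolve_link is a hypothetical name, host is assumed to be a bare hostname with no path component, and protocol-relative links are assumed to get an http: prefix.

```python
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlparse


def resolve_link(link, source_file, root, host=None):
    """Hypothetical helper: return a Path for local links, a string for external URLs."""
    parsed = urlparse(link)
    if parsed.netloc:
        if host and parsed.netloc == host:
            # The link points at our own site, so map it into the local tree
            return root / parsed.path.lstrip("/")
        # External URL; assumption: protocol-relative links default to http
        return link if parsed.scheme else "http:" + link
    if link.startswith("/"):
        # Root-relative link
        return root / link.lstrip("/")
    # Link relative to the file that contains it
    return source_file.parent / link


root = Path("_example/broken_links")
src = root / "test.html"
raw_links = ["//somecdn.com/doesntexist.html", "http://www.bing.com", "test.js",
             "http://fastlinkcheck.com/test.html", "/test"]

# Map each resolved link to the list of files it was found in
found_in = defaultdict(list)
for link in raw_links:
    found_in[resolve_link(link, src, root, host="fastlinkcheck.com")].append(src)
```

Keeping local links as Path objects and external URLs as strings lets later steps choose between a filesystem check and a network request based on the key's type.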

Finding broken links

The path _example/broken_links/test doesn't exist, but _example/broken_links/test.html does:

p = Path("_example/broken_links/test")

assert not p.exists()
assert html_exists(p)

The path _example/broken_links/really_doesnt_exist doesn't exist, and neither does _example/broken_links/really_doesnt_exist.html:

p = Path("_example/broken_links/really_doesnt_exist")
assert not p.exists()
assert not html_exists(p)
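
The behavior shown by these assertions can be approximated as "the path exists, or the same path with .html appended exists". Here is a minimal sketch under that assumption; html_exists_sketch is a hypothetical name, and the real html_exists may handle more cases.

```python
import tempfile
from pathlib import Path


def html_exists_sketch(p):
    """Hypothetical helper: True if `p` exists, or if `p` plus '.html' exists."""
    p = Path(p)
    # Append '.html' rather than replace a suffix, so 'page.v2' maps to 'page.v2.html'
    return p.exists() or p.with_name(p.name + ".html").exists()


# Build a throwaway directory that mirrors the example above
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "test.html").write_text("<html></html>")
    extensionless_ok = html_exists_sketch(root / "test")
    missing = html_exists_sketch(root / "really_doesnt_exist")
```

Web servers often serve test.html for a request to /test, which is why an extensionless link should not be reported as broken when the .html file is present.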

Since test.js does not exist in the _example/broken_links/ directory, broken_local returns this path:

broken_local(links)

(#1) [Path('/Users/hamelsmu/github/fastlinkcheck/_example/broken_links/test.js')]

# None of the paths reported as broken should exist on disk
assert not any(x.exists() for x in broken_local(links))

Similarly, the URL http://somecdn.com/doesntexist.html doesn't exist, which is why it is returned by broken_urls:

assert broken_urls(links) == ['http://somecdn.com/doesntexist.html']

link_check checks for broken links recursively in a directory:

link_check(path='_example/broken_links/', host='fastlinkcheck.com')

If there are no broken links, link_check will not return any data. In this case, there are no broken links in the directory _example/no_broken_links/:

assert not link_check(path='_example/no_broken_links/')

Ignore links with a configuration file

You can choose to ignore links by supplying a plain-text file that lists the paths and URLs to skip, one per line. For example, the file linkcheck.rc contains a list of links I want to ignore:

print((path/'linkcheck.rc').read_text())

test.js
https://www.google.com

In this case _example/broken_links/test.js will be filtered out of the results:

link_check(path='_example/broken_links/', host='fastlinkcheck.com', config_file='_example/broken_links/linkcheck.rc')
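
As a rough sketch of how such a file could be applied, the snippet below parses the entries and drops matching items from a list of broken links. It is not fastlinkcheck's implementation: parse_ignore_file is a hypothetical name, and resolving non-URL entries relative to the config file's directory is an assumption made for this example.

```python
from pathlib import Path
from urllib.parse import urlparse


def parse_ignore_file(text, base_dir):
    """Hypothetical helper: turn ignore-file text into a set of Paths and URL strings."""
    ignored = set()
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if urlparse(entry).scheme in ("http", "https"):
            ignored.add(entry)  # external URL, kept as a string
        else:
            # Assumption: local entries are relative to the config file's directory
            ignored.add(Path(base_dir) / entry)
    return ignored


base = Path("_example/broken_links")
ignored = parse_ignore_file("test.js\nhttps://www.google.com\n", base)

# Broken links as reported earlier: one local path and one external URL
broken = [base / "test.js", "http://somecdn.com/doesntexist.html"]
remaining = [b for b in broken if b not in ignored]
```

Storing the entries in a set keeps the membership test cheap even when the ignore list grows.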

link_check can also be called from the command line.

Note: the ! prefix in Jupyter allows you to run shell commands.

The -h or --help flag shows the command-line docs:

!link_check -h

usage: link_check [-h] [--host HOST] [--config_file CONFIG_FILE] [--pdb]
                  [--xtra XTRA]
                  path

Check for broken links recursively in `path`.

positional arguments:
  path                  Root directory searched recursively for HTML files

optional arguments:
  -h, --help            show this help message and exit
  --host HOST           Host and path (without protocol) of web server
  --config_file CONFIG_FILE
                        Location of file with urls to ignore
  --pdb                 Run in pdb debugger (default: False)
  --xtra XTRA           Parse for additional args (default: '')

fastlinkcheck API

Find links in an HTML file

`get_links`[source]

`local_urls`[source]

Finding broken links

`html_exists`[source]

`broken_local`[source]

`broken_urls`[source]

`link_check`[source]

Ignore links with a configuration file
