In a few past projects I needed to scrape data from websites that did not offer an API. I was using C# at the time, and scraping the web with Html Agility Pack was quite easy.
I now spend most of my time in macOS because of work projects, so when I needed to do some web scraping again I did not want to install and set up Mono just to do it in C#. I decided to go with Swift, as I am now quite comfortable with the language after 4 years of using it daily.
SwiftSoup
The first thing I needed to do was find a library to parse HTML, a Swift equivalent of Html Agility Pack. I found SwiftSoup.
SwiftSoup allows you to access the DOM in HTML documents and also in HTML fragments. The usage is quite simple; you just need to know a thing or two about HTML.
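To give a feel for what working with a fragment looks like, here is a minimal sketch. The HTML snippet, its class names, and the `found` array are made up for illustration; only `parseBodyFragment`, `select`, `text` and `attr` come from SwiftSoup.

```swift
import SwiftSoup

// Made-up HTML fragment, used only for this illustration.
let html = """
<ul id="langs">
  <li class="compiled"><a href="/swift">Swift</a></li>
  <li class="compiled"><a href="/csharp">C#</a></li>
  <li class="scripted"><a href="/python">Python</a></li>
</ul>
"""

// Collected results, one "text -> href" string per matching link.
var found: [String] = []

do {
    // parseBodyFragment wraps the fragment in a full document (html, head, body).
    let doc = try SwiftSoup.parseBodyFragment(html)
    // CSS selector: links inside list items that have the class "compiled".
    let links = try doc.select("li.compiled a")
    for link in links {
        let title = try link.text()
        let href = try link.attr("href")
        found.append("\(title) -> \(href)")
    }
} catch {
    print("Parsing failed: \(error)")
}

print(found)
```

The selectors are ordinary CSS selectors, so anything you can match in the browser developer tools should translate more or less directly.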
Example
Let’s say you want to parse the Hacker News main page and scrape posts containing some specific keywords.
This is quite an artificial example but the idea is simple. You use the developer tools in your browser of choice to see the HTML of the parts of a website that you are interested in, and then try to get to them by descending and filtering the DOM using SwiftSoup.
You first need to download the page and parse it:
import Foundation
import SwiftSoup
let content = try String(contentsOf: URL(string: "https://news.ycombinator.com/")!)
let doc: Document = try SwiftSoup.parse(content)
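Both of those calls can throw, and the URL is force-unwrapped, so in a real script it may be worth wrapping them. The following is just one possible sketch: `ScrapeError` and `loadDocument(from:)` are hypothetical names introduced here, and UTF-8 is assumed as the page encoding.

```swift
import Foundation
import SwiftSoup

// Hypothetical error type for this sketch; not part of SwiftSoup.
enum ScrapeError: Error {
    case invalidURL(String)
}

// Hypothetical helper: reads a page and parses it into a SwiftSoup Document.
func loadDocument(from address: String) throws -> Document {
    guard let url = URL(string: address) else {
        throw ScrapeError.invalidURL(address)
    }
    // String(contentsOf:encoding:) blocks until the content is read,
    // which is usually acceptable for a small command line script.
    let content = try String(contentsOf: url, encoding: .utf8)
    return try SwiftSoup.parse(content)
}

do {
    let doc = try loadDocument(from: "https://news.ycombinator.com/")
    // Print the page title as a quick sanity check that parsing worked.
    let title = try doc.title()
    print(title)
} catch {
    print("Could not load the page: \(error)")
}
```

With this shape, a network failure or a bad address ends up in the `catch` block as a readable message instead of crashing the script.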
Looking at the HTML, you can see it uses a table layout and all the posts are in rows of a table with a class called itemlist.