
What is the easiest way to scrape websites to put into a script filter?



I wanted to make a list filter for Project Euler (https://projecteuler.net/archives) that would give me a list of every problem that exists on the site.

 

I can make a list filter and manually add each link with an appropriate title, kind of like I have done here (https://github.com/nikitavoloboev/alfred-ask-create-share). However, this is very time consuming, and Project Euler doesn't offer any API.

 

Can someone recommend a way to get all of these problem titles and their respective links under one script filter?

[screenshot of the Project Euler archives problem list]

 

Thank you for any help. 


There is no “easiest way”. Each site is different and as such will require a different approach. The easiest way is the one easiest for you.


I’m a fan of Nokogiri, as I tend to use ruby for this. This workflow should do what you want. It currently only retrieves the first page of results.


For when the link expires, this is the code:

require 'json'
require 'nokogiri'

base_url = 'https://projecteuler.net'
archives_url = "#{base_url}/archives"
problem_url = "#{base_url}/problem="

# Grab the problems table from the archives page and drop its header row
table = Nokogiri::HTML(%x(curl "#{archives_url}")).at('#problems_table')
table.at('tr').remove

script_filter_items = []

# Each remaining row is a problem: number, title, and solver count
table.css('tr').each do |exercise|
  number = exercise.at('td').text
  name = exercise.css('td')[1].text
  solvers = exercise.css('td')[2].text

  url = "#{problem_url}#{number}"

  script_filter_items.push(title: name, subtitle: "#{solvers} people solved this", arg: url)
end

# Print the Script Filter JSON for Alfred
puts({ items: script_filter_items }.to_json)

I do realise %x(curl "#{archives_url}") is a tad ridiculous, but the system Ruby was giving me an SSL error for this site (the latest Ruby does not), so I did that as a quick patch.
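If a newer Ruby is available (via Homebrew, for example), one way around the shell-out might be open-uri. A minimal sketch of that swap, assuming that Ruby's OpenSSL can talk to the site:

require 'nokogiri'
require 'open-uri'

archives_url = 'https://projecteuler.net/archives'

# Fetch and parse the archives page without shelling out to curl
# (URI.open needs Ruby 2.5+; older versions use open() from open-uri)
table = Nokogiri::HTML(URI.open(archives_url).read).at('#problems_table')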


Thank you a lot, @vitor. I will try to extend it to give a list of all the problems from all the pages. I am also really a fan of the approach of downloading these links to a local SQL database and then just reading from the database, like this workflow does (https://github.com/lox/alfred-github-jump).
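A rough, untested sketch of what looping over every archive page could look like, building on vitor's script. The "#{archives_url};page=N" URL format and the hard-coded page count are assumptions, so check the site's pagination links and adjust:

require 'json'
require 'nokogiri'

base_url = 'https://projecteuler.net'
archives_url = "#{base_url}/archives"
problem_url = "#{base_url}/problem="
last_page = 17 # Placeholder: set this to (or scrape) the number of archive pages

script_filter_items = []

(1..last_page).each do |page|
  # Assumed pagination scheme; the first page is just /archives
  page_url = page == 1 ? archives_url : "#{archives_url};page=#{page}"

  table = Nokogiri::HTML(%x(curl "#{page_url}")).at('#problems_table')
  next if table.nil?
  table.at('tr').remove # Drop the header row

  table.css('tr').each do |exercise|
    cells = exercise.css('td')
    script_filter_items.push(
      title: cells[1].text,
      subtitle: "#{cells[2].text} people solved this",
      arg: "#{problem_url}#{cells[0].text}"
    )
  end
end

puts({ items: script_filter_items }.to_json)

Caching the result locally (a plain JSON file or the SQLite approach mentioned above) would then just be a matter of writing script_filter_items out once and reading it back on later runs.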

 

Also I have a question. You use transfer.sh to share files and workflows. Do you have any handy workflows built for it? :)

 

For example, a file filter that takes a folder or a file and gives back a transfer.sh link.
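For reference, the upload step such a file filter would need is small. A minimal sketch in Ruby, assuming transfer.sh still accepts a plain HTTP PUT of the file contents and returns the download link in the response body (a folder would need to be zipped first):

require 'net/http'
require 'uri'

# Upload a single file to transfer.sh and return the shareable link
def transfer_upload(path)
  uri = URI("https://transfer.sh/#{File.basename(path)}")
  request = Net::HTTP::Put.new(uri)
  request.body = File.binread(path)

  Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request).body.strip # The response body is the download URL
  end
end

puts transfer_upload(ARGV[0]) # e.g. the path Alfred's file filter passes in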

