brendanheywood
Given that this would be a separate scraping process, I'm starting to think that a custom endpoint on our side, hit once a week or so, would be better than a live API.

In the meantime you guys can manually piece together a static text file (CSV, JSON, or directly in the longtail format) with exactly the bits of data that you need, adding fields and test data as you go. When you are close to getting it all working and the format has stabilized a bit, we (i.e. thecrag) will knock up a single API endpoint which replicates that format.

That way you are focusing on the IA logic and not on a bunch of scraping and parsing code, and we are focused on just pumping out the data in the right shape without having to think too much about what that shape is and what field goes where.
posted by brendanheywood 3 years and 4 months ago

jdorw
How would the API work? Just hit it once a week to get a whole new data set?
posted by jdorw (Staff) 3 years and 4 months ago
brendanheywood
Yeah, that's what I had in mind; it seems the easiest. Even monthly could be fine, as this data is very slow moving at this high level. This would work well for all crags, which is a fairly limited data set, currently ~5,000 records worldwide (and we would probably return only a high-quality subset of these). It probably won't work so well for route-level data (~300,000 records), but I'm a little dubious about the value of IAs for individual routes. Perhaps we could only build IAs for routes which are iconic: popular 2 or 3 star routes.
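A minimal sketch of the "hit it once a week" idea on the consumer side, assuming the data set is kept as a local dump file and re-downloaded only when that file is missing or more than a week old. The function name `needs_refresh` and the file-based cache are assumptions for illustration, not part of any agreed design:

```python
import os
import time

WEEK_SECONDS = 7 * 24 * 60 * 60  # refresh interval discussed above


def needs_refresh(path, max_age_seconds=WEEK_SECONDS, now=None):
    """Return True if the local dump is missing or older than max_age_seconds."""
    if not os.path.exists(path):
        return True
    # `now` can be injected for testing; defaults to the current time.
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age_seconds
```

For a monthly refresh, pass a larger `max_age_seconds`.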
posted by brendanheywood 3 years and 4 months ago
floey
jdorw, if theCrag will have the custom endpoint, should the IA be a Spice, or is the Longtail format still applicable?
posted by floey 3 years and 3 months ago
floey
I created a scraper/crawler for the site and here is the sample output in JSON format for my local crag, The Rumbling Bald.

{"Rumbling Bald":{"styles":{"Boulder":"32%","Unknown":"20%","Sport":"2%","Trad":"44%"},"breadcrumbs":["Rumbling Bald","North Carolina","USA","North America"],"areas":[{"routes":"9","ticks":"129","height":"130ft","name":"Cereal Buttress"},{"height":"","name":"Comotose Area","routes":"0","ticks":"0"},{"routes":"0","ticks":"0","name":"Flakeview Area","height":""},{"height":"","name":"Lakeview Area","routes":"0","ticks":"0"},{"height":"69ft","name":"Screamweaver Area","routes":"3","ticks":"6"},{"ticks":"0","routes":"0","height":"","name":"Cereal Wall"}],"type":"crag","url":"http://www.thecrag.com/climbing/united-s...","number of routes":"34"}}

The scraper could cover a larger area, but I did just one crag so we would have a little data to start with.
posted by floey 3 years and 3 months ago
brendanheywood
Is this running from Quebec in Safari 534.34 by any chance, and is it still running right now?

We've seen a massive spike in traffic hitting URLs like this:

http://www.thecrag.com/climbing/a/b/area...
http://www.thecrag.com/climbing/a/b/area...
http://www.thecrag.com/climbing/a/b/area...

This is quite odd, as that URL format isn't something you'd ever find by following links; it works almost by accident.

If it is, please let me know so I don't have to keep chasing this traffic, and please ramp it down a *lot*. Ideally just scrape the page without any assets (i.e. don't load our Google Analytics beacon).
posted by brendanheywood 3 years and 3 months ago
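A minimal sketch of what a ramped-down scrape might look like, assuming a plain HTTP fetch of the HTML document only (no images, scripts, or analytics beacon are loaded because nothing renders the page) and a fixed pause between requests. The user agent string, the 10 second delay, and the function names are assumptions for illustration:

```python
import time
import urllib.request

# Hypothetical identifier so the site owner can recognise the traffic.
USER_AGENT = "crag-ia-scraper (contact: you@example.com)"


def fetch_html(url):
    """Fetch only the HTML document; linked assets are never requested."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(urls, delay_seconds=10.0, fetch=fetch_html, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests to keep load low."""
    pages = {}
    for i, url in enumerate(urls):
        if i:
            sleep(delay_seconds)  # no pause before the first request
        pages[url] = fetch(url)
    return pages
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without touching the network.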
floey
That is quite odd. I haven't been running anything since I posted, and my traffic was minimal and confined to that single URL. I don't live in Quebec or have Safari installed either. I do hope you figure out what the problem is though.
posted by floey 3 years and 3 months ago
jdorw
Great, thanks! All of the data that you want to show will have to go into a single "paragraph" field, formatted the way you want it to appear on the site. For now that can only be plain text and newlines.
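A rough sketch of how one scraped record could be flattened into plain text with newlines, using a trimmed copy of the Rumbling Bald sample posted earlier in this thread. The line layout and the helper name `crag_paragraph` are assumptions for illustration; the actual output file format is still to be agreed:

```python
import json

# Trimmed copy of the sample record posted earlier in this thread.
SAMPLE = '''{"Rumbling Bald": {
  "styles": {"Boulder": "32%", "Unknown": "20%", "Sport": "2%", "Trad": "44%"},
  "breadcrumbs": ["Rumbling Bald", "North Carolina", "USA", "North America"],
  "areas": [
    {"routes": "9", "ticks": "129", "height": "130ft", "name": "Cereal Buttress"},
    {"routes": "0", "ticks": "0", "height": "", "name": "Cereal Wall"}
  ],
  "type": "crag", "number of routes": "34"}}'''


def crag_paragraph(name, crag):
    """Flatten one crag record into plain text separated by newlines."""
    lines = [
        f"{name} ({crag['type']})",
        "Location: " + ", ".join(crag["breadcrumbs"][1:]),
        "Routes: " + crag["number of routes"],
        "Styles: " + ", ".join(f"{s} {p}" for s, p in sorted(crag["styles"].items())),
    ]
    for area in crag["areas"]:
        if area["routes"] == "0":
            continue  # skip areas with no recorded routes
        height = f", {area['height']}" if area["height"] else ""
        lines.append(f"{area['name']}: {area['routes']} routes{height}")
    return "\n".join(lines)


for name, crag in json.loads(SAMPLE).items():
    print(crag_paragraph(name, crag))
```

Skipping empty areas is a design choice here to keep the paragraph short; they could just as easily be listed.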

Feel free to make a pull request at any time. It doesn't have to be completely finished, and it might be easier to go over the output file format on GitHub.
posted by jdorw (Staff) 3 years and 3 months ago
floey
Pull request has been made:
https://github.com/duckduckgo/zeroclicki...
posted by floey 3 years and 3 months ago