jdorw
The re-use of names will be tough, but there are plenty of tricks I use on my end to handle those cases. We can work on that once there's a working example to test. I think just including a popularity score in the data would be good enough to at least get something started.

I like the polygons on the maps. I was playing around with editing the map for my local climbing spot. Really neat!

Is there a site map or index to start scraping from?

Floey, I think it might be easier if you just scrape your small local area to get an initial test data set. I can put that on a server and give us something to start testing with.
posted by jdorw (Staff) 3 years and 4 months ago

floey
How should I go about scraping my local area? I looked at the few longtails made so far and they didn't seem to have much in common in how they scrape.

Do we want to start small or should we pick a massive area like Yosemite as the initial data set?
posted by floey 3 years and 4 months ago
jdorw
I'd start small. We probably need a few iterations to figure out what data to get, so there's no point in scraping a big area.

There aren't a lot of similarities. That's a good (or bad?) part of these: you can write them in pretty much any language and any way you like.
posted by jdorw (Staff) 3 years and 4 months ago
brendanheywood
We don't implement a sitemap.xml and I'm not sure it's practical to: we have several hundred thousand nodes being updated hourly in an irregular way, so it would be constantly stale and take a lot of CPU to generate. We do implement an Atom feed of updates to the index, which is discoverable in the markup, and I *think* Google and other robots use it to detect timely changes to nodes and re-index them without resorting to a full re-scrape. Hard to know their internals.

But for our purposes I'd assume there is already lots of DDG scraping and indexing going on, and this should just be an extra step after that process rather than a separate scrape?

If this needs another scrape, or just for dev purposes, you could start at the world node, or any region or crag node, and then walk down through the index following the links in the left nav:

http://www.thecrag.com/climbing/world

Or pick some smaller region like

http://www.thecrag.com/climbing/australi...

Just to get started I'd probably only walk down as far as the highest crag node, which we internally call the TLC or Top Level Crag. We use this concept of a TLC for lots of reasons, e.g. what 'crag' does a route belong to? Crags can be nested, i.e. the Grampians or Yosemite are considered crags but are hundreds of km wide and contain smaller, well-known and named crags, e.g. Yosemite > El Capitan and Grampians > Hollow Mountain. This avoids the large number of cases where crags have child nodes with generic names like 'Left side / Right side', 'North / South' or 'Sunny side / Shady side'. Later we can go down to the route level and figure out how to filter out all the duplicates.
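To make the walk-down concrete, here is a rough Python sketch, not a definitive implementation: it assumes the index pages link to child nodes in the left nav and that we stop descending once a node looks like a crag rather than a region. The real TLC flag is internal to thecrag.com, so the `child_links()` selector and the `is_top_level_crag()` heuristic below are placeholders you'd swap for whatever the actual markup exposes.

```python
# Rough crawl sketch: walk from a region node down to the highest crag nodes.
# Selectors and the TLC heuristic are assumptions, not thecrag's real structure.
import time
import requests
from bs4 import BeautifulSoup

BASE = "http://www.thecrag.com"
START = BASE + "/climbing/world"   # or any smaller region node


def child_links(soup):
    """Yield child node URLs from the left-nav index links (selector is a guess)."""
    for a in soup.select("nav a[href^='/climbing/']"):
        yield BASE + a["href"]


def is_top_level_crag(soup):
    """Placeholder heuristic; replace with whatever signal the page really exposes."""
    return "crag" in soup.get_text().lower()


def walk(url, seen=None):
    seen = seen if seen is not None else set()
    if url in seen:
        return
    seen.add(url)
    time.sleep(2)                               # be polite, don't hammer the site
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    if is_top_level_crag(soup):
        print("TLC:", url)                      # stop descending at the TLC level
        return
    for child in child_links(soup):
        walk(child, seen)


if __name__ == "__main__":
    walk(START)
```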

posted by brendanheywood 3 years and 4 months ago
jdorw
This would be a completely separate scraper. All the fatheads and longtails use a source-specific scraper. Usually we just figure out what a reasonable update period is and manually run it on that schedule. I like the atom feed though; I'll try to think of a way we could use that.
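One possible use of the feed, sketched under assumptions: since Brendan says the feed is discoverable in the markup, we could look up its `<link rel="alternate">` tag and only re-scrape nodes that appear in it. The feed location and entry fields below are guesses, not a documented API.

```python
# Sketch: discover the Atom feed from the page markup and list recently
# updated node URLs. Feed location and entry fields are assumptions.
import requests
import feedparser
from bs4 import BeautifulSoup

BASE = "http://www.thecrag.com"


def discover_feed(url=BASE + "/climbing/world"):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    link = soup.find("link", rel="alternate", type="application/atom+xml")
    return link["href"] if link else None


def changed_nodes(feed_url, since):
    """Return entry links updated after `since` (a time.struct_time)."""
    feed = feedparser.parse(feed_url)
    return [e.link for e in feed.entries
            if getattr(e, "updated_parsed", None) and e.updated_parsed > since]
```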
posted by jdorw (Staff) 3 years and 4 months ago
brendanheywood
Given that this would be a separate scraping process, I'm starting to think a custom endpoint on our side, hit once a week or so, would be better (though not a live API).

In the meantime you guys can manually piece together a static text file, CSV or JSON, or something directly in the longtail format, with exactly the bits of data that you need, adding fields and test data as you go. When you're close to getting it all working and the format has stabilized a bit, we (i.e. thecrag) will knock up a single API endpoint that replicates that format.

That way you're focusing on the IA logic rather than on a bunch of scraping and parsing code, and we can focus on just pumping out the data in the right shape without having to think too much about what that shape is and which field goes where.
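As one possible shape for that hand-assembled intermediate file, here's a small sketch; the field names (crag name, location breadcrumb, route count, styles) are just drawn from this discussion, not an agreed schema, and the placeholder URL is not a real crag page.

```python
# Sketch of a hand-maintained intermediate CSV while the format settles.
# Field names are assumptions based on the thread, not an agreed schema.
import csv

crags = [
    {
        "name": "Rumbling Bald",
        "location": "North Carolina > USA > North America",
        "routes": 34,
        "styles": "Trad 44%, Boulder 32%, Sport 2%",
        "url": "http://www.thecrag.com/climbing/world",  # placeholder; real crag URL goes here
    },
]

with open("crags.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["name", "location", "routes", "styles", "url"])
    writer.writeheader()
    writer.writerows(crags)
```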
posted by brendanheywood 3 years and 4 months ago
jdorw
How would the API work? Just hit it once a week to get a whole new data set?
posted by jdorw (Staff) 3 years and 4 months ago
brendanheywood
Yeah, that's what I had in mind; it seems the easiest. Even monthly could be fine, as this data is very slow-moving at this high level. This would work well for all crags, which is a fairly limited data set, currently ~5,000 records worldwide (and we'd probably return only a high-quality subset of those). It probably won't work so well for route-level stuff (~300,000 records), but I'm a little dubious about the value of IAs for individual routes. Perhaps we could only make IAs for routes which are iconic, 2- or 3-star popular routes.
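A sketch of that weekly/monthly pull, run from cron to keep a local snapshot the IA build can read: hit one bulk endpoint, save the full crag dump, and rebuild the longtail data from it. The endpoint path and response shape below are hypothetical; nothing like this exists on thecrag yet.

```python
# Sketch of a periodic bulk refresh. The endpoint URL and JSON response
# shape are hypothetical placeholders.
import json
import requests

ENDPOINT = "http://www.thecrag.com/some/bulk/crag/export"   # hypothetical URL


def refresh_dataset(path="crags.json"):
    resp = requests.get(ENDPOINT, timeout=120)
    resp.raise_for_status()
    crags = resp.json()                  # assumed: a JSON list of ~5,000 crag records
    with open(path, "w") as fh:
        json.dump(crags, fh)
    return len(crags)
```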
posted by brendanheywood 3 years and 4 months ago
floey
jdorw, if theCrag will have the custom endpoint, should the IA be a Spice, or is the Longtail format still applicable?
posted by floey 3 years and 3 months ago
floey
I created a scraper/crawler for the site and here is the sample output in JSON format for my local crag, The Rumbling Bald.

{"Rumbling Bald":{"styles":{"Boulder":"32%","Unknown":"20%","Sport":"2%","Trad":"44%"},"breadcrumbs":["Rumbling Bald","North Carolina","USA","North America"],"areas":[{"routes":"9","ticks":"129","height":"130ft","name":"Cereal Buttress"},{"height":"","name":"Comotose Area","routes":"0","ticks":"0"},{"routes":"0","ticks":"0","name":"Flakeview Area","height":""},{"height":"","name":"Lakeview Area","routes":"0","ticks":"0"},{"height":"69ft","name":"Screamweaver Area","routes":"3","ticks":"6"},{"ticks":"0","routes":"0","height":"","name":"Cereal Wall"}],"type":"crag","url":"http://www.thecrag.com/climbing/united-s...","number of routes":"34"}}

The scraper could scrape a larger area, but I only did one crag so we'd have a small data set to start with.
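For anyone who wants to poke at the sample, here's a quick sanity check that loads the JSON above and pulls out the bits an IA answer would likely need (name, breadcrumb trail, route count, style mix). It only reads the structure shown in this post; the filename is just wherever you save the blob.

```python
# Load the sample crag JSON posted above and print the fields an IA might use.
import json

with open("rumbling_bald.json") as fh:     # the sample blob saved to a file
    data = json.load(fh)

for name, crag in data.items():
    print(name)
    print("  location:", " > ".join(crag["breadcrumbs"][1:]))
    print("  routes:  ", crag["number of routes"])
    print("  styles:  ", ", ".join(f"{k} {v}" for k, v in crag["styles"].items()))
```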
posted by floey 3 years and 3 months ago
brendanheywood
Is this running from Quebec in Safari 534.34 by any chance and still running right now?

We've seen a massive spike in traffic hitting URLs like this:

http://www.thecrag.com/climbing/a/b/area...
http://www.thecrag.com/climbing/a/b/area...
http://www.thecrag.com/climbing/a/b/area...

This is quite odd, as that URL format, while it works, isn't something you'd ever find by following links; it works almost by accident.

If it is, please let me know so I don't have to keep chasing this traffic, and please ramp it down a *lot*. Ideally just scrape the page without any assets (i.e. don't load our Google Analytics beacon).
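For reference, a plain HTML fetch like the sketch below never loads JS, images or the analytics beacon, and a sleep between requests keeps the crawl well under normal traffic. The identifying User-Agent string is just a suggested courtesy, not anything thecrag requires.

```python
# Polite fetch sketch: HTML only (no assets, so no analytics beacon fires),
# throttled, with an identifying User-Agent (contact string is hypothetical).
import time
import requests

session = requests.Session()
session.headers["User-Agent"] = "DDG-longtail-scraper (contact: floey)"


def fetch(url, delay=5):
    time.sleep(delay)                    # at most one page every few seconds
    resp = session.get(url, timeout=30)  # plain GET of the HTML document only
    resp.raise_for_status()
    return resp.text
```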
posted by brendanheywood 3 years and 3 months ago
floey
That is quite odd. I haven't been running anything since I posted, my traffic was minimal, and it was confined to that single URL. I don't live in Quebec or have Safari installed either. I do hope you figure out what the problem is, though.
posted by floey 3 years and 3 months ago
jdorw
Great, thanks! All of the data that you want to show will have to go into a single "paragraph" field. This field has to be formatted the way you want it to show on the site. For now that can only be plain text and newlines.

Feel free to make a pull request at any time. It doesn't have to be completely finished, and it might be easier to go over the output file format on GitHub.
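As a rough illustration of that constraint, here's a sketch that flattens one scraped crag record (the structure from the sample JSON above) into a single plain-text paragraph with newlines only. The surrounding longtail output columns aren't shown; check the zeroclickinfo-longtail repo for the exact file format.

```python
# Sketch: build the plain-text "paragraph" field from one scraped crag record.
# Only the paragraph construction is shown; the full longtail row format is not.
def make_paragraph(name, crag):
    lines = [
        f"{name} ({' > '.join(crag['breadcrumbs'][1:])})",
        f"Routes: {crag['number of routes']}",
        "Styles: " + ", ".join(f"{k} {v}" for k, v in crag["styles"].items()),
    ]
    areas = [a["name"] for a in crag.get("areas", []) if a.get("routes", "0") != "0"]
    if areas:
        lines.append("Areas: " + ", ".join(areas))
    return "\n".join(lines)
```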
posted by jdorw (Staff) 3 years and 3 months ago
floey
Pull request has been made:
https://github.com/duckduckgo/zeroclicki...
posted by floey 3 years and 3 months ago
brendanheywood
Hey, I know this isn't the right forum, but I find it fairly difficult to follow what's new in these forums. Two small pieces of low-hanging fruit: first, the notification emails and the notifications list on the dashboard could link to the hash anchor of each new comment; and second, highlighting which comments are new since you last visited a page would make a massive difference. It's very easy to lose track and context in these deeply nested threads.
posted by brendanheywood 3 years and 4 months ago
jdorw
Agreed. Better notifications are in progress. I'll make an issue for the comment highlighting.
posted by jdorw (Staff) 3 years and 4 months ago