Saving/Downloading Threads as PDF, Markdown etc?

thresholdpeople · December 19, 2021, 4:58pm

I’ve been digging around for a good way to archive thread topics outside of the Bookmark feature.

The file print / save as PDF method isn’t really great as all the code blocks don’t wrap or extend, so stuff gets clipped.

Cruising around the Discourse developer forum, it seems like there’s a way to view the raw markdown, e.g. https://scsynth.org/raw/threadNumber, but this only shows one post at a time, with no author attribution, etc. Kind of clunky to iterate over.
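To make the "clunky to iterate over" part concrete, here is a rough sketch that only builds the list of raw URLs for a thread. It assumes Discourse also accepts a post number after the thread number (/raw/threadNumber/postNumber) and that post numbers start at 1; the thread number 1234 and the post count are made-up placeholders, and deleted posts may leave gaps in the numbering.

```python
from urllib.parse import urljoin


def raw_post_urls(base_url, topic_id, post_count):
    """Build one raw-markdown URL per post in a thread.

    Assumption: the forum serves /raw/<topic_id>/<post_number>,
    with post numbers running from 1 to post_count.
    """
    return [
        urljoin(base_url, f"/raw/{topic_id}/{n}")
        for n in range(1, post_count + 1)
    ]


# Hypothetical thread number and post count, for illustration only.
for url in raw_post_urls("https://scsynth.org", 1234, 3):
    print(url)
```

Each URL would still have to be fetched separately, and the author names are not part of the raw output, so this only addresses the iteration, not the attribution.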

Reading through the Discourse thread on how they set up printing/saving as PDF, it seems like getting info out of the platform isn’t a feature they care much about supporting, which is such a shame, but I digress. Just wondering if someone’s found a good workflow for archiving? My current workaround has been copy/pasting into text files, which is not ideal… the raw markdown seems a bit more promising, but also not great.

scztt · December 20, 2021, 10:11am

The best I can figure out is to append /print to a conversation URL (Saving/Downloading Threads as PDF, Markdown etc?). This gives you a relatively static, printable HTML page. You can print from there, or e.g. grab it via Pocket or some other kind of “archive-for-reading” app. I didn’t see bad problems with code formatting, but if you DO see formatting problems, the best course is probably to edit the CSS (either in the developer mode of a browser, or in the saved files).

bovil43810 · December 20, 2021, 12:29pm

This thread on the discourse meta forum might be a good starting point if you need to archive a bunch of threads (or all of scsynth.org) at once - some of the options mentioned are scraping using requests/BeautifulSoup, httrack and creating searchable WACZ files from wget archives. For single threads, I second @scztt’s suggestion of appending /print or ?_escaped_fragment_ to a thread’s URL and editing the CSS to adjust the formatting.
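For archiving more than a handful of threads, the two URL variants could be generated rather than typed by hand. This is a minimal sketch using only the Python standard library; the example thread URL is hypothetical, and it assumes the URL points at the thread itself, with no post number at the end.

```python
from urllib.parse import urlsplit, urlunsplit


def print_url(thread_url):
    """Return the thread URL with /print appended to its path."""
    parts = urlsplit(thread_url)
    path = parts.path.rstrip("/") + "/print"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def escaped_fragment_url(thread_url):
    """Return the thread URL with ?_escaped_fragment_ as its query string."""
    parts = urlsplit(thread_url)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, "_escaped_fragment_", "")
    )


# Hypothetical thread URL, for illustration only.
thread = "https://scsynth.org/t/example-topic/1234"
print(print_url(thread))
print(escaped_fragment_url(thread))
```

The resulting URLs can then be handed to whatever does the fetching (a browser, wget, or a requests/BeautifulSoup script).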

thresholdpeople · December 20, 2021, 3:31pm

Ah, I didn’t think to edit the CSS when using /print, and appending /print like that is easier than cmd + p. But any code block with long lines or more than a certain number of lines still gets cropped, for instance in Time-aware merging of two Event Pattern streams. I’ll mess with the CSS to see if there’s a relatively simple 2-3 step process.

I was reading about httrack, but only briefly, and didn’t really keep going when I saw something about Javascript needing to be deactivated. That Jupyter notebook/BeautifulSoup scraper you mention @bovil43810 seems promising.

bovil43810 · December 20, 2021, 3:51pm

Quick and ugly fix for the cropping issue:

pre code {
    white-space: pre-wrap;
    max-height: none;
}

thresholdpeople · December 20, 2021, 4:10pm

That one doesn’t seem to be working for me, or really I’m not sure exactly where to stick it.

I’ve also been running into this error while using the print method: I navigate to a URL, append /print, and the page that comes up is:

{"errors":["You’ve performed this action too many times, please try again later."]}
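If the pages are being fetched by a script, a small check like the one below could tell this error page apart from a real thread page before anything gets saved. It is only a sketch: the function name is made up, and it assumes the error always arrives as a JSON object with an "errors" list, as in the response above.

```python
import json


def is_error_response(body):
    """Return True if body is a JSON object with a non-empty "errors" list.

    A normal thread page is HTML, which fails to parse as JSON and is
    therefore treated as not being an error response.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("errors"))


sample = '{"errors":["You’ve performed this action too many times, please try again later."]}'
print(is_error_response(sample))
print(is_error_response("<html><body>thread content</body></html>"))
```

When the check is true, waiting a while before retrying is probably the safest reaction, since the message itself says to try again later.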

bovil43810 · December 20, 2021, 4:26pm

I was just messing around with a bit of CSS that makes the text inside of code tags wrap properly and removes the vertical scrollbars. What I did was:

1. Append ?_escaped_fragment_ to the thread URL (this circumvents the error message you’re getting with /print, I think)
2. Open developer tools within Chrome (hit F12)
3. Copy the following text: pre code { white-space: pre-wrap; max-height: none; }
4. Within the <head>...</head> tag, locate the last <style>...</style> tag
5. Right click → Duplicate element, replace the text inside by pasting
6. Hit Ctrl-P and save as PDF

If you don’t want to do this every time, you could probably use Stylish or some other extension that changes the CSS on the fly.
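For pages that have already been saved to disk, the same rule could be added with a short script instead of the developer tools. This is a rough sketch, not a tested workflow: inject_css is a made-up helper that puts a <style> block right before the closing </head> tag, and it falls back to prepending the block if the page has no </head>.

```python
import re

# The same rule as in the manual steps.
WRAP_CSS = "pre code { white-space: pre-wrap; max-height: none; }"


def inject_css(html, css=WRAP_CSS):
    """Insert a <style> block just before the first </head> tag.

    If no </head> is found, the block is placed at the very start
    of the document instead.
    """
    style = f"<style>{css}</style>"
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return style + html
    return html[:match.start()] + style + html[match.start():]


# Tiny stand-in for a saved thread page.
page = "<html><head><title>t</title></head><body><pre><code>x</code></pre></body></html>"
print(inject_css(page))
```

Putting the block at the end of the head means it comes after the page’s own styles, which is the same idea as duplicating the last <style> tag by hand.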

thresholdpeople · December 20, 2021, 4:37pm

I’ll give this a try. Thank you!

thresholdpeople · January 30, 2022, 7:08pm

Just wanted to say that Stylish and your recommended code is working great @bovil43810. Thanks again for your help!

It also seems that the latest version of Discourse now has a raw view for the entire thread, instead of just a single post: Export topic as markdown - #15 by Falco - feature - Discourse Meta

Would it be possible to update this site?