"Converting HTML to text" in Liu et al. (2023)

@liu2023evaluating

Companion repo for "Evaluating Verifiability in Generative Search Engines".

Converting HTML to text


To extract text from HTML pages, we first used [`single-filez`](https://github.com/gildas-lormeau/single-filez-cli) to download the cited webpages and their associated assets (e.g., CSS and images). Then, we used the [Chrome DOM Distiller](https://github.com/chromium/dom-distiller) to extract the "readable" portion of each page (the view that appears when you use "Reader Mode" in the Chrome browser). Finally, we used [Trafilatura](https://trafilatura.readthedocs.io/en/latest/) to extract the text from the DOM-distilled HTML.
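The final step (DOM-distilled HTML to plain text) might look roughly like the sketch below. This is an illustration, not the authors' script: the function names `distilled_html_to_text` and `fallback_text` are hypothetical, and the standard-library fallback is an assumption added only so the example still runs when Trafilatura is not installed. The download and DOM Distiller steps are not shown.

```python
from html.parser import HTMLParser

try:
    import trafilatura  # third-party: pip install trafilatura
except ImportError:
    trafilatura = None


class _TextCollector(HTMLParser):
    """Crude stdlib fallback: collect visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fallback_text(html: str) -> str:
    """Hypothetical helper: join visible text nodes with newlines."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def distilled_html_to_text(html: str) -> str:
    """Hypothetical helper: prefer Trafilatura, else use the crude fallback."""
    if trafilatura is not None:
        # trafilatura.extract returns None when it finds no usable main text.
        text = trafilatura.extract(html)
        if text:
            return text
    return fallback_text(html)
```

In use, `distilled_html_to_text` would be called on the HTML string produced by the DOM Distiller step; the fallback is much cruder than Trafilatura and is not a substitute for it.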

Converted to markdown with [euangoddard.github.io/clipboard2markdown/](https://euangoddard.github.io/clipboard2markdown/).