"Converting HTML to text" in Liu et al. (2023)
Companion repo for "Evaluating Verifiability in Generative Search Engines".
Converting HTML to text
To extract text from HTML pages, we first used [`single-filez`](https://github.com/gildas-lormeau/single-filez-cli) to download cited webpages and their associated assets (e.g., CSS and images). Then, we used the [Chrome DOM Distiller](https://github.com/chromium/dom-distiller) to extract the "readable" portion of the page (the view that appears when you use "Reader Mode" in the Chrome browser). Finally, we used [Trafilatura](https://trafilatura.readthedocs.io/en/latest/) to extract the text from the DOM-distilled HTML.
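The three stages above can be sketched as a small pipeline. This is an illustrative sketch, not the code used for the paper: `html_to_text`, `download`, and `distill` are hypothetical names, and the first two stages are passed in as callables because how `single-filez` and DOM Distiller were invoked is not described here. Only the last stage calls a real library function, `trafilatura.extract`, which returns the extracted text or `None`.

```python
from typing import Callable, Optional


def extract_with_trafilatura(distilled_html: str) -> Optional[str]:
    """Final stage: pull plain text out of already-distilled HTML."""
    # Imported lazily so the rest of the sketch runs without Trafilatura installed.
    import trafilatura

    # Returns the main text as a string, or None if nothing could be extracted.
    return trafilatura.extract(distilled_html)


def html_to_text(
    url: str,
    download: Callable[[str], str],
    distill: Callable[[str], str],
    extract: Callable[[str], Optional[str]] = extract_with_trafilatura,
) -> str:
    """Run the three stages in order (hypothetical wrapper).

    download: assumed to return the saved page HTML for a URL
              (e.g. by wrapping the single-filez CLI).
    distill:  assumed to return the "reader mode" HTML
              (e.g. by wrapping DOM Distiller).
    extract:  turns distilled HTML into plain text.
    """
    saved_html = download(url)
    readable_html = distill(saved_html)
    text = extract(readable_html)
    # Treat a failed extraction (None) as empty text.
    return (text or "").strip()
```

Passing the stages as arguments keeps the sketch independent of any particular way of calling the first two tools, and lets each stage be swapped or stubbed out when checking the pipeline on a single page.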