Yaogang Lian

Quiver Web Clipper

With the release of Quiver 3.2.5 for macOS, Quiver Web Clipper is now available on Chrome, Firefox, and Safari.

Quiver Web Clipper for Safari: install or update to Quiver 3.2.5 for macOS, then open Safari, go to Preferences > Extensions, and turn on “Quiver Web Clipper”.

Quiver Web Clipper Screenshot

Quiver Web Clipper was designed from the start to fulfill two purposes: a clutter-free Reader View and clipping to Quiver. It also supports a Markdown mode, which converts the simplified article to clean Markdown, as shown in the next screenshot.

Quiver Web Clipper Screenshot

You can clip a decluttered web page to Quiver as either a text note or a Markdown note. You can also clip just a selection by right-clicking a highlighted selection in the original web page and selecting “Clip to Quiver” from the context menu.

The Declutter Algorithm

The most interesting piece of Quiver Web Clipper is its declutter algorithm. This is the core algorithm that extracts the main content from a web page and removes clutter such as navigation, advertisements, and sidebars.

Such an algorithm has a variety of use cases:

  • Reader modes for browsers: Safari Reader, Firefox Reader, Mercury Reader…
  • Web clipping: Evernote Web Clipper, OneNote Web Clipper…
  • Eco-friendly printing: PrintFriendly
  • Text extraction from unstructured web pages in preparation for data mining or deep learning: DiffBot

Although there has been a lot of academic research on extracting text content from web pages, the first practical algorithm to see widespread use was Arc90’s Readability, released in 2010. Readability was later built into a platform and a suite of apps, which eventually shut down in 2016. But the revolution it brought about lives on: Safari Reader, first released in June 2010, was based on the original Readability algorithm; Firefox Reader, first released in 2015, was also based on it; Mercury Reader was created by the same people behind the original Arc90 Readability project; and pretty much every other Reader extension available today on various browsers is based on a variant of the original Readability algorithm.

The success of the original Readability algorithm was largely due to its introduction of a node ranking system based on heuristic rules involving metrics such as word count and link density. The node ranking algorithm is hyper-local: a paragraph node sends its full content score to its parent node, while the grandparent node gets only half. With such a steep decay (the score propagates upward only two levels), the algorithm keeps content scores close to what they represent: the text content itself.
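The hyper-local scoring idea can be sketched in a few lines. This is a minimal illustration over a simplified node model (plain objects rather than real DOM nodes); the scoring heuristic and names like `paragraphScore` are stand-ins for the idea, not Arc90’s exact implementation.

```javascript
// Heuristic paragraph score: a base point, plus one per comma, plus a
// capped bonus for raw text length (a crude proxy for "real prose").
function paragraphScore(text) {
  let score = 1;
  score += text.split(',').length - 1;
  score += Math.min(Math.floor(text.length / 100), 3);
  return score;
}

// Hyper-local propagation: each paragraph sends its full score to its
// parent and half to its grandparent; nothing travels further up.
function rankNodes(paragraphs) {
  const scores = new Map();
  for (const p of paragraphs) {
    const s = paragraphScore(p.text);
    if (p.parent) {
      scores.set(p.parent, (scores.get(p.parent) || 0) + s);
      if (p.parent.parent) {
        const gp = p.parent.parent;
        scores.set(gp, (scores.get(gp) || 0) + s / 2);
      }
    }
  }
  return scores;
}
```

With this decay, a container’s score is dominated by the paragraphs directly inside it, which is exactly what makes the ranking “hyper-local”.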

This hyper-local content score approach is a great idea, but it introduces a new problem: if blocks of text are separated by ads or other clutter, we can end up with several clusters of high content scores. Arc90’s version addresses this by checking the siblings of the top candidate. That works in some cases, but if the top candidate happens to be wrapped in an extra div, the method fails.
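The sibling check and its failure mode are easy to see in code. This is a hedged sketch using the same simplified node model as above; `gatherWithSiblings` is a hypothetical name, not Arc90’s actual function.

```javascript
// Arc90-style sibling check: keep the top candidate, plus any sibling
// whose score clears a threshold, hoping to recover split-up content.
function gatherWithSiblings(topCandidate, scores, threshold) {
  const parent = topCandidate.parent;
  if (!parent) return [topCandidate];
  return parent.children.filter(
    (n) => n === topCandidate || (scores.get(n) || 0) >= threshold
  );
}
// Failure mode: if the other content blocks are siblings of a wrapper
// <div> around the top candidate, rather than of the candidate itself,
// this walk never sees them -- it only inspects one level of the tree.
```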

Further developments of the original Readability algorithm largely took two directions:

  1. Site-specific parsers
    The now-defunct Readability platform, as well as its replacement, Mercury Reader, took this direction. I can only guess that those who worked on the Readability project, increasingly frustrated with the algorithm’s shortcomings, gave up on the idea of a generic declutter algorithm. Site-specific parsers are much easier to write and can be tailor-made to ensure perfect results for specific sites, but they have to be constantly maintained, since websites change all the time.
  2. Enhancing the core algorithm
    There have been many attempts at enhancing the core Readability algorithm. Firefox Reader ships an enhanced version of it, as does Safari Reader. luin/readability is another popular open-source variant. Among all these attempts, Safari Reader has made the most progress and works the best (as of May 2019).

However, none of these enhancements truly addresses the fundamental shortcoming of the original Readability algorithm: it makes strong assumptions about a web page’s tag structure. Firefox Reader and luin/readability make only small tweaks to the core algorithm, so they suffer from this shortcoming as well. Safari Reader, on the other hand, has extensively revised the node ranking algorithm and added many more heuristic rules based on visual layout, but these rules seem to be based largely on trial and error rather than a clear guiding principle.

For an algorithm with a large number of heuristic rules but no guiding principle, further development is difficult: any small code change or parameter tweak might improve results on some sites while breaking many others. A good test suite can only get you so far; at some point the algorithm becomes so complicated that no one can reason about its behavior.

So, what is the guiding principle behind a generic declutter algorithm?

At first glance, the problem of extracting text content from real-world web pages seems impossible to solve. After all, the Web is a wild place, and websites can be authored in an infinite number of ways. However, we are not targeting ALL web pages, only pages containing individual articles. This simplifies the problem tremendously. For article pages, there are three key insights:

  1. The main content usually consists of several paragraphs visually clustered together.
  2. Paragraphs within the main content are usually text-heavy compared to the rest of the page.
  3. Paragraphs within the main content usually have the same text style and visual layout.

With these guiding principles in mind, we can devise a better declutter algorithm as follows:

  1. Use simple metrics such as text length to find paragraphs.
  2. Inspect the text style and layout style of these paragraphs to find the predominant paragraph style. Paragraphs that match the predominant style are called “content paragraphs”.
  3. Find the smallest block enclosing all content paragraphs.
  4. Clean up the result and fix misused tags.
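The four steps above can be sketched over the same simplified node model. This is an illustration of the idea, not Quiver’s actual implementation: a real version would read computed CSS (font, width, position) for the style signature, whereas here a (tag, class) pair stands in for it, and the `MIN_TEXT_LENGTH` threshold is an assumed placeholder.

```javascript
const MIN_TEXT_LENGTH = 80; // assumed threshold; tune against a corpus

// Step 1: candidate paragraphs are nodes with enough text.
function findParagraphs(nodes) {
  return nodes.filter((n) => (n.text || '').length >= MIN_TEXT_LENGTH);
}

// Step 2: the most common style signature among the candidates.
const styleSig = (n) => `${n.tag}|${n.className || ''}`;
function predominantStyle(paragraphs) {
  const counts = new Map();
  for (const p of paragraphs) {
    counts.set(styleSig(p), (counts.get(styleSig(p)) || 0) + 1);
  }
  let best = null, bestCount = 0;
  for (const [sig, c] of counts) {
    if (c > bestCount) { best = sig; bestCount = c; }
  }
  return best;
}

// Step 3: smallest enclosing block = lowest common ancestor of the
// content paragraphs, found by comparing root-to-node paths.
function enclosingBlock(paragraphs) {
  const pathTo = (n) => {
    const path = [];
    for (let cur = n; cur; cur = cur.parent) path.push(cur);
    return path.reverse();
  };
  let common = pathTo(paragraphs[0]);
  for (const p of paragraphs.slice(1)) {
    const path = pathTo(p);
    let i = 0;
    while (i < common.length && i < path.length && common[i] === path[i]) i++;
    common = common.slice(0, i);
  }
  return common[common.length - 1];
}

function declutter(nodes) {
  const candidates = findParagraphs(nodes);
  const style = predominantStyle(candidates);
  const content = candidates.filter((p) => styleSig(p) === style);
  return enclosingBlock(content); // step 4 (cleanup) omitted in this sketch
}
```

Note that nothing here depends on the page’s tag structure: a sidebar paragraph with a different style is filtered out in step 2 no matter how it is nested, which is the point of style-based clustering over score propagation.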

This simple algorithm turns out to work significantly better than all the variants of the Readability algorithm. For example, you can try Quiver Web Clipper, Safari Reader, and Firefox Reader on the following web pages:

  1. https://www.quantamagazine.org/universal-math-solutions-in-dimensions-8-and-24-20190513/
    Safari Reader misses the second half of the main content.
    Firefox Reader works fine.
    Quiver Web Clipper works fine.
  2. https://edition.cnn.com/2019/02/15/politics/roger-stone-wikileaks/index.html
    Firefox Reader doesn’t display the Reader button.
    Safari Reader works fine.
    Quiver Web Clipper works fine.
  3. http://http2.github.io/http2-spec/
    Safari Reader misses most of the main content.
    Firefox Reader works fine.
    Quiver Web Clipper works fine.
  4. https://www.engadget.com/2018/04/18/buick-enspire-offroad-ev-concept/
    Firefox Reader misses the first paragraph.
    Safari Reader works fine.
    Quiver Web Clipper works fine.
  5. https://senken.co.jp/posts/fashion-occupation-annual-income
    Firefox Reader doesn’t show the Reader button.
    Safari Reader works fine.
    Quiver Web Clipper works fine.

What’s more important is that the simple declutter algorithm described above uses very few heuristic rules, while Safari Reader and Firefox Reader use many more. A simpler algorithm is easier to reason about, easier to develop further, and works on more websites.

Disclaimer: All testing was done in May 2019, so time travelers from the future might see different results.

Yaogang Lian

An iOS, Mac, and web developer focusing on building productivity and educational apps.