Scraping in Chrome, SPA Link Mode, Union Type to Input.radio

2023-06

Huidong Yang

So I've been trying this new mode where multiple projects are ongoing, and those projects are also more experimental in nature. Ultimately I would like to turn them into real products, because that's the actual point; I'm not doing research here.

In addition to exploring a different project strategy, this mode is also more flexible if external work comes in, as I just started looking into freelancing opportunities in June. I want to be able to pause my projects to focus on contracts, and I believe this "round-robin" style is more compatible with that. Sure, the "focus level" for each project will suffer, but the flip side is that I will be forced to prioritize interesting PoCs and essentially distinctive features over the cosmetic and ergonomic details. (Also, less time spent up front on refactor-able technicalities like naming things; I must learn to cut the obsessiveness and focus more on fluency and flow.) I do still enjoy the art part of app making, but at the same time, I want to learn more professional skills, e.g. design using proper tools (Inkscape, or Figma) instead of sketching on paper or doing it in my head. Not sure how to squeeze the extra fun in though.

Web Scraping in a Chrome Extension

In short, web scraping via the offscreen API works; in fact, it's one of the intended uses ("reasons"). But as I suspected, it must be done inside an iframe, and it turns out that most owners of major sites disable this mode of loading their sites, for reasons I can understand. Fortunately, a browser extension is a local piece of software that can be considered part of the browser itself (instead of a public site that, for instance, piggybacks on an existing site), and therefore we can modify the response from a site's server so that such restrictions are lifted. In particular, we use the declarativeNetRequest API to remove the response headers that prevent the browser from loading the page in a "sub frame".

There's the well-known X-Frame-Options: DENY, but browsers ignore that header whenever the response also carries a Content-Security-Policy with a frame-ancestors directive, which is where the restriction mostly lives now. It's this frame-ancestors directive, when set to 'none' (meaning no page may embed this one, so it can only load as the top frame), that makes browsers block the sub-frame loading. Ideally it'd be nice to just remove that single directive from the CSP value, but for now, removing the entire header does the job w/o noticeable side effects (a site that relies on X-Frame-Options alone would need that header removed too). Note in particular that the API allows you to specify the condition of the header removal such that the action is only taken when the page is loaded in a sub-frame (resourceTypes: ["sub_frame"]) and the request is initiated by the browser extension itself (initiatorDomains: [chrome.runtime.id]), so the normal browsing of those sites is not affected.
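A minimal sketch of what such a rule could look like. The local types only mirror the subset of the declarativeNetRequest rule shape used here; `buildFrameUnblockRule`, the rule ID, and the priority are hypothetical choices, not the extension's actual code. As far as I understand, modifying headers also requires the declarativeNetRequest permission plus host permissions for the target sites in the manifest.

```typescript
// Local types mirroring the subset of a declarativeNetRequest rule used here.
type HeaderRemoval = { header: string; operation: "remove" };

interface FrameUnblockRule {
  id: number;
  priority: number;
  action: { type: "modifyHeaders"; responseHeaders: HeaderRemoval[] };
  condition: { initiatorDomains: string[]; resourceTypes: string[] };
}

// Builds a rule that strips the given response headers, but only for
// sub-frame loads initiated by the extension itself.
function buildFrameUnblockRule(
  extensionId: string,
  ruleId: number = 1, // hypothetical rule ID
  headers: string[] = ["content-security-policy"],
): FrameUnblockRule {
  return {
    id: ruleId,
    priority: 1,
    action: {
      type: "modifyHeaders",
      responseHeaders: headers.map((header) => ({
        header,
        operation: "remove" as const,
      })),
    },
    condition: {
      initiatorDomains: [extensionId],
      resourceTypes: ["sub_frame"],
    },
  };
}

// In the extension's service worker, registration would look roughly like:
// chrome.declarativeNetRequest.updateDynamicRules({
//   removeRuleIds: [1],
//   addRules: [buildFrameUnblockRule(chrome.runtime.id)],
// });
```

Passing `["content-security-policy", "x-frame-options"]` as the third argument would cover sites that still rely on the older header.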

Now once you load the target page in an iframe in your offscreen document (of which Chrome currently only allows one running at a time, so extra code is needed to take care of race conditions), how do you get the content out? At first I thought the most straightforward approach would be to somehow query into the iframe directly within the offscreen doc. But it turns out that the framed page is cross-origin to the extension, so the same-origin policy blocks the offscreen doc from reaching into its DOM. Instead, a working solution is to dynamically inject a content script into the target page (via the scripting API, where you must set allFrames: true so that it's injected into sub-frames as well as the top frame), and there inside the content script, you can reliably extract the rendered HTML (document.documentElement.outerHTML). Then you simply send it back to the offscreen doc via messaging.
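One possible way to handle the race on the single offscreen document is to share one in-flight creation promise among all callers. The sketch below is an assumption about how that guard might look, not the extension's actual code: `makeOffscreenGuard`, `ensure`, and `markClosed` are hypothetical names, and the real creation call is injected as a function so the logic can be exercised outside Chrome.

```typescript
// Creates the (single) offscreen document at most once, even when several
// callers ask for it at the same time.
function makeOffscreenGuard(create: () => Promise<void>) {
  let inFlight: Promise<void> | null = null;

  return {
    // Returns the shared creation promise, starting creation if needed.
    ensure(): Promise<void> {
      if (inFlight === null) {
        inFlight = create().catch((err) => {
          inFlight = null; // allow a retry after a failed creation
          throw err;
        });
      }
      return inFlight;
    },
    // Call after the offscreen document has been closed.
    markClosed(): void {
      inFlight = null;
    },
  };
}

// In the service worker, `create` would wrap the real API, roughly:
// const guard = makeOffscreenGuard(() =>
//   chrome.offscreen.createDocument({
//     url: "offscreen.html", // hypothetical page name
//     reasons: ["DOM_SCRAPING"],
//     justification: "Scrape selected content from a page loaded in an iframe",
//   }),
// );
```

Note that this only prevents duplicate creation. If each scraping task gets its own document lifecycle, tasks arriving while one is running would still need to be queued.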

But wait, what we want from this extension is to scrape the content per specified CSS selector (via querySelector), not the entire document. Question is, where do you get this selector from within the content script where you do the DOM scraping? Well, this user-generated selector is persisted, so you can fetch it from IDB for example, but that's dumb because this info is already known at the time you initiate the scraping procedure. Can't you tell it via messaging? Well, the issue is timing. At the very beginning, we already know the selector, but can't send it to the content script yet, because the page takes time to load in the iframe, and the script injection only takes place when the document is fully loaded (RegisteredContentScript.runAt: document_idle, which is the preferred and default config). OK, then in our content script, we could send a message to the offscreen doc requesting the selector for the target page, but then how can the offscreen doc know, upon receiving the question, which selector to reply with? The content script only knows the URL via window.location.href, but by design, a URL can have multiple selectors that a user wants to monitor the content of. Yes, it could be doable by setting up some elaborate lookup table via some ID scheme, but that's missing the point, because we already know the answer, the only problem is how to find the right timing to use it.

Then I realized, perhaps the content script is just not the right place to do the selector querying. The fix is simple: let the content script just scrape the entire DOM content and send it all back. Since each lifecycle of our singleton offscreen doc corresponds to one unique scraping task (target URL & selector), we can save this selector as a variable within the offscreen doc when the task starts (i.e. upon receiving a message tagged "StartScraping", for example), and when it receives the second message, this time from the content script, e.g. tagged "DocScraped", with a payload that is the entire document HTML string, it can then do the DOM parsing and selector querying using the saved variable. This is, I think, a much more straightforward way to "synchronize" these different stages of a scraping procedure so that it behaves rather like one single procedure call (even though multiple messages are passed underneath), and thus no info needs to be re-fetched in the middle. Is this the best way? Probably not, but we'll see as we work further.
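The two-message flow described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the extension's actual code: `makeOffscreenHandler`, `QueryHtml`, and `onResult` are hypothetical names, and the HTML querying is passed in as a function so the logic can run outside a browser.

```typescript
type Message =
  | { tag: "StartScraping"; payload: { url: string; selector: string } }
  | { tag: "DocScraped"; payload: string };

// Extracts the content matching `selector` from an HTML string.
type QueryHtml = (html: string, selector: string) => string | null;

function makeOffscreenHandler(
  queryHtml: QueryHtml,
  onResult: (content: string | null) => void,
) {
  // Saved when the task starts, consumed when the scraped document arrives.
  let selector: string | null = null;

  return function handle(message: Message): void {
    switch (message.tag) {
      case "StartScraping":
        selector = message.payload.selector;
        // The offscreen doc would also point its iframe at message.payload.url here.
        break;
      case "DocScraped": {
        const saved = selector;
        if (saved === null) return; // no task in progress, ignore
        selector = null;
        onResult(queryHtml(message.payload, saved));
        break;
      }
    }
  };
}

// In the real offscreen document, queryHtml could be something like:
// (html, sel) =>
//   new DOMParser().parseFromString(html, "text/html")
//     .querySelector(sel)?.outerHTML ?? null
//
// And the content script side would send:
// chrome.runtime.sendMessage({
//   tag: "DocScraped",
//   payload: document.documentElement.outerHTML,
// });
```

Because the selector lives in the handler's closure, the content script never needs to know about it.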

So this shows that one of the benefits of working on this browser extension project is getting to really practice programming in the mental framework of message passing. It forces you to think about coordinating different actors (the offscreen doc, the content script, etc.) when some task can't be completed by a single entity. It's pretty fun. In comparison, message passing in Elm is much simpler, because there are only ever two actors: the Elm side and the JS side (hence two types of ports). In the Extensions API, we typically have to include this extra field in the message to specify the "target" (or "recipient" as I like to call it), in addition to the tag and payload.
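The reason for the extra field is that several extension contexts may be listening for runtime messages at once, so each one has to decide whether a message is meant for it. A small sketch of that filtering, where `Recipient`, `Envelope`, and `forRecipient` are hypothetical names chosen for illustration:

```typescript
type Recipient = "background" | "offscreen" | "content";

interface Envelope<T> {
  target: Recipient;
  tag: string;
  payload: T;
}

// Wraps a handler so that it only sees messages addressed to `self`.
// Returns true when the message was handled, false when it was skipped.
function forRecipient<T>(
  self: Recipient,
  handler: (tag: string, payload: T) => void,
) {
  return (message: Envelope<T>): boolean => {
    if (message.target !== self) return false; // someone else's message
    handler(message.tag, message.payload);
    return true;
  };
}
```

A listener built this way could be registered with chrome.runtime.onMessage.addListener in each context, with `self` set to that context's own name.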

Using Real Path in SPA

So I've been using fragment-based links for an Elm SPA, and that works perfectly fine. But some people might be weirded out by always seeing /#/ in the URLs. The solution I know of is quite straightforward: you configure your server to serve /index.html for requests to all paths (an internal fallback rather than an HTTP redirect, so the URL in the address bar stays intact). With Nginx, for example, you can use the try_files directive to set the root index file as a fallback.

location / {
    try_files $uri /index.html =404;
}

And in Elm, it turns out to be very easy to switch between the two "link modes", namely the fragment-based and the path-based, for specifying routes. You simply add a config variable to the model, which you use as a parameter in the Route.toLink function. Note that for Route.fromUrl, however, we don't even need to pass in the config, because we can tell which mode we're in by checking whether the URL contains a fragment (and whether the fragment starts with "/", but that's just my own convention; it doesn't have to be that way).

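import Url exposing (Url)
import Url.Parser exposing (parse)


fromUrl : Url -> Route
fromUrl url =
    let
        -- In fragment mode ("/#/..."), move the fragment into the path so that
        -- the same path parser handles both link modes.
        canonUrl =
            case url.fragment of
                Just frag ->
                    if String.startsWith "/" frag then
                        { url | path = frag, fragment = Nothing }

                    else
                        url

                Nothing ->
                    url
    in
    -- Anything the parser doesn't recognize falls back to the Unknown route.
    Maybe.withDefault Unknown (parse parser canonUrl)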

Now one more subtlety: using real path-based links doesn't make sense in the "Editable" mode (where you add data to the app in a local environment), because there is no real benefit in that, yet it does come with the cost of having to complicate your local server config; moreover, when using a dev server, usually such a fallback option isn't even available. So I decided that this link mode preference should only be taken into account in non-editable (i.e. presentation) mode, where the app is deployed along with data files to the cloud. The downside of this decision is that "Route.toLink" has to take one extra parameter (Model.editable).
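The decision logic for Route.toLink can be sketched as below, written in TypeScript rather than Elm purely to illustrate the rule. `LinkMode`, the `segments` parameter, and the exact link shapes are assumptions for this sketch, not the app's real code.

```typescript
type LinkMode = "fragment" | "path";

// `segments` is the route already rendered as path segments, e.g. ["works", "42"].
function toLink(segments: string[], preferred: LinkMode, editable: boolean): string {
  // Editable (local) mode always uses fragment links, whatever the preference.
  const mode: LinkMode = editable ? "fragment" : preferred;
  const path = "/" + segments.map((s) => encodeURIComponent(s)).join("/");
  return mode === "fragment" ? "/#" + path : path;
}
```

With this shape, the preference is stored once in the config, and the editable flag simply overrides it at link-building time.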

Essentially, I still think this is a gimmick, to create an illusion that each route is backed by a file, namely, it's an attempt to pretend that an SPA is a multi-page app. Why? Is it just because "/#/" is awful looking?

Maybe there are other considerations, like SEO and initial load time, but that's actually outside the standard scope of SPAs: there, people use a hybrid approach such that each path is indeed backed by a file, which loads fast, but each of those files then "rehydrates" back into the SPA. There's a term for that I guess, not sure, "isomorphic"? Never mind, it's an abuse of the word anyway.

Mapping a Union Type with Data to Input.radio

So the app essentially contains a list of different variants of text-based works, right? For instance, it could be a book, or a blog, etc. Now we naturally define a union type, and each variant has an associated record that contains the actual content.

Now when we're making the UI for creating a new "work", we need to first pick which variant it is, and the natural choice is of course Input.radio. However, conventionally, each option corresponds to a tag without any associated data, i.e., a plain enum if you will. But if that's the approach we are to take, then we need to create a second tag-only union type from the full type with data, and keep the two always in sync. Is that good? (And if we want to keep the two in the same module, then we need to come up with two different naming schemes in order to avoid name clashing).

So what's the alternative like? If we use the full union type directly in the Input.radio construction, will it look crazy?

Well, first of all, for the associated data, we need to pre-populate it with a default record, e.g. Book.default for the Book variant. That's totally fine.

However, the problem occurs in the next step, where you enter actual data for that specific variant, e.g. the title of the book. And as soon as you type anything for real, it will deviate from the default value (like "Untitled", or empty string), right? And when that happens, the radio button corresponding to that variant will no longer be selected, and none of the buttons will be! Because it has to be the exact value assigned to that button, which includes not only the variant tag, but also the default data associated with that tag. When you change that, the value no longer matches that of the button, so that button will be deselected.

I don't see a straightforward way to fix that directly, because the option per button is typically static. However, I find that a change of UI design can resolve this issue very well.

That is, as soon as you select the variant among the radio buttons, you hide the radio group immediately and show the data input fields corresponding to the selected variant instead. This way, as you enter data, you will no longer see the deselection of the button. Now you might ask, what if I want to change the selection? Easy: use a "Cancel" button. Once you click it, the entered data is discarded, the input fields are gone, and the radio buttons are shown again.

I think this is an interesting example where a different design of UI workflow helps resolve a difficulty in displaying complex data types in a component that's meant to work with simpler data. Since the alternative solution will entail a duplicated shadow type, I think this workaround from a different angle is a good tradeoff to make.