tooling to support people in making hands-on and open evaluations of search
Tomorrow I'm attending the [Union Square Ventures AI Roundtable #1: AI and Search](https://matthewmandel.com/2024/01/05/ai-roundtable-1-ai-and-search/). I'm looking forward to a dynamic conversation. I am also using it as a forcing function to write down something about what I'm really narrowing in on: developing tooling for user exploration and evaluation of search systems to support a strong new search ecosystem.
Building on my prior research, I am very focused on developing shared tooling and other resources to support people in making hands-on and open evaluations of search systems and responses (particularly to public interest search topics). We need this sort of tooling to better inform individual and shared search choices, including for refusing, resisting, repairing, and reimagining search practices and tools. Such tooling might surface distinctions and options and let subject matter experts, community members, and individuals develop (and perhaps share) their own evaluations.
I have been shifting my research statement to engage with this and looking at how to make it happen, whether in academia, with foundation support, in a company, or as something new. I am working on this so that we might be better able to advocate and design for the appropriate role and shape of search in our work, lives, and society.¹
There is a lot of related work on evaluating various types of systems (benchmarking, audits, complaints, etc.) to build on, but that work is not narrowly aimed at facilitating open evaluation of how new web search tools perform on public interest search topics, or at supporting effective voice and choice in search.
This project is intended to complement existing reporting, benchmarking, and auditing efforts while focusing on helping people develop their own sense of what different tools can, can't, and could possibly do.
This can be a framework and service that supports individual evaluations, collaborative evaluations, and requests-for-evaluations from peers, experts, and public-benefit search quality raters.
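To make that a bit more concrete, here is a rough sketch, entirely hypothetical, of what the core records in such a framework might look like. The names (SearchEvaluation, EvaluationRequest) and fields are my own illustration, not a settled design:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SearchEvaluation:
    """One person's evaluation of one system's response to one query."""
    query: str                    # e.g., a public interest search topic
    system: str                   # e.g., "Google", "Perplexity", "Marginalia"
    response_excerpt: str         # or a reference to a screenshot / archived page
    rating: int                   # coarse judgment, e.g., 1 (harmful) to 5 (excellent)
    notes: str = ""               # what was missing, wrong, or surprisingly good
    evaluator: str = "anonymous"  # individual, subject matter expert, or public-benefit rater
    created_at: datetime = field(default_factory=datetime.utcnow)

@dataclass
class EvaluationRequest:
    """A request for others to evaluate a query across one or more systems."""
    query: str
    systems: list[str]
    rationale: str                # why this topic matters to the requester
    responses: list[SearchEvaluation] = field(default_factory=list)
```

Even in this toy form, the point is that individual evaluations, collaborative evaluations, and requests-for-evaluations could all share the same simple, portable records.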
I imagine such tooling could be used by an agency or non-profit to issue public complaints and to refine its own content and search work. Or by individuals to decide which new tool to start using, or whether to continue refusing. Or by content creators to push for better attribution or shared funding models, or to develop their own systems. Or by RAG builders to demonstrate their improvements.
Searchers, publishers, journalists, SEOs, activists, and academics have long been making complaints about, and to, the dominant search system, and much of that is deflected and/or improvements are made that strengthen its position. We have a chance now to package our evaluations, both the good and the bad that we find in search results and responses, as a broadly shared resource that might advance search in multiple ways in the public interest.
Below I try to roughly connect some of the paths that led me here.
Background
Philosophy undergrad. Intelligence analyst in the US Army. Professional degree program: Master of Information Management and Systems at UC Berkeley, 2016. Continued into the PhD program in Information Science. Started focusing on search practices, perceptions, and platforms in 2017. Dissertation research examined the seeming success of workplace web searching by data engineers.
Earlier
This is rooted in:
- So much prior research, particularly: @introna2000shaping, @couvering2007relevance, @vaidhyanathan2011googlization, @mager2021future, @tripodi2018searching, @noble2018algorithms, @mager2018internet, @haider2019invisible, @ziewitz2019rethinking, @cifor2019feminist, @lurie2021searching_facctrec, @meisner2022labor, @narayanan2022google, @shah2022situating, and @haider2023google; not to mention the rich literature on search audits.
- Many conversations with Emma Lurie about public-benefit evaluations of search quality while we worked on our paper about the function of the Google Search Liaison (GSL): "Search quality complaints and imaginary repair: Control in articulations of Google Search"
- Note: The current function of the GSL is not that of an ombudsperson, though the person holding the role has done much to advance search in the public interest; see, for instance: “We don’t have relevancy ratings for search engines”
- And with that I've been thinking about how SEO knowledge and tooling are a largely underutilized public good. For example: SEO for social good?
Spring 2023
- The introduction of ChatGPT clearly helped many people see that search could be different. As it seemed there was an opportunity to influence the shape of search to come, I looked at carving out a role for myself in industry and started exploring generative web search systems.
- I've been reflecting on the course I taught at Michigan State University last spring on Understanding Change in Web Search and what my students taught me. I've been thinking particularly about how we implicitly and explicitly make search quality evaluations and how we might do well to share more of these and solicit feedback from others as we strive to develop our search practices and identify what we want from search. (Coming out of my dissertation research (and following the lead of @haider2019invisible) I believe it is desperately important that we talk more about search.)
Summer & Fall 2023
- I started developing some early thoughts around SearchRights and how we might help users explore, evaluate, and demand more from search and support full search choice (while there is this opportunity for disruption).
- I started thinking about tooling to support evaluation and exploration: The Need for ChainForge-like Tools in Evaluating Generative Web Search Platforms
- I sat with this line from Dave Guarino: "We really need to talk more about monitoring search quality for public interest topics."
- I attended the Task Focused IR in the Era of Generative AI Workshop at Microsoft. (Here is the workshop report.) This was energizing and built up my confidence.
- I wrote Aggressively Imagining Funding Models for Generative Web Search, asking "How might we pay for generative AI in web search?" and "Is there a new design space for search?"
- I spent time practicing building incomplete Tampermonkey userscripts (speedserper), browser extensions (Search History Search), copy for speculative services (Complainquiry), and Python scripts (Similaring, a script to run Metaphor (now Exa.ai) searches; qChecker). See also: Projects. (A rough sketch of a Similaring-style script follows below.)
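For a flavor of that tinkering, here is a hypothetical sketch of what a Similaring-style script might look like: it runs a couple of queries (borrowed from the examples later in this post) against Exa (formerly Metaphor) and saves the raw responses for later comparison. The exa_py client usage and the output file are assumptions for illustration; this is not the actual script.

```python
# A rough, hypothetical sketch in the spirit of Similaring: run the same
# queries against Exa (formerly Metaphor) and keep the raw responses around
# so that evaluations can be revisited and shared later. Assumes the exa_py
# client (pip install exa_py) and an API key in the EXA_API_KEY environment
# variable; exact response attributes may differ, so the raw response is
# stored as a string rather than picked apart.
import json
import os

from exa_py import Exa

exa = Exa(os.environ["EXA_API_KEY"])

queries = [
    "how much should i pay for home insurance in moss beach ca",
    "why is pfix going down today?",
]

for query in queries:
    response = exa.search(query, num_results=5)
    record = {"query": query, "raw_response": str(response)}
    with open("exa_results.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    print(query, "->", str(response)[:200])
```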
December 2023
- I really appreciated this line from the creator of Marginalia Search, Viktor Lofgren: “It would be really helpful to have other people dabble in search without having to build an entire search engine from scratch.”
- Started organizing thoughts as I prepare a March lecture on something like search failure and a workshop on something like search repair for Noopur Raval's "Systems and Infrastructures" class.
- I shared thoughts about “benchmarking” democratization of good search, riffing on a HuggingFace slogan, and then asked: What organizations are best situated to apply their own resources and promote developer attention and energy to this problem and opportunity?
- A recent examination of search quality from Neeva's Vivek Raghunathan ([how much should i pay for home insurance in moss beach ca]; see also [why is pfix going down today?]) prompted me to post a Christmas wish ("someone stepping up to fund independent user-focused testing of these tools") and then I started thinking seriously about what that would take.
- Then interactions with Bruce Yu (about his evaluation of generative web search tools) and Owen Colegrove (about his open source AgentSearch) made my thinking a bit more concrete.
January 2024
- This last month I started writing up some thoughts about companies posting screenshots of search responses, to think more about what might be possible, and I've started sharing those thoughts with others.
Footnotes
1. See @hendry2008conceptual [p. 277].
2. While it is not comparing search results or responses, LMSYS Org now has 'online' models in its Chatbot Arena; see the Jan 29 announcement. The current 'online' models are from Perplexity AI and Google's Bard.