
# [mCaptcha][Open Tech Fund] Proposal

### Project Title*

mCaptcha

### What is your name?*

Aravinth Manivannan

### What email address should we use to contact you?*

realaravinth@batsense.net

### How much funding do you estimate you will need? (In US Dollars)

26970

### Please include a full description of your project, including the problem statement, your approach, and the main beneficiaries.

mCaptcha is a privacy-focused CAPTCHA system that uses a Proof of Work algorithm to rate limit visitors.

Existing CAPTCHA systems rely on tracking methods like cookies, IP logging, browser fingerprinting, and other privacy-invasive technologies to record and analyze visitor behavior. The problem with this approach is that these systems blindly deny access to all visitors without recorded history. So people who use hardened browsers, Tor, VPNs, and other anonymizing techniques get denied access by default. Not only are these systems inaccurate under those circumstances, but they also deny information to those who are vulnerable and probably the most desperate for it.

Furthermore, existing CAPTCHA validation algorithms are not transparent. They can be used to deny access to information for various reasons, and there are no means for the public or even a governing body to verify their bias. A rogue nation, an authoritarian dictator, or even the company selling these CAPTCHA systems can use them to prevent certain groups of people from accessing certain types of websites on the premise that the visitors resemble bots.

The internet is the easiest way to access information in this era. Guarding such critical systems with opaque, black-box, closed-source technologies jeopardizes the ability of the most vulnerable to access life-saving information.
The mCaptcha validation algorithms are well-documented and based on strong cryptography. Moreover, mCaptcha is Free Software, so the decisions of the validation algorithm can be audited independently by anyone. Proof of Work ensures that visitors spend time computing work. Adaptive Proof of Work can detect spikes in traffic and increase the difficulty factor so that each request takes a little bit longer than usual.

But mCaptcha is an experimental technology. If misconfigured, Proof of Work can make websites inaccessible to visitors on older, slower devices, which can spend a long time computing PoW. There are other open questions: can mCaptcha scale? How will it handle attacks on its server?

FOSS users have no choice but to rely on privacy-invasive CAPTCHAs like reCAPTCHA and hCaptcha, as there are no alternatives. These CAPTCHAs also have terrible accessibility, as pointed out by the W3C. Some FOSS projects like Gitea, a popular Git-based software forge, have already implemented mCaptcha support. I hope more projects take notice and move to privacy-focused tech like mCaptcha.

Also, the usage of privacy-focused software, in general, is on the rise. Plausible.io, a privacy-focused Google Analytics alternative, recently celebrated a new milestone in their business SaaS offering. I hope a similar change occurs on the CAPTCHA scene and the internet adopts mCaptcha.

Stateful systems like reCAPTCHA/hCaptcha, even with privacy-improving extensions like PrivacyPass, pose grave privacy risks to vulnerable visitors. Only systems that don't use any historical state can guarantee the privacy of their visitors. Such systems should also be effective in guarding against bot abuse. mCaptcha's PoW algorithm is a good candidate for this problem since all users are treated the same way without any preferences: no tracking and no fingerprinting. Even if mCaptcha is legally forced to hand over tracking information, it won't have any compromising data to hand over.
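To make the adaptive Proof of Work idea described above concrete, here is a minimal, generic hashcash-style sketch. It is not mCaptcha's actual algorithm or API: the names `solve`, `verify`, and `pick_difficulty`, the SHA-256 target construction, and the traffic thresholds are all assumptions made for illustration.

```python
import hashlib
from itertools import count


def _hash_as_int(challenge: str, nonce: int) -> int:
    """SHA-256 of challenge + nonce, read as a 256-bit integer."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def solve(challenge: str, difficulty: int) -> int:
    """Client side: try nonces until the hash falls below the target.

    With target = 2**256 // difficulty, the expected number of attempts
    is roughly `difficulty`, so a higher difficulty costs more CPU time.
    """
    target = (1 << 256) // difficulty
    for nonce in count():
        if _hash_as_int(challenge, nonce) < target:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the submitted work."""
    return _hash_as_int(challenge, nonce) < (1 << 256) // difficulty


def pick_difficulty(recent_requests: int, levels: list[tuple[int, int]]) -> int:
    """Adaptive part (hypothetical): `levels` holds (traffic threshold,
    difficulty) pairs in ascending order; the highest threshold reached wins."""
    chosen = levels[0][1]
    for threshold, difficulty in levels:
        if recent_requests >= threshold:
            chosen = difficulty
    return chosen
```

In this sketch a site under normal load would hand out a low difficulty, and a traffic spike would move every visitor, regardless of history, to a higher one. No per-visitor state is needed for either step.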
The FOSS aspect allows for it to be completely decentralized.

### Please describe your project's objectives and related activities, and resulting deliverables if applicable.

**Objective 1:** We plan on measuring Proof of Work accessibility on various devices through a public survey. The survey software is partially ready, but currently it only measures PoW benchmarks for the WebAssembly (WASM) library. mCaptcha also ships a JavaScript polyfill to support older browsers. Support for the polyfill needs to be implemented. Also, the results of the survey must be published in an open, freely accessible format.

* **Activity 1.1:** Implement support for the JavaScript polyfill library in the survey software.
* **Activity 1.2:** Collect and process the results in real-time. Periodically compute percentile scores of the benchmarks.
* **Activity 1.3:** Periodically publish results, along with percentile scores and recommendations for maximum accessibility, for use by webmasters.

Estimated time: 2 months

**Objective 2:** We plan on implementing horizontal scaling to support large, popular websites. mCaptcha's current implementation scales horizontally to support many websites on the same server with its actor model implementation, but it isn't capable of supporting large, popular websites.

* **Activity 2.1:** Create a distributed cache that implements a leaky bucket algorithm. The implementation should be consistent, fault-tolerant, and performant. It should be capable of supporting large websites' workloads.
* **Activity 2.2:** Simulate failure scenarios like multiple node failures and verify the system's ability to recover and maintain the correct state.
* **Activity 2.3:** Create a multi-node DDoS attack simulation for benchmarking. This requires Infrastructure as Code to deploy both mCaptcha and the attacking nodes. The Infrastructure as Code created in this activity will also have the side effect of enabling easy deployment by third parties.
* **Activity 2.4:** Benchmark and compare performance against the semi-distributed, Redis-based cache implementation that is currently used. The distributed cache must perform better than the current implementation for workloads representing popular, internet-scale websites.

Estimated time: 7 months

**Objective 3:** We plan on creating full system integration tests that cover the whole configuration matrix. mCaptcha offers various deployment combinations to cater to a diverse set of needs. It is designed to meet the needs of both self-hosters and large enterprises, and it does so through a varied set of configuration options. mCaptcha is currently a single-person project. I believe adding full system integration tests will significantly improve the maintainability of the project and make it sustainable in the long run.

* **Activity 3.1:** Create Selenium-based integration testing with the following configuration options:
    1. PoW options:
        * WASM library
        * JavaScript polyfill library
    2. Database options:
        * Postgres
        * MariaDB
    3. Cache options:
        * Embedded
        * Redis-based (uses [mCaptcha/cache](https://github.com/mCaptcha/cache), a custom Redis module)
        * Custom distributed cache
    4. On browsers:
        * Firefox
        * Chromium

Estimated time: 3 months

### How long will it take? (In Months)

12 months

### Please elaborate on the technical feasibility of this effort.

Objective 2 (distributed cache) and Objective 3 (integration tests) rely on well-established technologies with very little room to innovate, but Objective 1 (Proof of Work accessibility) is tricky. A motivated attacker can submit false data in large quantities to corrupt the survey and manipulate mCaptcha into providing weak difficulty factor recommendations. The solution has to account for the following constraints:

* We have no means to verify the benchmarks received since we can't emulate the participant's device.
* We also can't use peer-validation through homogeneous redundancy, [like the BOINC project does](https://boinc.berkeley.edu/trac/wiki/HomogeneousRedundancy), since benchmarks will run in the browser and not natively, so we can't verify device characteristics (browser APIs can be spoofed, and browsers like [librewolf](https://librewolf.net/) do so to evade fingerprinting).

#### Solution

If the benchmarking module is loaded into a production mCaptcha instance, and the time taken to solve CAPTCHAs is continuously measured with averages and percentile scores continuously computed, we could avoid benchmark data corruption. The only way an attacker can then corrupt this data is by solving the majority of the CAPTCHAs served by an mCaptcha instance.

The benchmarking software can be converted to accept and store benchmark statistics from multiple, independently run mCaptcha instances to provide an overview of the ecosystem.

### Please describe your usability/UX and accessibility practice for developing this tool.

Dedicated accessibility audits are yet to be conducted, but some areas of improvement are already identified. This proposal will help solve the identified problems.

Proof of Work, being CPU-bound, can provide subpar UX on older devices. This concern was repeatedly raised during the Codeberg pitch[0] and in various discussions on the Fediverse and the project chatroom.

Objective 1 (Proof of Work accessibility) will provide continuous feedback on the system's accessibility and make automated recommendations to webmasters who will integrate mCaptcha into their web services. It does so by aggregating performance metrics from all deployed mCaptcha instances[1] and recommending Proof of Work difficulty factors that will work for the majority of devices. It will recommend difficulty factors for various percentile slabs, and the webmaster will be given the choice to support the slabs that work best for them.
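The percentile slabs described above could be derived along the following lines. This is only a sketch under stated assumptions, not the survey software's implementation: it assumes each benchmark sample is a device's measured hash rate (hashes per second), that solving a challenge of a given difficulty factor takes roughly that many hashes, and that the webmaster chooses a time budget. The names `percentile` and `recommend` are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile (0 <= p <= 100) of an ascending list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def recommend(hash_rates, budget_seconds, coverages=(50, 90, 99)):
    """Map each coverage slab (percentage of devices to support) to the
    largest difficulty factor that slab can solve within the time budget."""
    rates = sorted(hash_rates)
    recommendations = {}
    for coverage in coverages:
        # Supporting `coverage`% of devices means sizing the work for the
        # device sitting at the (100 - coverage)th percentile of hash rate.
        slowest_supported = percentile(rates, 100 - coverage)
        recommendations[coverage] = int(slowest_supported * budget_seconds)
    return recommendations
```

For example, with 100 sampled devices whose hash rates range evenly from 1,000 to 100,000 hashes per second and a two-second budget, `recommend` suggests a much lower difficulty factor for the 99% slab than for the 50% slab, which is the trade-off a webmaster would be choosing between.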
There were other requests from the community. For instance, there was a request to provide visual feedback on the Proof of Work generation progress. Currently, there is no progress bar, so it is impossible to tell if the CAPTCHA is making any progress or has become unresponsive.

Also, there have been discussions in the official chatrooms to build an invisible version of mCaptcha. The invisible version will listen to form submission events and automatically generate Proof of Work, without any user interaction. This, we believe, will greatly improve the accessibility of mCaptcha.

[0]: Codeberg is replacing its CAPTCHA system with mCaptcha for better accessibility. I pitched mCaptcha in a public meeting organized by them.

[1]: Sharing statistics will be optional and can be controlled through configuration, but it will be highly encouraged.

### Please describe the scope of this project’s alternative analysis. If the project has been audited before, please briefly describe the scope and outcome of those assessments, providing links to reports if available.

#### DoS mCaptcha with invalid PoW

[An independent security researcher identified a critical vulnerability in the software that allows an attacker to perform Denial of Service attacks on mCaptcha instances by sending invalid Proofs of Work for verification](https://github.com/mCaptcha/mCaptcha/issues/37). This vulnerability was mitigated by implementing IP-based queued scheduling for PoW validation.

Scheduling PoW validations based on IP address means that validations from different IP addresses run in rotation. Multiple validation requests from the same IP address are queued and executed when that IP address is next scheduled. This way, validation requests from IP addresses that send too many of them will only be executed freely, i.e., without being penalized through delays, when there are no requests from other IP addresses, which is highly unlikely even for small deployments. This is different from IP rate limits, as IP rate limits hand out blanket bans.
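The rotation described above can be sketched as a round-robin scheduler over per-IP queues. This is an illustrative model only, not the code that ships in mCaptcha; the class name `RoundRobinValidator` and its methods are hypothetical.

```python
from collections import OrderedDict, deque


class RoundRobinValidator:
    """Queues validation jobs per IP and serves the IPs in rotation."""

    def __init__(self):
        # Maps ip -> deque of pending jobs; insertion order is the rotation order.
        self._queues = OrderedDict()

    def submit(self, ip, job):
        """Queue a validation job; a new IP joins the back of the rotation."""
        self._queues.setdefault(ip, deque()).append(job)

    def next_job(self):
        """Take one job from the IP at the front of the rotation, then move
        that IP to the back if it still has jobs pending."""
        if not self._queues:
            return None
        ip, queue = next(iter(self._queues.items()))
        job = queue.popleft()
        del self._queues[ip]
        if queue:
            self._queues[ip] = queue  # re-inserted at the back
        return ip, job
```

If one address submits five jobs while two others submit one each, the other two are served after the first job of the busy address rather than after all five. Nothing is dropped: the busy address's remaining jobs still run, just later.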
A queued execution model will eventually execute the validation job, which I think will offer better usability on Tor and VPNs than IP rate-limiting. But this mechanism can be abused to affect the validation of requests from IP addresses on Tor. If a critical web service is protected by mCaptcha and an oppressive regime wants to disallow access to that website, they can do so by sending requests through Tor, out of all exit nodes.

This solution is currently implemented within mCaptcha. It is less ideal than having a magical non-DoS-able verification endpoint. I will continue to investigate better mechanisms that work within the constraints stated above, but for now, this should work.

### Are there other efforts similar to what you are proposing? Does your work build on their work? What makes your approach different?

- [Friendly captcha](https://friendlycaptcha.com/): PoW-based CAPTCHA but non-free. A libre, self-hostable version is available but is stripped down and doesn't have critical features like adaptive difficulty scaling (which improves UX). Little to no documentation on how the CAPTCHA system works.
- [pow-captcha](https://git.sequentialread.com/forest/pow-captcha): PoW-based CAPTCHA, libre implementation. Doesn't support variable difficulty scaling. Uses Scrypt as opposed to SHA256, which makes it unsuitable for large-scale deployments as Scrypt is more resource-intensive. Integration libraries and client libraries don't exist for popular stacks yet.

mCaptcha is fully libre and aims to be usable by even the most popular web services.

### Please describe your approach to monitoring and evaluating the outcomes and impact of this effort.

mCaptcha doesn't have any direct, solid metrics, but there are measurable side effects:

1. **Proof of Work accessibility survey:** The survey software will keep count of the number of participants, visitors, and independently run mCaptcha instances. Reporting by independent instances is optional.
So while this isn't a fully reliable source, it will provide valuable information on the project's growth.
2. **Growth of the mCaptcha SaaS offering:** I plan on launching a commercial SaaS offering using mCaptcha. The commercial operation will use 100% FOSS software (billing, Infrastructure as Code, etc. will be distributed under FOSS licenses). Its growth will be publicly documented.
3. **Integration support in third-party FOSS software:** Gitea currently supports mCaptcha, in addition to its custom image-based CAPTCHA, reCAPTCHA, and hCaptcha. Lerntools, a FOSS educational technology project, [is considering implementing mCaptcha support](https://codeberg.org/lerntools/base/issues/146#issuecomment-583011). Third-party support in FOSS software projects is another way of measuring mCaptcha's success.

### What's your long term strategy for this project beyond OTF support?

I hope to bootstrap mCaptcha by seeking funding from OTF, NLnet, and other similar organizations. In the long run, I hope to create a complete DDoS protection suite that can provide functionality similar to Cloudflare.

Reliance on centralized black boxes for spam and bot protection poses grave risks to internet privacy, but security is essential. These services are so widespread that the browser sandboxing mechanisms don't apply to them. They read all cookies and have unrestricted access to all their customer websites' browser states. I believe creating a 100% FOSS protection suite will decentralize the security scene of the internet: it will enable many operators to run commercial operations using Free and transparent software.

To fund further development, I plan on setting up a 100% FOSS SaaS offering with mCaptcha.

### Why are you, and your team members, the right people to work on this project?

> You can describe your previous work, link to relevant past projects, or explain other forms of knowledge of the subject matter.
Aside from mCaptcha, I actively work on [ForgeFlux](https://forgeflux.org), a project that is involved in software forge federation; [Gna! (previously Hostea)](https://gna.org), a project trying to offer 100% libre, managed Gitea hosting; and [Librepages](https://librepages.org), which aims to build a 100% FOSS global CDN for JAMStack hosting.

I'm also a strong believer in Free (as in freedom) Software, so all my work is exclusively licensed under Free Software licenses. Internet privacy is not accessible to everyone, but I'm in a privileged position to self-host services, which are available for anyone to use free of charge at [batsense.net](https://batsense.net/services).

[Codeberg](https://codeberg.org), based on [Gitea](https://gitea.io), is the first popular service that [will adopt mCaptcha to solve accessibility issues and replace their current CAPTCHA](https://codeberg.org/Codeberg/Community/issues/479#issuecomment-581973). This first production deployment is a huge opportunity for mCaptcha and will provide proof of its accessibility claims.

I'm an active participant in the software forge federation ecosystem (projects like forgefed.org, forgefriends.org, gna.org and forgeflux.org). Gitea is a popular software forge in this circle, but Gitea sysadmins deal with spam on their instances regularly. I personally administer two Gitea instances, and my involvement with the software forge federation projects gives me a unique opportunity to help with spam protection. With Gitea getting mCaptcha support, I hope to get continuous feedback on mCaptcha.

### Please provide a budget narrative explaining the budget required to execute your proposed project.

The budget pays for my time and for the minimal cloud resources needed for testing and benchmarking in Objective 2.

### Please upload any supporting documents to your application.
No attachments