> PyPy core dev here. PyPy will always remain free and open source. No worries there. The question we and many other open source projects are trying to deal with is how to fund our developers under that constraint. We felt that it was time to explore other alternatives.
I've had great success using pypy in production, but found that it was sometimes tricky to figure out how to fully leverage the performance wins on the table, especially if the regular way you might do a thing in cpython depended on a C extension that was either unsupported on pypy, or supported but slow. I could imagine real demand for people with the specialized expertise to do this for commercial customers, so I wonder if the plan here is for pypy to be funded by this kind of consulting, and maybe to exist under an entity that can make that happen (i.e., one that's allowed to do for-profit consulting work)?
So I have no direct knowledge of this case, but this seems to be an issue over fiscal sponsorship. Like a lot of smaller and younger open source projects, PyPy had delegated financial and accounting responsibility to the Software Freedom Conservancy, which acts as its fiscal sponsor. SFC is a registered non-profit with professional accountants; it takes donations on behalf of lots of projects, including PyPy, and distributes them to each project at the request of the project's leaders. This is a common model in open source; similar organizations that do this include Software in the Public Interest, NumFOCUS, and the Open Source Collective.
One point of friction is that fiscal sponsors take a cut of all revenue/donations raised by the project in exchange for their services, which is 10% for SFC [1]. Fiscal sponsors also have policies around what they will reimburse, which can become as stringent as corporate travel policies [2]. I can totally understand a project getting frustrated with its fiscal sponsor and wanting either to strike out on its own and do it all itself, or to find another sponsor.
The admin fees you refer to are typically assessed in the context of large, institutional grants. These are grants from private organisations like the Gordon and Betty Moore Foundation, the Alfred P. Sloan Foundation, the Chan Zuckerberg Initiative, or government agencies like the US National Science Foundation. These grants have significant financial accounting requirements. There are also many other legal or operational costs associated with these grants.
When projects run these grants through parent institutions like universities, the typical admin fee is >40%. In some cases, it can be as high as 60%. Many projects are eager to enter fiscal sponsorship agreements with organisations like NumFOCUS, because so much more of their grant money goes to funding the work.
In the case of NumFOCUS, admin fees do not adequately cover the staff requirements to manage these grants. NumFOCUS "loses money" when servicing the administrative needs of these grants & it takes this responsibility onto itself solely for the betterment of the projects.
Rather than assess administrative fees similar to universities, NumFOCUS uses its other fundraising—corporate donations, event (i.e., PyData) sponsorship, individual giving—to finance its operations.
source:
- I serve on the NumFOCUS board of directors as its co-chair.
- I presented on this topic last year at the NumFOCUS Annual Summit to an audience of core developers from projects like Julia, Jupyter, Pandas, NumPy, AstroPy, &c.
- NumFOCUS budgets are public, and all of the above information can be corroborated from materials published on https://numfocus.org/
I did. Inherited a legacy web app that did stupid things in Python in memory (basically search and aggregation).
I realized a rewrite was the best course of action, but in the meantime the old thing had to stay up and running, and as the volume of data increased, it started to run into HTTP timeouts, as requests often took longer than 2 minutes.
I moved the thing to PyPy, and got about a 30% speedup from that. Only one lib had to be replaced with a pure python alternative, as it was using a C extension.
It bought me enough time to finish the new implementation (duplicate the data in Elasticsearch, hey presto from over a minute to about a second to get results).
I parse big XML and similarly structured files, convert them into RDF, puff them up into a (still RDF but with a lot of blank nodes) hypergraph so I can load the content into a single database and be able to trace that these two facts are related and come from this part of document A and that part of document B.
I have document parsing and SPARQL queries that can take a few minutes that I'd like to run frequently so I can keep all parts of the system up to date.
I've only benchmarked it a bit, but I found I got approximately the five times speed-up that PyPy promised. This is with PyPy based on Python 3.6. I think PyPy is switching to cffi as the way to connect to C code so most native code "just works" now.
I had to backport my code from Python 3.8; Python 3.6 lacks contextvars, but there is a polyfill for that, otherwise there was no problem.
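For the unfamiliar, this is the 3.7+ stdlib feature in question; a minimal sketch (the variable and function names here are made up for illustration):

```python
import contextvars

# ContextVar behaves like thread-local storage, but is also
# async-task-aware; backport packages exist for Python 3.6.
request_id = contextvars.ContextVar("request_id", default="none")

def handler():
    # Reads the value bound in the current context.
    return request_id.get()

token = request_id.set("req-42")
print(handler())  # req-42
request_id.reset(token)
print(handler())  # none
```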
I stayed away from PyPy for a long time because it was tied to Python 3.5, which was busted in various ways. One of those was that filesystem path objects were only half-implemented: you should have been able to pass them into anything in the stdlib that expected a string path, but at the time you couldn't. Little accidents like that can slow down the adoption of a technology like PyPy.
> I think PyPy is switching to cffi as the way to connect to C code so most native code "just works" now.
As far as I know extensions need to be written for cffi specifically.
cffi is a newer way of writing C extensions, developed by the PyPy project. It was designed with a smaller, cleaner interface for calling C code from Python.
Here's Armin Rigo talking about it at EuroPython: https://www.youtube.com/watch?v=ejUzVcvTLgI
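To give a feel for how small that interface is, here's an ABI-mode cffi sketch calling libc's atoi (this assumes a POSIX platform, where `ffi.dlopen(None)` loads the C standard library):

```python
from cffi import FFI

ffi = FFI()
# Declare the signature of the C function we want to call.
ffi.cdef("int atoi(const char *s);")
# dlopen(None) gives a handle to libc on Linux/macOS.
lib = ffi.dlopen(None)

print(lib.atoi(b"42"))  # 42
```

No PyObject structs, no reference counting: you declare a C signature and call it.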
The CPython way of writing extensions is documented here:
https://docs.python.org/3/extending/extending.html
It seems to require you to deal with the internals of the CPython interpreter (PyObject structs, reference counting, etc.).
I know PyPy has some support for CPython extensions, but it has to emulate some internals and it's slower as a result.
Don't remember the details of the legacy app, but I don't recall seeing that. I think it just used dicts for the data and stored that in a blist.sortedlist https://pypi.org/project/blist/
For algorithmic code PyPy can provide substantial speedups over CPython. I've used PyPy in code fingerprinting large bioinformatics files and seen big speedups. I've also tried porting a webapp processing JSON from CPython and seen no perceptible speedup.
It's been a long time since I looked, but when I profiled my code, not much time was spent parsing/serializing JSON. Most of the time was spent manipulating dicts/lists in Python, which CPython is already pretty good at, since the whole language seems to basically be implemented in terms of dicts. I don't think PyPy has the hidden-class optimizations of JS engines, which are able to find speedups in these kinds of cases.
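For concreteness, this is the kind of dict-heavy aggregation I mean (a made-up sketch, not the original app's code):

```python
import random
import timeit

random.seed(0)
# 100k rows, each a plain dict -- typical webapp-style data.
rows = [{"group": random.randrange(100), "value": random.random()}
        for _ in range(100_000)]

def aggregate(rows):
    # Group-by-and-sum done purely with Python dicts: the sort of
    # workload CPython already handles reasonably well.
    totals = {}
    for row in rows:
        totals[row["group"]] = totals.get(row["group"], 0.0) + row["value"]
    return totals

print(timeit.timeit(lambda: aggregate(rows), number=10))
```

Every `row["group"]` lookup is a real dict hash probe in both interpreters, so there's less headroom for a JIT than in code with stable object shapes.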
For algorithmic/numerical code, especially if you have to deal with numpy-related data, Numba has a much lower barrier to entry, plus you stay on cpython while speeding up computation-intensive code by a few orders of magnitude.
Looks like numba has cffi support now so it would be an option. If I can dig out the code (it was about 5 years ago) I'll probably try adapting it to numba to benchmark it against pypy.
I wrote a daily-used utility (probably still in use) that made good use of PyPy. It was pretty slow, and after some quick profiling I found that type-check functions (in PyMySQL) were being called a LOT. Literally changing the runtime from python3 to pypy gave something like an 8x overall speedup.
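A sketch of how that kind of hotspot shows up under the stdlib profiler (`check_type` here is a hypothetical stand-in, not PyMySQL's actual internals):

```python
import cProfile
import io
import pstats

def check_type(x):
    # Stand-in for the per-value type checks that dominated the profile.
    return isinstance(x, (int, float, str))

def workload():
    # One cheap check per value, called a LOT of times.
    return sum(1 for i in range(200_000) if check_type(i))

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("ncalls").print_stats(5)
print(out.getvalue())  # check_type dominates by call count
```

Per-call overhead like this is exactly what a JIT amortizes away, which is consistent with a large speedup from just switching runtimes.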
We have a grid compute infrastructure for a specialized runtime environment with business-logic rules for scheduling priorities and partitioning of the compute cluster.
The control plane was implemented in Python and Twisted (event driven I/O framework for the unfamiliar), which was fit for purpose at the original scale running CPython (few thousand compute nodes).
As the number of compute nodes scaled up, we developed hotspots in ser/des of control messages, which ultimately started to affect overall cluster efficiency.
Switching to PyPy gave us an immediate substantial performance boost without really having to redo any code at all (just some FFI stuff that was probably wrongly implemented in the first place).
Eventually we realized we were going to out-scale even that (at the hundreds-of-thousands of compute node level) and ended up with a Scala/Akka reimplementation, but moving to PyPy from CPython got us a lot of free breathing room.
It is quite similar to the Arabic word بوس (Baus). I did some wiki searching and found it originates from the Latin word beso [1]. Languages are so awesome!
They likely mean that there will always be a project called "PyPy" that will be free software, but they will move to an open core model and introduce "PyPyPro", where all the cutting-edge development will take place; PyPy itself may or may not be crippled.
I think funding of ambitious open-source projects such as PyPy will continue to be a problem as long as these projects use pushover (a.k.a. permissive) licenses such as the MIT license. It's time for these developers to take a stand for what is fair, by relicensing to a copyleft license such as Parity [1] and selling proprietary licenses to companies that can and should pay.
It might be free software or open source (i.e. in spirit), but I don't have the patience to read potentially-dubious licenses (that's what the FSF and OSI are for!)
Though I think the Parity License is a clearer license than FSF-endorsed copyleft licenses such as the GPL, my point still stands if you substitute one of those licenses.
I'm far from an expert on this, so other commenters please correct me if I'm wrong:
The GPL requires that any project that uses a GPL-licensed project as a part of it must also be free and open source (if it's distributed).
It looks like Parity requires that you pay a licensing fee to the original creator of the Parity licensed project if you're using it for non-open-source reasons. So, it can be used for private/for-profit projects, but you have to pay for it in that case, whereas open source projects can use the code for free.
PyPy will remain a free and open source project, but the community's structure and organizational underpinnings will be changing and the PyPy community will be exploring options outside of the charitable realm for its next phase of growth ("charitable" in the legal sense -- PyPy will remain a community project).
In other words, volunteers can contribute but others get to monetize?
People like you helping people like us help ourselves - Processed World.
The idea of competent volunteers contributing is wonderful as a hypothetical, but if such things existed in reasonable numbers then this funding issue would likely be moot in the first place.
The reality is that the PyPy folk (all single-digit-number of them) have fought tooth and nail to keep the project going for well over a decade. I can't begin to imagine how much highly skilled labour has been poured in by such a small concentration of people, all for little more than praise and repute on a handful of IT forums.
In essence, these projects live and die by funding. Donations just aren't enough to pay the bills for full-time developers, and there isn't any real alternative.
I wish there was more corporate giving to foundations that could handle this sort of thing but we never built that culture in software unfortunately.
I don't think it's fair to frame this negatively at all; it really misses the nuance of these situations.
I do think he captured it quite well -- that they are leaving it as a community project but only directly monetizing it for some people feels, well, wrong? It might be more neutral if they allocate funds for bounties and let anyone claim them, with the core developers obviously being able to address most bounties the fastest.
Alternatively, the FOSS generation could pay for their tools instead of expecting free beer everywhere; then such projects wouldn't need these kinds of gymnastics.
There are a few people who basically run PyPy development. They can do as they please. It's open source, so if you're so against it, you can make a "nobody profits" fork. Most outside contributions to open source projects are made by people who wanted to scratch an itch and then let the existing maintainers maintain that improvement. Their reward is the great software. This is still there so long as PyPy commits to remaining freely available.
To add some substance: I used to have PyPy commit access. I also have contributed almost no code to PyPy. This isn't for lack of wanting; my project has produced several interesting RPython modules which could plausibly be shared with other folks. It's because PyPy's core contributors, the dozen or so post-academic compiler engineers, are incredibly prolific and skilled compared to the rest of the contributor base. They outproduce me. Compare: one person implemented PyPy's massive-subset-of-Python typechecker, one person produced Nuitka's broken typechecker, and a small community team produced MyPy's conservative typechecker. The PyPy version is by far the best, including translation to C, a JIT generator, and allowing nearly any sort of codegen to a high-level GC'd Java-like data model.
The tragedy is that the Python ecosystem broadly doesn't use PyPy and doesn't contribute much to it, neither code nor cash. Our compiler engineers are just as good as the folks working on CPython (and there's some overlap), but don't enjoy the powerful deep-pocketed corporate support.
I'm surprised that they didn't mention numba.jit, which solves the same basic problem as pypy (faster numpy calculations) but in a different way that is easier to mix with existing python frameworks.
For example, TensorFlow with numba preprocessing is easy, just install both packages and it'll work. TensorFlow with pypy requires a 5 hour compile and 40 GB of temporary storage. Plus some source code fiddling inside TF, if I remember correctly.
Even as an open source project, pypy should honestly consider who they're competing with for users and funding.
PyPy is targeted at general purpose workloads, NumPy support is totally an afterthought, so the basic problem it set out to solve was/is definitely not faster numpy calculations. Numba on the other hand is way more targeted at numeric workloads.
Besides, I don't see why they should mention any other project in a post announcing their departure from the Conservancy. The only surprising thing is no mention of the funding model they're moving to, other than a rather vague hint, "exploring options outside of the charitable realm".
Based on your comment, I would guess that you never tried out numba. Of course, it can also do general python and loop optimizations. And in my experience, numba worked for every case where I couldn't get pypy to work.
And I stand by my opinion that that is something that the pypy developers should consider: is this actually usable as a solution to practical problems? Or is there something else that people use instead? If so, why? Analyzing your competition is usually a good way to learn about your own strengths and weaknesses.
> Based on your comment, I would guess that you never tried out numba.
Well, you guessed wrong.
> it can also do general python and loop optimizations.
Yes, it can be used in general purpose workloads, with varying degrees of success. But its main purpose is made abundantly clear:
Accelerate Python Functions
Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled *numerical algorithms* in Python can approach the speeds of C or FORTRAN.
> ... Analyzing your competition is usually a good way to learn about your own strengths and weaknesses.
Except this is an announcement on their funding situation, so strengths and weaknesses are completely irrelevant, unless Numba has a particularly interesting funding model. (The funding model is government grants and corporate sponsorship, so, not particularly interesting.)
I always liken PyPy to HotSpot, in that to this day the numerical performance of the latter isn't spectacular and nobody really cares: it's built to handle the harder job of making vast tangled codebases of non-numerical application code run fast, not just tight math loops, which are already handled perfectly well by other, more specialized tools.
I don't think pypy is a numba.jit "competitor". Most people I know who use it use it for pure Python things, typically web servers, and nothing to do with machine learning or data science.
> @intgr the wind-down with the SFC hasn't been smooth and this is the politically-neutral, agreed-by-both-parties post. PyPy remains the same free and open-source project. Essentially we just switched to a different money-handler. We're announcing it in the next blog post.