I think saving your entire workspace is a bad idea too, sorry!

sandGorgon · on March 30, 2016

Could you talk about why? Setting aside the convenience factor (and the fact that R already does it), what is the objection?

Does it stem from a fundamental aspect of the data format? For example, can you save two data frames to the same file?

Because if you can save two, why not two hundred?

peatmoss · on March 30, 2016

It thwarts reproducibility. Saving your workspace drags a lot of state from session to session that isn't accounted for. If you share code with someone else, their workspace won't be the same, and thus the code may not function the same.

sandGorgon · on March 30, 2016

Point taken. But we are again veering dangerously close to thou-shall-not. From my perspective, it is a quick and convenient way to save all the data frames in my code. It's a boon for productivity.

If not this, then I pray for Feather to be able to save multiple data frames in one file.

peatmoss · on March 30, 2016

I don't see it as a thou-shalt-not. As a file format, Feather is lightweight. If they turned it into a container format it would be expanding the scope. If they instrumented it to comb objects in the global namespace and serialize them to the new container format, it would be heavier still--all to support a feature that the authors view as an anti-pattern. That's less a thou-shalt-not than it is a prioritization of their own vision.

If you're looking for a container to store lots of tabular data in one file, I'd suggest SQLite. Using dplyr, you can save those dataframes very easily. Plus, you can join tables and perform efficient aggregations on datasets too large to keep in memory.

In a lot of ways, I don't understand what limitations prevent SQLite from becoming the de facto common data.frame format. There probably are some, I just don't understand the tradeoffs (especially given how much SQLite gives you for free)!
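A rough sketch of the "one SQLite file, many tables" idea from this comment. It uses Python's standard-library `sqlite3` module rather than dplyr, and the table names, columns, and values are all made up for illustration, so treat it as an assumption-laden example rather than a recommended workflow:

```python
import os
import sqlite3
import tempfile

# Hypothetical example tables; names and values are invented for illustration.
customers = [(1, "Ada"), (2, "Grace")]
orders = [(1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5)]

# One file on disk holds every table.
db_path = os.path.join(tempfile.mkdtemp(), "frames.sqlite")

con = sqlite3.connect(db_path)
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
con.executemany("INSERT INTO customers VALUES (?, ?)", customers)
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
con.commit()
con.close()

# A later session reopens the same file and lets SQLite do the join
# and aggregation instead of loading both tables into memory first.
con = sqlite3.connect(db_path)
totals = con.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
con.close()

print(totals)
```

Adding a third or a two-hundredth table is just another `CREATE TABLE` in the same file, which is the container behaviour being asked of Feather upthread.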

sandGorgon · on March 30, 2016

Actually this is interesting - why Feather vs. SQLite? I would love to know the answer!

But coming back to the anti-pattern: obviously the authors have the power to not spend time on something. But I'm trying to figure out why it's an anti-pattern in general. Snapshotting execution state is probably the ideal goal, but saving intermediate data structures is a decent convenience feature.

Now if that's restricted by the limitations of the format itself (no multiple frames in a single file), then we are back to thinking that HDF5 or SQLite may indeed be the better format.

hadley · on March 30, 2016

Basically because you should be encoding state in code, not data. If you store data between sessions, it's easy to lose the code that you used to create it, and then later on you can't recreate it.

It is convenient to save your complete workspace but I've seen too many cases where it's contributed to lack of reproducibility to spend my time working on it.

sandGorgon · on March 30, 2016

So there's a use case difference. I create models from remote data sources - this is incremental on a daily basis and takes quite a bit of time.

So I snapshot the workspace after I do a run and do some experiments. Now - for me, saving the workspace is a convenience feature, NOT a programming feature.

This is what I mean by thou-shall-not. My use case is very well defined and I'm not stupid. And I completely understand the pitfalls you talk about - but a philosophical opposition is what hurts me (and lots of devs like me).

hadley · on March 31, 2016

I hope I didn't come across as "thou shalt not" - it's just never going to be high on my priority list.

(And even for your use case I would think you'd be better off keeping the models in a list and saving that. Then other random stuff in your env won't get carried along for the ride.)
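The suggestion above can be sketched roughly as follows. In R this would presumably be a list written out with something like `saveRDS()`; the version here is a Python analogue using the standard-library `pickle` module, and the "models" are hypothetical placeholder dicts rather than real fitted objects:

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for fitted models, keyed by run date.
# Real model objects would go here; these values are invented.
models = {
    "2016-03-30": {"intercept": 1.5, "slope": 0.25},
    "2016-03-31": {"intercept": 1.4, "slope": 0.30},
}

# Unrelated scratch state from the session. It is deliberately not saved.
scratch = list(range(1000))

path = os.path.join(tempfile.mkdtemp(), "models.pkl")

# Persist only the collection of models, not the whole environment.
with open(path, "wb") as f:
    pickle.dump(models, f)

# A later session restores exactly what was saved and nothing else.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(sorted(restored))
```

Only the named collection crosses the session boundary, so what gets restored is explicit in the code that saved it.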

sandGorgon · on March 31, 2016

Oh no you did not! That was polite musing. Thank you for the reply - I still hope you change your mind. Because people do have genuine, but different needs ;)