
Consistent? What do you mean, "consistent"? Sometimes it's comma separated, sometimes it's semicolon separated (depending on the user's locale), sometimes it's separated by tabs (because it's a _C_SV file, yeah, no biggie), and there's no content encoding hint (Unicode? Latin-1? Win-1251? Win-1252? Nobody knows). Not to mention that you've written this comment under an article that shows just about the least consistent behavior ever. (Line breaks? Ahahahaha!)

The only consistent thing about CSV is its ubiquity; other than that, it's a hairy, inconsistent mess that appears simple. (Source: having parsed millions of blobs that all identified themselves as CSV, despite being almost completely different in structure.)
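To make the encoding point concrete, here is a minimal sketch of the usual fallback chain. The function name `decode_guess` and the candidate order are my own assumptions for illustration, not any standard; it only shows that the reader has to guess, because the file itself carries no hint.

```python
def decode_guess(raw, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate list is an assumption: UTF-8 (with or without BOM) first,
    then Windows-1252, then Latin-1, which accepts every byte value.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next one
    raise ValueError("no candidate encoding could decode the input")


if __name__ == "__main__":
    text, encoding = decode_guess(b"caf\xe9;2,50\r\n")
    print(encoding, repr(text))
```

Note that "decodes without error" is not the same as "correct": a Windows-1251 file will usually also decode as Windows-1252, just into the wrong characters.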



> Sometimes it's comma separated, sometimes it's semicolon separated (depending on the user's locale), sometimes it's separated by tabs

CSV is comma separated. [1]

Valid YAML

    foo: bar baz

Invalid YAML

    foo: "bar" baz

Valid YAML

    foo: "bar baz"

Invalid YAML

    foo: "bar baz

Valid YAML

    foo: bar baz"

[1] https://tools.ietf.org/html/rfc4180


You would think so, but people are dumb. I've seen tab-delimited files named .csv instead of .tsv, and I've also seen the semicolon delimiter a few times, though I can't recall where. I think Excel actually pops up a prompt when importing to confirm the delimiter in some cases?

From your link, it's quite clear that you should not assume any particular CSV file to follow any particular rules.

> Interoperability considerations:
> Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files. An attempt at a common definition can be found in Section 2....

> Published specification:
> While numerous private specifications exist for various programs and systems, there is no single "master" specification for this format. An attempt at a common definition can be found in Section 2.

Section 2 states:

> This section documents the format that seems to be followed by most implementations:
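For the "be conservative in what you do" half, here is a rough sketch of emitting the Section 2 style with Python's standard `csv` module: comma delimiter, CRLF line ends, double quotes around fields that contain commas, quotes, or line breaks, and embedded quotes doubled. The helper name `to_rfc4180` is made up for this example, and this is only my reading of Section 2, not a conformance tool.

```python
import csv
import io


def to_rfc4180(rows):
    """Serialize rows of strings in the style RFC 4180 Section 2 describes."""
    buf = io.StringIO(newline="")  # keep the writer's CRLF untouched
    writer = csv.writer(
        buf,
        delimiter=",",
        quotechar='"',
        doublequote=True,          # an embedded " is written as ""
        lineterminator="\r\n",
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    )
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(repr(to_rfc4180([["a", "b,c"], ['say "hi"', "x\ny"]])))
```

Whether anyone on the receiving end parses it that way is, of course, a separate question.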


"All theory, dear friend, is gray, but the golden tree of life springs ever green." -Goethe

If CSV were indeed always comma-separated, my hair would be at least 5% less gray. Alas, most programs emit semicolon-separated "CSV" in some locales (MS Office, LibreOffice, you-name-it-they-got-it).
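In practice that means guessing the delimiter before parsing. A minimal sketch of one heuristic follows: try each candidate and keep the one that splits every sampled row into the same number of fields (more than one). The names `guess_delimiter` and `read_rows` and the candidate list are assumptions for illustration; real-world files will defeat this often enough.

```python
import csv
import io

CANDIDATES = (",", ";", "\t")  # assumed candidate set, extend as needed


def guess_delimiter(sample, candidates=CANDIDATES):
    """Return the candidate giving a consistent field count > 1, else None."""
    best, best_width = None, 1
    for delim in candidates:
        reader = csv.reader(io.StringIO(sample, newline=""), delimiter=delim)
        widths = {len(row) for row in reader if row}  # skip blank lines
        if len(widths) == 1:
            width = widths.pop()
            if width > best_width:  # prefer the split with more fields
                best, best_width = delim, width
    return best


def read_rows(text):
    """Parse text with the guessed delimiter, falling back to a comma."""
    delim = guess_delimiter(text) or ","
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delim))


if __name__ == "__main__":
    # Semicolon-separated export with decimal commas, as some locales emit.
    print(read_rows("price;qty\r\n1,5;2\r\n"))
```

The decimal-comma row is the interesting case: splitting on commas gives an inconsistent field count, so the semicolon wins.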

Of course, I understand that your academic position "if it chokes the RFC-compliant parser, it's not a True CSV and should be sent to /dev/null" tautologically exists - but for some reason, users tend to object to such treatment (especially when they have no useful tools that would emit your One True Format for them).

TL;DR: there is no single standard fitting all the things that call themselves "CSV".


You seem like the perfect person to ask: what is a format that is close to the (apparent) simplicity of CSV, but is actually consistent?


I am so sorry.

In other words, as soon as you start exchanging data, you'll get something that is complex, broken, or (the most common case) both. The existence of a simple, consistent general format has not been conclusively proven impossible, but I have yet to see one in practice.

(Of course, everybody and their dog have cooked up simple data schemes, yes, but those are a) domain-specific, and b) not in widespread use.)



