Monthly Archives: February 2015

What is JSON and why do we see it in some databases?

I recently team-taught an all-day course at my company on “Introduction to SQL and Relational Databases.” It was a great experience and I had a lot of fun — feedback so far has been good :).

We were talking about all sorts of SQL basics, like:

1. DML vs. DDL vs. DCL vs. Other (where Other are things like vendor-specific commands to start and stop the database)
2. Datatypes in SQL
3. WHERE clause and permutations thereof
4. Introduction to Joins, basic INNER, LEFT OUTER, RIGHT OUTER, etc.
5. I gave a quick demonstration of what “normalization” means and how it can be useful to avoid problems like data inconsistency and duplication. I did not actually teach how to normalize and model data, however, because that is a whole involved other discussion that we could talk about for days on its own.

Some questions came up that were fairly involved that I could not answer during the course. one of those questions was “What is JSON and why do we/I see it in some databases?”

JSON is a method to represent data in a text file. It is a kind of data serialization . Data serialization is a tool when you are working in a system and want to export data and send it to another system to import. The “old-school” method of data serialization is “flat text” files, like comma-separated or tab-delimited files. CSV or tab-delimited files can be useful but are problematic because if you don’t know what the fields mean in them beforehand and they also don’t allow any representation of hierarchical relationships in the file.

So then people dreamed up the idea of XML, which actually has tags in the file to represent the data and also allows hierarchies to be represented.It also has a bunch of other advantages, like allowing definitions of schemas to check variable types. Simple XML example:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>

((From here: https://msdn.microsoft.com/en-us/library/ms762271%28v=vs.85%29.aspx))

So by “hierarchy” I mean how in this example we could have authors in books in a catalog. You can’t really do this accurately in this way in a flat-text file.

XML has problems though, like verbosity and complexity, as well as taking up a lot of space in a file to describe a given amount of data vs. comma-delimited or tab-delimited files. In the meantime, the IT community had been working on other data serialization formats, and came up with JSON.

I have worked with XML a fair amount but have not used JSON much, so take my comments on JSON with a grain of salt. JSON is a method of serialization that is really easy to use with Javascript. It is very low-effort for developers using Javascript and web applications to import and export data in a JSON format. However, as it was not originally intended to be a permanent data storage format (like an RDBMS is intended to be a permanent data storage format) it does not model relationships between data entities in the way that you can model them in an RDBMS and does not really provide all of the same tools for integrity checking of data, like primary and foreign keys, etc.

Now — why does JSON sometimes show up in a database when it is a data serialization format and generally intended for sending between systems?

Because it is so easy to import and export from JavaScript, and now some other programming languages, some folks have taken the route of storing their data natively in a database in JSON format. This is one kind of NoSQL database. To simplify, it basically stores JSON as one giant string in a sort-of table. There are a couple of schools of thought on this:

1. That anything that makes development faster and more efficient is useful and we should move in that direction (the school of thought that puts XML or JSON right into a database.)
2. That the most important thing is the quality and consistency of the data, and while developer effort is important to consider, it is not such a bad thing to make developers work harder up-front in their applications to force quality and consistency of data upon insertion or modification of data in the database rather than have to deal with data issues later. So just use JSON or XML for transmissions between systems, but not as permanent data storage formats. Model and store data that should be relational in a relational form, hopefully appropriately normalized. 1

I fall into camp #2 (angry DBA stereotype, right?) but there are lively debates going on all over the place about this kind of thing.

A comparison of data serialization formats here.

EDIT: 2/23/2015 1:17 PM PST

I was thinking a bit more about what bothers me about JSON in a database. And after reading this excellent article by Fabian Pascal about Hadoop vs. RDBMSes. To quote:

“…The core idea behind the database concept is native application neutrality–a general purpose data resource with multiple “uses” (applications) and data functions centralized in a DBMS…. ”

We don’t want to make the database cater to the application. The application (and everything else) more-or-less needs to cater to the database. The data stored in the DB is the important thing — it should be stored in a neutral format and not a format that is specific to a particular application nor particular language like Javascript. There may be 10 different applications reading and writing to the same set of data in a database, and we need a set of semantics and organization to ensure that to the maximum extent possible all of them are forced to adhere at the database level to a consistent standard and set of practices. Trusting in applications or application developers to do this is always a mistake because someone will always be lazy or forgetful or whatever. This article by Dan Pritchett admits this right off of the bat:

“…This requires moving data constraints out of the database and into the application…”

That requires you to trust the application developers/coders to truly understand data consistency, rules, etc., and 90% of the time I would not trust in this matter.