Tuesday, December 12, 2006

Title is NULL

Gary Myers has just written about NULL, making the great point that NULL does not mean "unknown". He cites his middle name as proof of this. His parents never gave him one. Hence, a NULL for his middle name does not represent an unknown value because Gary knows his middle name does not exist. But how can we tell that from a SELECT on the CUSTOMER_NAMES table? This is why NULL is such a slippery customer: it is not just the absence of value, it is the absence of meaning too. A NULL in a column is fundamentally uninterpretable.

For instance, a NULL in the column MIDDLE_NAME might represent non-existence, someone like Gary who doesn't have a middle name. Alternatively it might mean inapplicability, because the entity we are recording is an organisation for which the concept of middle names is nonsensical. It might indicate data withheld because the person has an embarrassing middle name and declined to tell us what it is. It might portend cultural incomprehension, when the respondee literally does not understand the concept of "middle name". It might simply be a case of slack input: the operator forgot to ask or forgot to enter the value.

Hugh Darwen, in his UKOUG 2005 presentation, said that this aspect of NULL really bothered Ted Codd and that in his later life Codd devised a scheme for representing all these different flavours of NULL. Apparently Codd had identified twelve or thirteen distinct "meanings" of NULL. A project that surely invites an index finger twirling by one's temple.

The fact that NULL is not just an absence of value but an absence of meaning is one that often escapes people. Once on a project I had the task of editing a whole slew of Forms modules, setting every mandatory Y/N flag to be optional and changing the default value to be NULL instead of N. For the customer it was quite obvious that the default value for the question "Have we received the letter?" should be NULL because the answer is unknowable until the operator actually gets to that page and sets a value. So for them the meaning of NULL was "This question has not been answered". Unfortunately, in practice NULL could also mean "The operator doesn't know whether we have received the letter". Or "The operator skipped this question". Or "The customer has not sent us a letter." Or ... well, you get the drift.

Unfortunately the solution is not as simple as adopting Fabian Pascal's policy of no NULLs in the database. Because it is not just enough to slap a NOT NULL constraint on every column: that path leads to magic values and they are worse than NULLs. We need to rejig our data models to properly handle those legitimate situations for which no value exists. Which ultimately means having a MIDDLE_NAMES table, consisting of keys and a solitary MIDDLE_NAME column. But we also need better joining syntax to handle all those SELECT statements attempting to retrieve data from the MIDDLE_NAMES table. Outer joins are no good: they just re-introduce null values into the result set, so we are no better off. Modern SQL databases - not just Oracle - are not capable of representing all the different nuances of NULL.

Given the current state of SQL I think we have no choice but to live with NULLs. We just need to remember that when we find a NULL in our data it is not just that we don't have a value in this instance; we also don't know why we don't have a value in this instance.

Further reading


'NULL: Nothing to worry about' by Lex de Haan and Jonathan Gennick
'How to Handle Missing Information without Using NULL' by Hugh Darwen

9 comments:

Pete Scott said...

Perhaps we need :
Select DISTINCT NULL...

:-)

Anonymous said...

With 13 different NULLs, my IS NULL now needs to be

col IS UNKNOWN_NULL or
col IS NEVERGOTIT_NULL or
col IS DOESNTEXIST_NULL

That would wreak havoc on some trigger code.

:-)

Carlos said...

I can't understand why so much attention to NULL values. When I have to set different options like 'not known', 'undefined', etc, for an answer in a form, well I just put the options for the user to select and thats it. NULL is just NULL. ;)

Tonguç said...

Carlos I believe the attention is because when using NULL in database there are several things to be more carefull. Allowing NULL values into your columns introduces a whole new degree of uncertainty into your database. The primary reason behind is comparing NULL to a value will result neither TRUE or FALSE. Or for example;

create table tab_null_test ( x int, y int );
create unique index unique_tab_null_test on tab_null_test(x,y);
insert into tab_null_test values ( 1, 1 );
insert into tab_null_test values ( 1, NULL );
insert into tab_null_test values ( NULL, 1 );
insert into tab_null_test values ( NULL, NULL );
analyze index unique_tab_null_test validate structure;
select name, lf_rows from index_stats;
NAME LF_ROWS
------------------------------ ----------
UNIQUE_TAB_NULL_TEST 3

The table has four rows, whereas the index only has three. The first three rows, where at least one of the index key elements was not Null, are in the index. The last row, with (NULL, NULL) is not in the index.

insert into tab_null_test values ( NULL, NULL );
1 row created.

In Oracle, (NULL, NULL) is in deed <> (NULL, NULL). That is something to consider: each unique constraint should have at least one NOT NULL column in order to be truly unique.

NULLS can also affect the outcome when you use a descending sort. By default, NULLS are greater than any other value.

Oracle implements NULL values as ANSI SQL requires them to be implemented, but databases are different, also experience in one will in part carry over to another.

Thank you for your article Andrew,
Best regards.

Robert Vollman said...

To me, NULL generally means "unknown."

Gary's middle name is not NULL. It is not unknown. It's known. It's nothing. NULL and nothing are two entirely different things.

In a general sense, I believe Date's solution was to create a table of middle names, and a relation table to link people to their middle names. And Gary simply wouldn't have such a relationship.

NULL is definitely one of the more interesting topics, I've blogged on it three separate times (see my archives if you're interested).

Carlos said...

Ok Tonguc, I think I didn't make myself clear, I know there are a lot of issues using NULL values in constraints, calculations, etc, but my point is that when designing a database the last option should be to allow the user to insert a NULL value.

I think an attribute should have all the possible options like 'unknown', 'not defined',etc and if the attribute is numeric an exact range of numbers, but not let the db defines what kind of NULL was inserted...

APC said...

Robert

Well Gary definitely refutes the idea that his middle name is an empty string.

>> I believe Date's solution was to create a table of middle names, and a relation table to link people to their middle names. And Gary simply wouldn't have such a relationship.

That was the solution which Hugh Darwen presented and to which I alluded. The problem is sixth normal form usually ends up with a table per column. And we still have to handle the null values we get from the outer joins.

Cheers, APC

Anonymous said...

Hi Andrew,

Being the author of the article Mr. Myers used as basis for his middle name concern, I'll put my 2 cents to the discussion.

BTW: The article is here:

http://www.databasedesign-resource.com/null-values-in-a-database.html

Pay attention to the COUNT problem...

Now, we live in a world of compromises. But let's get to the very point of it:
The relational model says that tuples contain TRUE propositions. When a row (a proposition) contains one or more NULLs, we don't know if it's true anymore. Hence, we are outside the relational model. It's not more complicated than that.

What we really are saying, is that, given today's DB products (performance issues, multiple joins etc. etc.) we must live/deal with NULLs. Even Chris Date and Fabian Pascal says so (I've spoken to both of them). That doesn't make NULLs more sound, though.

I wish Mr. Myers also would read the parent section of my article:

http://www.databasedesign-resource.com/database-theory-and-practice.html

because there would possibly be even more material to blog on :-), however, it makes sense, doesn't it?

And as for the SELECT DISTINCT...

What you say, really, is that youe have two (or more) identical propositions... Is that relational? :-)

It's a hard world.

Best regards,
Alf

Anonymous said...

These comments have been invaluable to me as is this whole site. I thank you for your comment.