Monday, January 28, 2008

Is this string a number? Really?

Jared Still ponders this question in a recent blog post. He runs some benchmarks against the various approaches and comes to the conclusion that using TRANSLATE() is the fastest approach. Which is fine, as far as it goes. It's a solution which works for Jared's situation but is not universally applicable.

Note that I have slightly complicated Jared's approach to allow for fake_number values of varying length:

SQL> select * from detect_numeric
2 order by 1
3 /
FAKE_NUMBER
--------------------
000000
000001
000002
000010
000011
000012
000020
000021
000022

9 rows selected.

SQL> select fake_number
2 from detect_numeric
3 where lpad('|', length(fake_number), '|')
4 = translate(fake_number,'0123456789','||||||||||')
5 order by 1
6 /
FAKE_N
------
000000
000001
000002
000010
000011
000012
000020
000021
000022

9 rows selected.
SQL>
So far so good. Let's add another record....

SQL> insert into detect_numeric values ('123.45')
2 /

1 row created.

SQL> select fake_number
2 from detect_numeric
3 where lpad('|', length(fake_number), '|')
4 = translate(fake_number,'0123456789','||||||||||')
5 order by 1
6 /
FAKE_N
------
000000
000001
000002
000010
000011
000012
000020
000021
000022

9 rows selected.

SQL>
Wha'pen? Isn't 123.45 is a number? Well, no, not in this context. The TRANSLATE() call is only counting digits. Hmmm, obviously we need to allow for decimal points.

SQL> select fake_number
2 from detect_numeric
3 where lpad('|', length(fake_number), '|')
4 = translate(fake_number,'0123456789.','|||||||||||')
5 order by 1
6 /
FAKE_N
------
000000
000001
000002
000010
000011
000012
000020
000021
000022
123.45

10 rows selected.

SQL>
Problem solved? Not quite. There's more to being numeric than just comprising digits and decimal points. They have to be in the right quantities and right places. Let's add an IP address to the mix....
 
SQL> insert into detect_numeric values ('127.0.0.1')
2 /

1 row created.

SQL> select fake_number
2 from detect_numeric
3 where lpad('|', length(fake_number), '|')
4 = translate(fake_number,'0123456789.','|||||||||||')
5 order by 1
6 /
FAKE_NUMBER
--------------------
000000
000001
000002
000010
000011
000012
000020
000021
000022
123.45
127.0.0.1

11 rows selected.

SQL>
And that's why we might need a function like IS_NUMERIC(), which wraps a TO_NUMBER call:

SQL> create or replace function is_numeric
2 (p_str in varchar2, p_fmt_msk in varchar2 := null)
3 return varchar2
4 as
5 return_value varchar2(5);
6 n number;
7 begin
8 begin
9 if p_fmt_msk is null then
10 n := to_number(p_str);
11 else
12 n := to_number(p_str, p_fmt_msk);
13 end if;
14 return_value := 'TRUE';
15 exception
16 when others then
17 return_value := 'FALSE';
18 end;
19 return return_value;
20 end;
21 /

Function created.

SQL> column is_numeric format a10
SQL> select fake_number, is_numeric(fake_number) is_numeric
2 from detect_numeric
3 order by 2,1
4 /
FAKE_NUMBER IS_NUMERIC
-------------------- ----------
127.0.0.1 FALSE
000000 TRUE
000001 TRUE
000002 TRUE
000010 TRUE
000011 TRUE
000012 TRUE
000020 TRUE
000021 TRUE
000022 TRUE
123.45 TRUE

11 rows selected.

SQL>

All of which underlines the importance of understanding the data with which we are working. If we just need to assert that a string consists solely of digits then a simple TRANSLATE() will suffice and will be very efficient. But if we need to assert something more precise - that the string contains a valid number - then we may require a slower but more reliable approach.

Incidentally, anyone who is interested in seeing how to use 10g's regex functionality to winnow numeric strings from non-numeric strings should read this OTN Forum post from CD.

No comments: