Splitting a non-uniform concatenated address column into 2 different columns

Question

Post reply

Splitting a non-uniform concatenated address column into 2 different columns

Passionate SQL Server Developer

SSC Enthusiast

Points: 147
More actions
January 13, 2015 at 1:46 pm

#313630

Hello,

I have a very interesting problem in T-SQL coding for which I can't figure out the solution. Actually there is a Line_1_Address column in our data warehouse address table which is being populated from various sources. Some sources have already concatenated house number + street address fields in the Line_1_Address column whereas one source has separated columns for both data fields.

Now I'm trying to extract data from this data warehouse table and I need to split the house number from street address and load it into separate columns in my destination table. In case there is no data for house number then I should load it as NULL.

The issue is that data in this Line_1_Address column is very inconsistent so I don't know which functions to use. If anyone can suggest me a possible solution then that would be great. here is some sample data for your consideration:

Line_1_Address

101 E Commerce ST

120 E Commerce ST

2 Po Box

301 W. Bel Air Ave

West Main Street, PO Box 1388

116 N Louisville ST

903 Po Box

2950 W Market ST

300 N. Cleveland-Massillon Road

820 Main Street

521 1st ST N

2301 Dawson RD

611 Pointe North BLVD

173 North 2nd Street

410 Hillabee ST

115 North St. Asaph Street

115 North St. Asaph Street

115 North St. Asaph Street

115 North St. Asaph Street, PO Box 26248

1800 N. Beauregard Street

1800 N. Beauregard Street, #100

1800 N. Beauregard Street, Suite 1800

211 N. Union Street, Suite 154

228 South Washington Street, Suite 200

228 South Washington Street, Suite 200

5500 Cherokee Avenue, Suite 300

South Peyton Street

1605 N. Cedar Crest Blvd. Suite 410

2015 Hamilton ST STE 205

2222 South 12th Street

2895 Hamilton Blvd, #201

2895 Hamilton Boulevard

3570 Hamilton Blvd Ste 302

5000 Tilghman Street, Suite 325

5910 Hamilton Blvd

Hamilton Boulevard, Suite 105, The Atriu

3100 Royal BLVD S

3655 North Point PKWY STE 425

3375 Lynnwood Drive

108 Main St S

1831 E Three Notch ST

1807 N. Boulevard

1924 McConell Springs Road, PO Box 710

2315 N Main Street Suite 100

2315 N. Main Street, Suite 100

2315 N. Main Street, Suite 100

2315 N. Main Street, Suite 100, PO Box 2

N Main Street Suite 100, PO Box 2307

4200 Evergreen Lane

1610 West Street, Suite 105

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Sigerson Hall of Fame Points: 3697 More actions · Answer 1

First of all, good luck. You have some really filthy data to deal with.

Read up on PATINDEX. It may give you some ideas about parsing these strings. It searches for one string within another and returns an integer of the first occurrence. It can take wild card arguments which makes it useful.

Here's an example that will let you find out if your address begins with digits or not:

Declare @a varchar(30)

Set @a = '2950 W Market ST'

PRINT PATINDEX('%[^0-9]%',@a)

Result = 5, which is the first character that is NOT a digit.

[0-9] searches for a match of any digit. [^0-9] searches fpr a match on the first NON-digit.

More examples for PATINDEX in BOL

Good Luck,

Sigerson

"No pressure, no diamonds." - Thomas Carlyle

ManicStar SSCoach Points: 17992 More actions · Answer 2

You're going to have to do some sophisticated regular expression parsing to get about 90% of the way there. You probably won't be 100% successful at splitting all the records up correctly.

That being said, I would focus on the bulk, and plan on reporting on the residue. SQL supports some regex (patindex, like etc) processing, but its not going to support the more sophisticated expressions one could use for this purpose.

Passionate SQL Server Developer SSC Enthusiast Points: 147 More actions · Answer 3

Passionate SQL Server Developer

SSC Enthusiast

Points: 147

January 13, 2015 at 3:26 pm

#1769964

Thanks for your replies guy. That's what I thought too. These are just 50 records of millions of records as the addresses are for different policy holders. The problem is in come cases house number fields are 123a, 123a-112, 123, a123 etc

So I don't have any way to differentiate between whats write and whats wrong in the code 🙁

Sigerson Hall of Fame Points: 3697 More actions · Answer 4

You're going to have to do it in stages, build some temp tables, and start plowing through it. As you get individual rows sorted, place them into a holding table. Also, I'd put an integer primary key on the temp tables to join As so you can easily separate what's been sorted and what hasn't.

For the '123a' or '123-A' scenarios, once you know the position of the 1st non-digit, then look to see if the first character BEFORE or AFTER it is a space.

Sigerson

"No pressure, no diamonds." - Thomas Carlyle

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 5

You could spend an awful lot of money trying to develop this and still make some terrible mistakes that you might not ever find.

My recommendation is to simply buy some CASS certification software. It might be a couple or even several hundred dollars but that's nothing if you divide that by how much the company is paying you by the hour and how much time you're going to spend on writing something homegrown.

The CASS certification program will also do something that your homegrown solution cannot. It well play your list of addresses against actual address ranges and tell you if an address is invalid even if it's in perfect format.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Passionate SQL Server Developer SSC Enthusiast Points: 147 More actions · Answer 6

Passionate SQL Server Developer

SSC Enthusiast

Points: 147

January 13, 2015 at 8:12 pm

#1769996

Thanks for your suggestions guys. I think it doesn't make any sense for me to spend that much time on this issue since it will take too much time. I will ask my modeler to look for a work around.