PDF Datasource

  • All,

    We have a requirement to load data from PDF files. If any of you have worked on such a scenario, please shed some light.

    Thanks

  • That doesn't seem practical. I'd be asking why you're getting data in PDF format, when the vast majority of software programs out there have other output options. I suspect you could do it if you can find a way to script Adobe Acrobat, but even then, it might turn out to be much harder than you ever anticipated. Google the web for PDF to text converters, but don't count on anything you find being able to produce consistent results across differing PDF files. Also, you'd need to know if the PDF files only contain a scanned image, without any OCR conversion, at which point you're probably dead in the water.

    Steve

    (aka smunson)

    :):):)

    We found a way to deal with it. We found a product called ActivePDF that converts PDF into XML, and we then load the data from the XML, since our PDFs contain mostly tabular data. This is helping us.

    Can the gurus shed some light on processing Excel reports in SSIS? Our requirement is that we have data without labels in an Excel file that has to be loaded into staging.

    For example:

    Model                       Quantity   Price

    Electronics (Category)
    Television (SubCategory)
    21" Flat                    30         $200
    25" Non flat                12         $300
    Audio systems (SubCategory)
    21" Mini                    33         $2300
    25" Hi-FI                   123        $3300

    The first row is the column header that appears in the report. The (Category) and (SubCategory) labels are shown only for illustration and do not appear in the file.
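A minimal Python sketch of how the report layout above could be flattened into staging rows once the sheet has been read into a list of rows. It is not an SSIS component, and everything named here (`flatten_report`, `is_label_row`, `SAMPLE_ROWS`) is hypothetical. Because the file carries no Category/SubCategory markers, the sketch assumes a heuristic: a row with only the first cell filled is a label, a label followed directly by another label is a category, and a label followed by detail rows is a subcategory.

```python
from decimal import Decimal
from typing import Dict, Iterator, Optional, Sequence

Row = Sequence[Optional[object]]


def is_blank(cell: Optional[object]) -> bool:
    """True for None or whitespace-only cells."""
    return cell is None or str(cell).strip() == ""


def is_label_row(row: Row) -> bool:
    """A label row has text in the first column and empty Quantity/Price."""
    return not is_blank(row[0]) and all(is_blank(c) for c in list(row)[1:3])


def flatten_report(rows: Sequence[Row]) -> Iterator[Dict[str, object]]:
    """Yield one flat record per detail row, tagged with its labels."""
    # Drop fully empty rows, then skip the header (Model, Quantity, Price).
    body = [r for r in rows if any(not is_blank(c) for c in r)][1:]
    category = subcategory = None
    for i, row in enumerate(body):
        if is_label_row(row):
            nxt = body[i + 1] if i + 1 < len(body) else None
            if nxt is not None and is_label_row(nxt):
                # ASSUMPTION: a label directly above another label is a category.
                category, subcategory = str(row[0]).strip(), None
            else:
                # ASSUMPTION: a label directly above detail rows is a subcategory.
                subcategory = str(row[0]).strip()
            continue
        model, quantity, price = list(row)[:3]
        yield {
            "category": category,
            "subcategory": subcategory,
            "model": str(model).strip(),
            "quantity": int(float(str(quantity))),
            # Strip the currency symbol and thousands separators before parsing.
            "price": Decimal(str(price).replace("$", "").replace(",", "").strip()),
        }


# Sample rows mirroring the example in the post (labels removed).
SAMPLE_ROWS = [
    ["Model", "Quantity", "Price"],
    ["Electronics", None, None],
    ["Television", None, None],
    ['21" Flat', "30", "$200"],
    ['25" Non flat', "12", "$300"],
    ["Audio systems", None, None],
    ['21" Mini', "33", "$2300"],
    ['25" Hi-FI', "123", "$3300"],
]

records = list(flatten_report(SAMPLE_ROWS))
```

Each record then carries its own category and subcategory, so it can be mapped to staging columns one-to-one. If the real reports distinguish the label levels some other way (indentation, cell formatting, a fixed list of category names), that rule would replace the look-ahead heuristic.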

  • Ok, so you ended up with XML that isn't structured the way you need it to be to get SSIS to be able to import it in a practical fashion. I'm not sure that ActivePDF is thus a solution, but you are closer than I initially envisioned. You may have to seek out an XML manipulation tool that can automatically "repair" the structure of the XML to more closely match the record structure you desire.

    I have this same problem with Quicken's QFX data format for downloads from financial institutions. It's SGML, and I found a tool that can at least convert to well-formed XML, but the data structure I end up with is intractable, not unlike what you have here. I suspect that "flattening" is the word for what we'd both like to do to our XML, but I'm not sure that's all that easy to do, despite the presence of flattening tools. I tried one to no avail, and had to give up due to the considerable impracticality of where I was going to end up.

    Steve

    (aka smunson)

    :):):)

    You are right. We need to flatten the input. Otherwise, even once we have the XML, some tedious parsing needs to be done to extract the data and map it to the columns.
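As a rough illustration of the "flattening" discussed above, here is a short Python sketch using the standard `xml.etree.ElementTree` module. The XML layout (`report`, `category`, `subcategory`, `item`) is invented for the example and is not the actual ActivePDF output, which the thread does not show. The function name `flatten_xml` and its `record_tag` parameter are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# HYPOTHETICAL input: the real converter output will look different.
SAMPLE_XML = '''
<report>
  <category name="Electronics">
    <subcategory name="Television">
      <item><model>21" Flat</model><quantity>30</quantity><price>200</price></item>
      <item><model>25" Non flat</model><quantity>12</quantity><price>300</price></item>
    </subcategory>
    <subcategory name="Audio systems">
      <item><model>21" Mini</model><quantity>33</quantity><price>2300</price></item>
    </subcategory>
  </category>
</report>
'''


def flatten_xml(xml_text: str, record_tag: str) -> List[Dict[str, str]]:
    """Return one flat dict per <record_tag> element.

    Attributes of every ancestor are copied down as '<tag>_<attribute>'
    keys, so each row is self-describing and can be mapped to columns.
    """
    rows: List[Dict[str, str]] = []

    def walk(elem: ET.Element, inherited: Dict[str, str]) -> None:
        if elem.tag == record_tag:
            row = dict(inherited)
            row.update(elem.attrib)
            # Child elements of the record become columns of the row.
            for child in elem:
                row[child.tag] = (child.text or "").strip()
            rows.append(row)
            return
        # Not a record: pass this element's attributes down to descendants.
        context = dict(inherited)
        for key, value in elem.attrib.items():
            context[f"{elem.tag}_{key}"] = value
        for child in elem:
            walk(child, context)

    walk(ET.fromstring(xml_text), {})
    return rows


flat_rows = flatten_xml(SAMPLE_XML, "item")
```

The resulting list of dicts could be written out as CSV or a single-level XML file, either of which is far easier for an import tool to consume than the nested original. Values stay as strings here, and type conversion is left to the load step.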
