Entity-Attribute-Value is evil - how to avoid...

  • Jason A. Long - Wednesday, November 28, 2018 11:40 PM

    Jeff Moden - Wednesday, November 28, 2018 8:59 PM

    Totally agreed.  Like anything else in T-SQL and SQL Server, "It Depends" and there are some incredible uses for both EAVs and NVPs.

    I totally disagree... They violate pretty much every principle of the relational model. The fact that you will, by design, give up data type constraints, with BITs, INTs (of all flavors), DATEs, MONEY and everything else all being stored as NVARCHAR(MAX)... Because you have to be able to store anything and everything...  No FK or CHECK constraints either... So it's inevitable that you'll end up with "Amount Due = 'Bird'" or "Date of Birth = $42.99"...
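
    To make that failure mode concrete, here is a minimal sketch (table and column names are hypothetical) of the untyped shape being described. Both inserts succeed, because nothing in the schema is able to object:

        -- Generic EAV: every value collapses to NVARCHAR(MAX), so no data type,
        -- FK, or CHECK constraint can protect the data.
        CREATE TABLE dbo.EntityAttributeValue
        (
            EntityId  int           NOT NULL,
            Attribute nvarchar(128) NOT NULL,
            [Value]   nvarchar(max) NULL,  -- has to hold "anything and everything"
            CONSTRAINT PK_EAV PRIMARY KEY (EntityId, Attribute)
        );

        INSERT dbo.EntityAttributeValue (EntityId, Attribute, [Value])
        VALUES (1, N'Amount Due',    N'Bird'),
               (1, N'Date of Birth', N'$42.99');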

    As far as calling them tools... maybe... but only in the sense that a pointy rock or broken beer bottle can be considered tools... The primary difference, of course, being that I can actually envision circumstances where I'd be grateful to have a pointy rock or broken beer bottle.

    That said, I totally understand the aversion to arbitrarily restrictive rules that limit the tools at our disposal... So... I'll keep an open mind... What is a valid use case where using an EAV is the preferred option? And when was the last time that either of you actually chose to implement an EAV, rather than standard lookup tables?

    I haven't had direct experience with EAV but have taken the time to research this technology and I can see use cases where they could be useful.  I did have the opportunity a few years ago to interview for a position where EAV tables were being used in automated processes dealing with compliance of software products across numerous legal and industry standards.  That company happened to be Microsoft.  If they are using them in a specific use case, seems to me that perhaps it is a tool worth understanding.

  • I think there's an important distinction between sparse attributes known at design time, and a design that has to accommodate unknown attributes added by users over time.  It would be evil to start out with an EAV design to avoid thinking about database structure, but if you've got to deliver something that users can extend to handle attributes they can't tell you about in advance, then it may be the less-evil solution.

    I have seen canned applications with tables full of UserInt1, UserInt2, ..., UserString1, UserString2, ... etc columns so user-defined attributes can be added in the field.  This kind of works, but I never liked it.  If you're familiar with the database it's not too bad, but otherwise you have no idea why a query would have "AND UserInt17 = 5".  (And that guy that doesn't work here any more had all the documentation.)  If it has n empty columns for a particular datatype, you will inevitably need n+1 eventually.
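
    For readers who haven't run into it, a minimal sketch of that "spare columns" pattern might look like this (names are hypothetical); the meaning of each generic column lives only in tribal knowledge:

        -- A fixed pool of generic columns for "user-defined" attributes.
        CREATE TABLE dbo.Widget
        (
            WidgetId    int IDENTITY(1,1) PRIMARY KEY,
            WidgetName  nvarchar(100) NOT NULL,
            UserInt1    int NULL,
            UserInt2    int NULL,
            -- ...however many spares the vendor guessed you'd need...
            UserString1 nvarchar(100) NULL,
            UserString2 nvarchar(100) NULL
        );

        -- Later, a predicate like this says nothing about intent.
        SELECT WidgetId, WidgetName
        FROM   dbo.Widget
        WHERE  UserInt2 = 5;   -- which business rule is this?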

  • andycadley - Thursday, November 29, 2018 12:57 AM

    What you are describing is absolutely the worst-case implementation of an EAV approach: using it to hold structured data, where you have known data types and relational requirements on the data. It is possible to force almost any design into an EAV data model, but then you will suffer all the pain you can possibly imagine trying to maintain or query it.

    However, not all situations are quite that extreme. For a case where there is a need for user-defined metadata fields on a record, the design that naturally falls out of a relational model is basically EAV, with the user able to create their own "field names" and assign some arbitrary data to them on a record-by-record basis. This can be an effective design, with the caveat that such metadata is typically less queryable than it otherwise would be.

    Even if you need differing data types (let's assume a user might want to add dates or yes/no metadata), you aren't necessarily restricted to storing everything in one varchar column - you can have different tables for each of the metadata types and allow front-end systems to handle the process differently.
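
    A rough sketch of that per-type split, with hypothetical names, is below; the attribute names are still user-defined, but each value keeps a real data type:

        CREATE TABLE dbo.RecordMetadataString
        (
            RecordId  int           NOT NULL,
            FieldName nvarchar(128) NOT NULL,
            [Value]   nvarchar(400) NULL,
            CONSTRAINT PK_RecordMetadataString PRIMARY KEY (RecordId, FieldName)
        );

        CREATE TABLE dbo.RecordMetadataDate
        (
            RecordId  int           NOT NULL,
            FieldName nvarchar(128) NOT NULL,
            [Value]   date          NULL,
            CONSTRAINT PK_RecordMetadataDate PRIMARY KEY (RecordId, FieldName)
        );

        CREATE TABLE dbo.RecordMetadataBit
        (
            RecordId  int           NOT NULL,
            FieldName nvarchar(128) NOT NULL,
            [Value]   bit           NULL,
            CONSTRAINT PK_RecordMetadataBit PRIMARY KEY (RecordId, FieldName)
        );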

    Ultimately EAV is a tool - it's one that is very easy to abuse and one that people who aren't database design experts sometimes gravitate towards because it looks like something that means you don't ever have to think about databases. Of course it's precisely because of those factors that it can quickly become an anti-pattern. Certainly if you come up with a system designed such that the entire thing is EAV, then it's almost certainly fundamentally flawed.

    a) Of course I'm thinking about the worst case scenario... It's kind of what we get paid to do. It's the same reason we go through the hassle of taking backups, setting up availability groups and creating disaster recovery plans. My first two questions tend to be, "how can this go wrong?" and "how wrong can it go?"
    When the question is EAVs, the answers are "lots of ways" and "really really wrong".

    b) So... Your answer to the EAV data type problem is more EAVs... Thank you... You just summed up the eternal struggle that results from using kluge fixes to fix the previous kluge fixes.  

    c) Why would you say this? If EAVs aren't fundamentally flawed, then how would an entire EAV database be fundamentally flawed?

  • carlo.pagliei 63874 - Thursday, November 29, 2018 1:26 AM

    Jason A. Long - Wednesday, November 28, 2018 6:33 PM

    I see your point since I don't like EAV either.
    Just saying "I'll call them evil" is easy... do you have a pragmatic solution to share?

    Yes... Create normalized tables. We tend to give them special names like "lookup tables" but, in the end, they are just normal tables... And they work beautifully.
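
    As a minimal illustration of that answer (hypothetical names), a lookup table plus a foreign key gives you exactly the guarantees the EAV gives up:

        CREATE TABLE dbo.OrderStatus
        (
            OrderStatusId tinyint     NOT NULL PRIMARY KEY,
            StatusName    varchar(30) NOT NULL UNIQUE
        );

        CREATE TABLE dbo.Orders
        (
            OrderId       int     IDENTITY(1,1) PRIMARY KEY,
            OrderStatusId tinyint NOT NULL
                CONSTRAINT FK_Orders_OrderStatus
                REFERENCES dbo.OrderStatus (OrderStatusId),
            AmountDue     money   NOT NULL,  -- typed, so 'Bird' can never get in
            OrderDate     date    NOT NULL
        );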

  • patrickmcginnis59 10839 - Thursday, November 29, 2018 8:07 AM

    I see them quite a bit with a particular use case, and it would be interesting to hear of an alternative implementation. Pretty much, if you have a database that lists things and attributes for these things, normally you'd have attributes that specify these things in columns, right? Well, how would you enter a brand new attribute for this thing WITHOUT adding a new column? EAV tables just let you add a new row with the attribute and value for any particular thing.

    Without EAV, you'd have to add a new column each time you want to have a new attribute to assist in describing this thing, but with EAV, you just add the new attribute and the value for that attribute in a new row. I suspect relational purists who hate EAV will make you alter the table to add the new attribute as a column or columns, but this means the end user cannot independently add new attributes unless they can programmatically alter the table to add a new column.
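
    The trade-off being described looks roughly like this (the Product and ProductAttribute tables are hypothetical and assumed to already exist); one path is a schema change, the other is just a row:

        -- Relational route: a new attribute is a new column, i.e. DDL.
        ALTER TABLE dbo.Product ADD WarrantyMonths int NULL;

        -- EAV route: a new attribute is just another row, which an end user
        -- can create from a data-entry screen without any DDL rights.
        INSERT dbo.ProductAttribute (ProductId, Attribute, [Value])
        VALUES (42, N'WarrantyMonths', N'24');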

    1) Why is "WITHOUT adding a new column?" a reasonable restriction?
    If a particular attribute is indeed an attribute under the rules of normalization, why would you not add that column to the table?

    2) Maybe relational databases aren't your thing... Have you considered NoSQL, the database where everything's made up and the points don't matter?

  • Steve Jones - SSC Editor - Thursday, November 29, 2018 9:18 AM

    @jason

    I do think EAV has a place, especially in the real world. We don't live in a perfect world, especially with regard to requirements/specifications and time. We do need to implement designs that are effective and efficient, but also timely.

    EAVs are dangerous, and should be used judiciously where they fit a problem. In this case, I have data that isn't well normalized in structure. I've had this as well in the past where we had attributes about products. Let's say I have widgets and knickknacks. They both have a price, a cost, a size, etc.

    If widgets have a UPC, I certainly could add that to the products table, but if it's not a part of knickknacks, does that make sense? Maybe. It's a design decision. If knickknacks have a capacity, but widgets don't, do I add that as well? Sparse columns were added for this reason, but to me, we're still building a strange structure that isn't normalized. Better would be to add tables for UPC and capacity, but then I get into a complex query pattern based on product.
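
    For what it's worth, the sparse-column option mentioned here might be sketched like this (hypothetical names); NULLs in SPARSE columns cost almost no storage, but the table is still not really normalized:

        CREATE TABLE dbo.Products
        (
            ProductId   int IDENTITY(1,1) PRIMARY KEY,
            ProductName nvarchar(100) NOT NULL,
            Price       money         NOT NULL,
            UPC         varchar(14)   SPARSE NULL,  -- only widgets have one
            Capacity    decimal(9,2)  SPARSE NULL   -- only knickknacks have one
        );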

    I have to decide on whether the normalization makes sense and sometimes it doesn't. If we're querying these attributes very rarely, and I want to keep my index small and narrow on the main table, I could do a vertical partition of the "sparse" attributes, but I could also use EAV. I've chosen both, but when I do, I also monitor the percentage of attributes vs. the number of entities. If lots start to need an attribute, I often think it's better to stick it back in the main table.

    That's fine. Software evolves, the requirements change. I can make the change and it usually is not a big  problem for app dev.

    What I often find is that these strange attributes are items that  are  speculative requirements on the part of the client. Many of them die away, and so the EAV is fine. Didn't really matter.

    In the case where sparse attributes are an issue, I would lean toward using separate, normalized tables.
    So if the only attributes that widgets and knickknacks have in common is the fact that they are both products, then I would look at splitting them into two tables, like we would do when separating durable goods from services... or... keep the products table and include the attributes that apply to all products, and then use separate "daughter tables" to hold the attributes that apply only to specific products or product types.
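
    A sketch of that "daughter table" layout, with hypothetical names: common attributes stay in the products table, and the type-specific ones move into 1:1 child tables keyed by the same ProductId.

        CREATE TABLE dbo.Products
        (
            ProductId   int IDENTITY(1,1) PRIMARY KEY,
            ProductName nvarchar(100) NOT NULL,
            Price       money NOT NULL,
            Cost        money NOT NULL
        );

        CREATE TABLE dbo.WidgetAttributes
        (
            ProductId int NOT NULL PRIMARY KEY
                CONSTRAINT FK_WidgetAttributes_Products
                REFERENCES dbo.Products (ProductId),
            UPC       varchar(14) NOT NULL
        );

        CREATE TABLE dbo.KnickknackAttributes
        (
            ProductId int NOT NULL PRIMARY KEY
                CONSTRAINT FK_KnickknackAttributes_Products
                REFERENCES dbo.Products (ProductId),
            Capacity  decimal(9,2) NOT NULL
        );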

  • Lynn Pettis - Thursday, November 29, 2018 11:51 AM

    I haven't had direct experience with EAV but have taken the time to research this technology and I can see use cases where they could be useful.  I did have the opportunity a few years ago to interview for a position where EAV tables were being used in automated processes dealing with compliance of software products across numerous legal and industry standards.  That company happened to be Microsoft.  If they are using them in a specific use case, seems to me that perhaps it is a tool worth understanding.

    Microsoft is the only reason I've ever had to learn to use the bitwise operators... Their ability to ignore the principles of good relational design is borderline magical.
    I have a difficult time holding them up as a shining example of how things should be done.

  • Jason A. Long - Thursday, November 29, 2018 2:11 PM

    1) Why is "WITHOUT adding a new column?" a reasonable restriction?
    If a particular attribute is indeed an attribute under the rules of normalization, why would you not add that column to the table?

    2) Maybe relational databases aren't your thing... Have you considered NoSQL, the database where everything's made up and the points don't matter?

    1) The point is to let the business user enter the new attribute without programming or administrative access. If you have a high velocity of attribute additions and changes then I would think that the flexibility to add attributes through screen entry is a viable strategy.

    2) This isn't a high quality comment.

  • Jason A. Long - Thursday, November 29, 2018 2:37 PM

    Microsoft is the only reason I've ever had to learn to use the bitwise operators... Their ability to ignore the principals of good relational design is border line magical.
    I have a difficult time holding them up a shining example of how things should be done.

    Too bad this interview happened several years ago so I can't recall the details, but after listening to the reasoning as to why they went with an EAV design, it is quite logical.  Again, it is a tool.  It is not evil in and of itself.  It comes down to how it is used.  No one here is saying you have to use EAV in your database designs.  If you choose to leave EAV out of your toolkit, that is your choice.  Only thing I have to say at this point: don't bash others who find use cases where they work.

  • Jason A. Long - Thursday, November 29, 2018 2:22 PM

    In the case where sparse attributes are an issue, I would lean toward using separate, normalized tables.
    So if the only attributes that widgets and knickknacks have in common is the fact that they are both products, then I would look at splitting them into two tables, like we would do when separating durable goods from services... or... keep the products table and include the attributes that apply to all products, and then use separate "daughter tables" to hold the attributes that apply only to specific products or product types.

    Don't get caught up in my example of two items. If there are 25 or 50 (or more, which I've seen), then having 25 or 50 normalized tables makes queries a nightmare. That's a real problem I've seen. Might not be your case, but depending on developer resources and other work to do, just adding tables is often difficult.

    I would argue that separate tables for two types of products is a different kind of nightmare, from an understanding and application maintenance standpoint. I've seen this before with individual customers and businesses. Separating those into different tables can cause all sorts of weird application problems with pulling that data back.

    We work with imperfect information, so I'd rather be slightly less normalized than perfectly normalized in a way that is cumbersome for others.

  • Jason A. Long - Thursday, November 29, 2018 2:01 PM

    a) Of course I'm thinking about the worst case scenario... It's kind of what we get paid to do. It's the same reason we go through the hassle of taking backups, setting up availability groups and creating disaster recovery plans. My first two questions tend to be, "how can this go wrong?" and "how wrong can it go?"
    When the question is EAVs, the answers are "lots of ways" and "really really wrong".

    b) So... Your answer to the EAV data type problem is more EAVs... Thank you... You just summed up the eternal struggle that results from using kluge fixes to fix the previous kluge fixes.  

    c) Why would you say this? If EAVs aren't fundamentally flawed, then how would an entire EAV database be fundamentally flawed?

    a) Sure, but sometimes the sky isn't falling down. Sometimes a pragmatic solution is actually viable. Designing every system around the worst possible case that is humanly imaginable either means developing a horribly over-engineered solution, or just that you aren't very good at imagining the worst possible case.

    b) No, just that you don't have to entirely jump on a single EAV structure without any other consideration whatsoever.

    c) An entire EAV database is (probably*) fundamentally flawed. There are very few, if any, situations where at least some of the data cannot be structured in a sensible, entirely normalized format. That doesn't mean there aren't situations where having part of a system in an EAV-type structure makes sense. To declare otherwise is a lot like suggesting that there is no possible legitimate use for CASE - sooner or later you reach the point where avoiding the thing you've declared "evil" just results in a far worse design than would come from studying the problem objectively.

  • Jason A. Long - Wednesday, November 28, 2018 11:40 PM

    Jeff Moden - Wednesday, November 28, 2018 8:59 PM

    Totally agreed.  Like anything else in T-SQL and SQL Server, "It Depends" and there are some incredible uses for both EAVs and NVPs.

    I totally disagree... They violate pretty much every principle of the relational model. The fact that you will, by design, give up data type constraints, with BITs, INTs (of all flavors), DATEs, MONEY and everything else all being stored as NVARCHAR(MAX)... Because you have to be able to store anything and everything...  No FK or CHECK constraints either... So it's inevitable that you'll end up with "Amount Due = 'Bird'" or "Date of Birth = $42.99"...

    As far as calling them tools... maybe... but only in the sense that a pointy rock or broken beer bottle can be considered tools... The primary difference, of course, being that I can actually envision circumstances where I'd be grateful to have a pointy rock or broken beer bottle.

    That said, I totally understand the aversion to arbitrarily restrictive rules that limit the tools at our disposal... So... I'll keep an open mind... What is a valid use case where using an EAV is the preferred option? And when was the last time that either of you actually chose to implement an EAV, rather than standard lookup tables?

    Heh... I do it all the time when I get data that has columns (for example) by day (or hour or week or month) by company or person or other entity.  The "normalization" is actually to convert them to an EAV with the company or person name (Entity column), convert the dated column name to a date in the table (Attribute column), and put the "value" in the row that the original intersection indicates.  Think of it as converting each intersection (cell, if you will) to a single row that contains the correct entity and attribute in the row.  Admittedly, most people don't think of such a thing as an EAV but that's exactly what it is. 
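
    A sketch of that reshaping, with a hypothetical source table (MonthlyCrossTab, one money column per date), might use CROSS APPLY to turn each "cell" into an Entity/Attribute/Value row:

        SELECT src.CompanyName             AS Entity,
               CONVERT(date, v.ReportDate) AS Attribute,
               v.Amount                    AS [Value]
        FROM   dbo.MonthlyCrossTab AS src
        CROSS APPLY (VALUES ('20181101', src.[2018-11-01]),
                            ('20181102', src.[2018-11-02]),
                            ('20181103', src.[2018-11-03])
                    ) AS v (ReportDate, Amount)
        WHERE  v.Amount IS NOT NULL;   -- drop the empty intersections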

    There are also other essentials such as column-based auditing on really wide tables, where you store the name of the table, the column name, the PK value, and the "old value" of a (pardon the expression) "cell" in a table that has been changed, because you don't want the incredible wasted space associated with storing whole rows when only 1 or 2 rows fit on a page and only a couple of columns per row are changed.  That's an EAV on steroids.  And, yes... I agree that it's a tradeoff in complexity (especially when trying to reconstruct a row to a point in time) to save space, but it does save a huge amount of disk space when properly justified and implemented.

    As for the "other" type of EAV where users can effectively "add a column", I have a deep hatred for such things but it's also sometimes necessary because I'm not actually going to modify a large table by adding a physical column every time a user thinks it's a good idea.  It takes a whole lot of work and slows things down to infer any kind of guess at DRI (when necessary) but at least you can preserve most datatypes by using (UGH!  HAAAAAAACKKKK.... PAAATOOOOOIIIIIII!) the SQL_VARIANT datatype.

    As always, "It Depends".  Well, mostly.  If I ever ran across a company that used only EAVs like Lynn did, I'd run like hell. 😀

    --Jeff Moden

  • Jeff Moden - Thursday, November 29, 2018 5:48 PM

    Heh... I do it all the time when I get data that has columns (for example) by day (or hour or week or month) by company or person or other entity.  The "normalization" is actually to convert them to an EAV with the company or person name (Entity column), convert the dated column name to a date in the table (Attribute column), and put the "value" in the row that the original intersection indicates.  Think of it as converting each intersection (cell, if you will) to a single row that contains the correct entity and attribute in the row.  Admittedly, most people don't think of such a thing as an EAV but that's exactly what it is. 

    There are also other essentials such as column-based auditing on really wide tables, where you store the name of the table, the column name, the PK value, and the "old value" of a (pardon the expression) "cell" in a table that has been changed, because you don't want the incredible wasted space associated with storing whole rows when only 1 or 2 rows fit on a page and only a couple of columns per row are changed.  That's an EAV on steroids.  And, yes... I agree that it's a tradeoff in complexity (especially when trying to reconstruct a row to a point in time) to save space, but it does save a huge amount of disk space when properly justified and implemented.

    As for the "other" type of EAV where users can effectively "add a column", I have a deep hatred for such things but it's also sometimes necessary because I'm not actually going to modify a large table by adding a physical column every time a user thinks it's a good idea.  It takes a whole lot of work and slows things down to infer any kind of guess at DRI (when necessary) but at least you can preserve most datatypes by using (UGH!  HAAAAAAACKKKK.... PAAATOOOOOIIIIIII!) the SQL_VARIANT datatype.

    As always, "It Depends".  Well, mostly.  If I ever ran across a company that used only EAVs like Lynn did, I'd run like hell. 😀

    Can't say everything was EAV with the Microsoft group I interviewed with several years ago.  It may have just been a few specific tables that were configured that way.  I was never shown the entire system.  So again, it is a tool that can be used where appropriate.

  • Steve Jones - SSC Editor - Thursday, November 29, 2018 3:24 PM

    Don't get caught up in my example of two items. If there are 25 or 50 (or more, which I've seen), then having 25 or 50 normalized tables makes queries a nightmare. That's a real problem I've seen. Might not be your case, but depending on developer resources and other work to do, just adding tables is often difficult.

    I would argue that separate tables for two types of products is a different kind of nightmare, from an understanding and application maintenance standpoint. I've seen this before with individual customers and businesses. Separating those into different tables can cause all sorts of weird application problems with pulling that data back.

    We work with imperfect information, so I'd rather be slightly less normalized than perfectly normalized in a way that is cumbersome for others.

    My response wasn't based on 2 items or even two categories; I could and would apply that methodology as often as needed. That said, I will admit to being a bit of a fundamentalist when it comes to normalization. If an attribute doesn't apply equally to every tuple in a relation, then it isn't an attribute of that relation...

    Your comment did, however, kick off a little thought experiment. It went something like this...
    The obvious flaw in my methodology is that it requires that I know something about my product(s) and that I know which attributes do and don't apply to them...
    What if it were physically impossible to apply that methodology?
    What if I worked for a company like Amazon, which sells millions of different products that fall into thousands of different categories, with many spanning multiple categories?
    If it's impossible for Amazon to keep every attribute of every product in a normalized table structure, then what are they doing and how are they doing it?
    Answer: Dunno... I don't work there... Better question... If I were tasked with designing a back end that had to do the same thing as Amazon, how would I go about it?
    And that's where I got stuck... At least until it dawned on me that they, as a company, don't care about the same product attributes that I, as a customer, care about. They don't care if the pan I'm looking at can safely go into a 500 degree oven, or that the revolutionary non-stick coating is made of magic pixie dust making it better than the other guy's pan... They care about its length, width, height and weight so that they know what size box to put it in, and where it happens to be in the warehouse so that the little robot can go get it and, of course, where to send it.

    It also occurred to me that searching on Amazon isn't a traditional parametric search, the kind where you do this number: house goods > kitchen > cooking utensils > pots & pans > pans > aluminum > non-stick. I just type in a phrase that roughly describes what I'm looking for and it magically finds a bunch of stuff that more or less resembles my search. After looking at a few things to see if there are different words associated with the attributes I'm looking for, I change up my wording and search again... And that gets me a little closer... I continue the refining process until I find what I'm looking for.

    Okay, so how would I go about doing that?
    Off the top of my head... I'd start by parsing the search phrase into individual words, make an attempt to offer spelling corrections for words that aren't in my dictionary, perhaps attempt to prioritize certain words over others, and then go search...

    Search what? Search where? This feels like one of those "How do I find a specific value in any column, in any table, anywhere in my database?", kind of forum questions...

    Okay... Amazon does it, so you know it's possible... How do you search every attribute column all at once?
    Answer: Unpivot all of them into two columns, one for values and the other for the column (attribute) name, persist the unpivoted data and then index it...
    What does that look like? 
    Yea... It looks just like an EAV... Because it is one.
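
    A rough sketch of what that persisted, indexed unpivot might look like (hypothetical names; a real system would likely put full-text search or a dedicated search engine on top of it):

        CREATE TABLE dbo.ProductSearchAttribute
        (
            ProductId      int           NOT NULL,
            AttributeName  nvarchar(128) NOT NULL,
            AttributeValue nvarchar(400) NOT NULL
        );

        CREATE INDEX IX_ProductSearchAttribute_Value
            ON dbo.ProductSearchAttribute (AttributeValue)
            INCLUDE (ProductId, AttributeName);

        -- "Horseshoes & hand grenades" matching: close is good enough.
        SELECT DISTINCT ProductId
        FROM   dbo.ProductSearchAttribute
        WHERE  AttributeValue LIKE N'%non-stick%';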

    But... What about all the bad things you've said about EAVs?
    Well... In this case, they don't matter!!! No one cares if "Pan A" has a size measured in millimeters and "Pan B" is measured in inches. No one cares that "Pan C" has 50 bullet points describing every possible attribute a pan can have, while "Pan D" barely admits to actually being a pan...
    More importantly... No one cares that "Pan E" never even showed up in the search results because its attribute values didn't match the phrasing I used in my search!

    Now... do I imagine that Amazon is putting the stuff that THEY care about into an EAV structure?
    Nope, I don't... When they have millions of products, thousands of which are "pretty close" to what you're looking for, the Horseshoes & Hand Grenades approach is fast, effective and efficient. But, when it comes to things like item location, order processing & billing and shipping information... "Close" doesn't count anymore.

    Is there a moral to the story?  Yes, I believe that I may have found, not just a legitimate use case for EAVs, but a case where they actually are the best possible solution. 
    Has this changed my opinion of EAVs? Maybe a little but not really... My problem with EAVs is based on the fact that the majority of them get used in places and in ways that simply aren't appropriate and are created by developers that think they've discovered a clever way to avoid the hassles of adding or modifying database objects...

    Edit: I think it should be noted that whatever method Amazon is actually using, it doesn't involve SQL Server or any other ACID compliant RDBMS. https://www.quora.com/What-database-does-Amazon-use-to-store-product-information#bwTeT

  • Steve Jones - SSC Editor - Thursday, November 29, 2018 9:04 AM

    @Carlo P - I think my handling tends to be more consistent. If you have translations, you're a) putting one language ahead of the others, and b) requiring me to handle translations differently than a single language. I think both of those incur some technical debt, but avoiding them comes at the expense of some (perhaps) more initial development.

    Mileage may vary. I think I'd be interested in what you choose, and really be interested in an article on how/why you did this if you're up for it.

    Steve
    Consider the following scenario: you operate in a video production house that sells series to different countries... A European customer buys one of your productions for broadcasting in 4 countries and requires you to package the files for Portugal, Italy, Spain and France with proper subtitles, paybill and abstract, all localized in their respective languages. When configuring the package builder you say "use the X metadata in the Y language and put it in a file called abstract.txt". If I use your way, I have to store the language and the "metadata name" and use them both in a WHERE filter; the other way, I store two IDs and need a join on the lang_id... At first look it seems to me more "normalized" and structured to bind the configuration to a unique ID, because the ID is a precise reference to a metadata item with a defined meaning... a "metadata name" is something weaker, left to the user, and it may change over time, as I also need to translate both the name and the value of the metadata (the name for the user UI, the value for both the user UI and system operations).
    What do you think?
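
    For comparison, a sketch of the ID-based variant being argued for (all names hypothetical): the metadata definition has a stable ID, and the per-language value hangs off it, so the package builder can reference "metadata X in language Y" by keys rather than by a user-editable name:

        CREATE TABLE dbo.Languages
        (
            LangId   char(2)      NOT NULL PRIMARY KEY,  -- 'pt', 'it', 'es', 'fr'
            LangName nvarchar(50) NOT NULL
        );

        CREATE TABLE dbo.MetadataDef
        (
            MetadataId int           NOT NULL PRIMARY KEY,
            SystemName nvarchar(128) NOT NULL UNIQUE  -- stable internal reference
        );

        CREATE TABLE dbo.TitleMetadataValue
        (
            TitleId    int           NOT NULL,
            MetadataId int           NOT NULL REFERENCES dbo.MetadataDef (MetadataId),
            LangId     char(2)       NOT NULL REFERENCES dbo.Languages (LangId),
            [Value]    nvarchar(max) NULL,
            CONSTRAINT PK_TitleMetadataValue PRIMARY KEY (TitleId, MetadataId, LangId)
        );

        -- Package builder: "use metadata X in language Y".
        SELECT v.[Value]
        FROM   dbo.TitleMetadataValue AS v
        WHERE  v.TitleId = 1001 AND v.MetadataId = 7 AND v.LangId = 'pt';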
