ChatGPT 4o vs. o1 vs. o1 Pro testing results

I went ahead and sprang for ChatGPT Pro at $200 a month, as I do a lot of data analysis on very large and complicated data sets.

The first thing I threw at it was last month's full Perforce logs, diffs and all, so that I could get a nice friendly report on the various developers: how many CLs each one submitted, what type of work they did, and how much of it.
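The CL-count part of a report like this is easy to sanity-check without a model. Here is a minimal sketch, assuming the log contains header lines in the shape that `p4 changes` prints ("Change <number> on <date> by <user>@<workspace> ..."). The function name, the regex, and the sample lines are all made up for illustration; a log exported another way (for example via `p4 describe`) would need a different pattern.

```python
import re
from collections import Counter

# Assumed header shape, as printed by `p4 changes`:
#   Change 1234 on 2024/11/03 by alice@alice-ws 'Fix crash in loader'
CHANGE_LINE = re.compile(r"^Change (\d+) on (\d{4}/\d{2}/\d{2}) by ([^@\s]+)@(\S+)")


def count_changelists(lines):
    """Return a Counter mapping user name to number of distinct changelists."""
    seen = set()        # changelist numbers already counted
    counts = Counter()  # user -> number of changelists
    for line in lines:
        match = CHANGE_LINE.match(line)
        if not match:
            continue  # diff hunks, descriptions, blank lines, etc.
        change_id, user = match.group(1), match.group(3)
        if change_id in seen:
            continue  # the same CL can appear twice in a concatenated log
        seen.add(change_id)
        counts[user] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not real data.
    sample = [
        "Change 101 on 2024/11/01 by alice@ws1 'Fix loader'",
        "Change 102 on 2024/11/02 by bob@ws2 'Add tests'",
        "Change 103 on 2024/11/02 by alice@ws1 'Refactor'",
        "+ some diff line that is not a header",
        "Change 103 on 2024/11/02 by alice@ws1 'Refactor'",
    ]
    for user, n in count_changelists(sample).most_common():
        print(f"{user}: {n}")
```

This only covers the counting. Judging the type and amount of work from the diffs is the subjective part that the models were being asked to do.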

Here are my results:

  • ChatGPT 4o.  Basically useless.  It didn't even get the number of CLs right and refused to even try to estimate the scope of work.
  • o1.  Much better.  It got the number of CLs right but couldn't really figure out how much work each developer did.  It still gave me a decent overview.
  • o1 Pro.  Very impressive.  The big difference here is that it did a relative analysis, so each developer's work rating was scored against the others'.

So is o1 Pro worth it? Not unless you're doing a lot of data analysis, and even then... I'm not sure.

For me specifically it is probably worth it.  There are plenty of dedicated tools out there that will do this kind of analysis, but the thing is, I don't have to know or care about them.  This is a generalized tool.

Reply #1

We code in hardware development, too (not just software), and one thing I always hated was being assessed based on how many lines of code I changed.  Or worse, how many new lines of code I added.  I deliberately strive for the simplest solutions possible, and to perturb the existing code as little as possible, but when you ask me at performance review time how many lines of code I wrote, you are incentivizing me to write bad code.  The assessment needs to be how many problems you actually solved and how materially you improved the product.  An AI can probably improve the data you get to make that subjective evaluation (and to justify it), but that evaluation will probably need to remain subjective.  But good grief, if you're still going to count the number of lines we coded as our performance metric, at least include the comments in that count.  Do you want to disincentivize us from documenting our code, too?