
Clusterf***


    It's funny, isn't it, when a live system goes belly up, how people start off helpful and then switch into self-preservation mode.

    Take me for instance. A humble BA with lengthy techy background. Well respected within the team and now the leading light into the investigation. I conducted a first round of investigation with a view to getting a quick resolution and getting the system up asap.

    The symptom is data corruption. The cause unknown. So I set about building a timeline from interviewing people, log file analysis, etc etc. A timeline with key events that led to the system's downfall.

    Interestingly the data corruption has been going on for some time, but was ignored, it seems. Until recently, when the problems got so bad it was not possible to ignore them.

    Imagine my delight when one such interviewee told me a manual database backup failed because of table locks, and upon further investigation, structural problems with the table which he "fixed on the spot". Ha! A smoking gun. Better check what the database maintenance procedures are. You know, good old fashioned consistency checks, repairs, error reporting, optimisation.

    None. None at all. Yippee, a smoking gun and a witness statement, and a positive ID in a police line-up.

    What to do? Repair the table. Repair the data. Audit the data. Regression test. Put live. Good idea. (Limited time to conduct a full investigation - this looks like probable cause)

    So I get as far as the regression testing when my key witness has an attack of amnesia. He now denies there were any problems with the database at all. Whatsoever.

    I have to now reopen the investigation and go back to square one.

    For any smart arses on here, we wanted to identify the high probability causes quickly and proceed with a moderate risk level (mitigated with frequent backups and rollback points) with a view to getting the system up as soon as possible.

    The point here being that Mr Amnesia has cost us a week of investigation, planning and rectifying that may well be for nothing.

    As we now have no probable cause we have to go back to the drawing board and start the investigation pretty much from scratch. I have told the business this will take a lot longer. The noose is tightening.

    Now I know the panel on here will laugh, laugh some more, slag me off, make out it's all my fault, laugh some more etc etc. But in among all that noise, all that dross and tomfoolery may be a germ of an idea, some pearl of wisdom that may invoke a thought process in me that leads to the solution.

    As such I will personally put up a financial reward for anyone that does this. If multiple people share the idea in various posts you can either split the reward or give to charity. PM me for details.

    Failing all else this will provide SASguru with some much needed mirth and pisstaking. So either way a total win win.

    Thinking caps on chaps, Miss Marples, Inspector Clouseaus, Hercule Poirots.
    Knock first as I might be balancing my chakras.

    #2
    First rule of any business investigation. Identify the "blame to" party. Which, to give you credit, you managed to do. Where it went pear-shaped was allowing him to wriggle out of it.

    You know he's lying. So you know that the cause of the db corruption was manual intervention. Why can't you continue with that?
    Down with racism. Long live miscegenation!



      #3
      Originally posted by NotAllThere
      First rule of any business investigation. Identify the "blame to" party. Which, to give you credit, you managed to do. Where it went pear-shaped was allowing him to wriggle out of it.

      You know he's lying. So you know that the cause of the db corruption was manual intervention. Why can't you continue with that?
      Good work and thanks for your swift reply. If the MySQL binary transaction log (exhibit A) records session information then he won't be able to wriggle out of anything. I have not taken "sabotage" - deliberate or mistaken - out of the equation. It is on my list of possibilities. Transaction log analysis is high up my list of next tasks.
      Knock first as I might be balancing my chakras.



        #4
        Originally posted by Lightship
        One word: documentation.
        Care to elaborate? There's a packet of bourbons riding on this for the winner.
        Knock first as I might be balancing my chakras.



          #5
          Originally posted by Lightship
          It doesn't really help you now, but you should have documented all the interviews that you conducted, and had the interviewee sign off that your interview notes were accurately recorded.

          To be a detective, you have to think and behave like one.
          Not that dumb. I have this interview recorded. His amnesia has not gone down well.
          What am I, some kind of amateur?
          Knock first as I might be balancing my chakras.



            #6
            Are you at my clientco?

            Turned up last week to put together a dashboard for a client. Client wanted historical trending. Went to look at the audit trails (none enabled).

            So I tried to look at the log file through the application. Hit the audit button and the whole system went down. Call centre staff running around, and the head of the call centre asked if I was doing anything. Not I, I said, puzzled. System restored, I hit the audit button again. Lo and behold, the system went down, lots of running around. What happened, was the question. It shouldn't do that, I replied!

            On investigation it was discovered the audit log wasn't indexed. On further investigation I discovered that the audit log held 35 million records (from 65,000 tickets). It took another day to find a trigger with a greater-than where there should have been a less-than. For 3 years it has misfired. Asked why they hadn't noticed, they said they just thought it was slow. I said I was surprised the system hadn't fallen over. No worry, we have a daily backup, they said.

            So I implemented Plan MF. I will create your missing audit tables from the log files (which is quite complex). But first I need to remove the rogue 34.5 million records, as your system will go down if I touch it. Can I have last night's backup?

            Yes of course.........

            Oh........

            It hasn't backed up because there isn't enough disk space........ FOR A YEAR!!!!!!

            So the answer to SY's question is: see what you can derive from the log files, make sure your server has enough space to take the backup, oh, and write everything up in a problem/cause/solution document.
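For the disk space point, a minimal sketch of a check a backup job could run before it starts, so a full disk fails loudly instead of silently for a year. The function names, the 1.5x headroom figure and the idea of passing in an expected backup size are all assumptions for illustration, not anything from the system described above.

```python
import shutil


def has_room_for_backup(free_bytes: int, expected_backup_bytes: int,
                        headroom: float = 1.5) -> bool:
    """Return True if free space covers the expected backup plus headroom.

    The 1.5x headroom default is an arbitrary illustrative figure.
    """
    return free_bytes >= expected_backup_bytes * headroom


def check_backup_target(path: str, expected_backup_bytes: int) -> bool:
    """Check the filesystem that holds `path` before attempting a backup."""
    free = shutil.disk_usage(path).free  # free bytes on that filesystem
    return has_room_for_backup(free, expected_backup_bytes)


if __name__ == "__main__":
    # Hypothetical target directory and a 20 GB expected dump size.
    if not check_backup_target(".", 20 * 1024 ** 3):
        raise SystemExit("Not enough free space for the backup: not starting")
```

The point of exiting with an error is that whatever schedules the job can then alert someone, rather than the backup quietly not happening.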
            What happens in General, stays in General.
            You know what they say about assumptions!



              #7
              Originally posted by MarillionFan
              Are you at my clientco?

              Turned up last week to put together a dashboard for a client. Client wanted historical trending. Went to look at the audit trails (none enabled).

              So I tried to look at the log file through the application. Hit the audit button and the whole system went down. Call centre staff running around, and the head of the call centre asked if I was doing anything. Not I, I said, puzzled. System restored, I hit the audit button again. Lo and behold, the system went down, lots of running around. What happened, was the question. It shouldn't do that, I replied!

              On investigation it was discovered the audit log wasn't indexed. On further investigation I discovered that the audit log held 35 million records (from 65,000 tickets). It took another day to find a trigger with a greater-than where there should have been a less-than. For 3 years it has misfired. Asked why they hadn't noticed, they said they just thought it was slow. I said I was surprised the system hadn't fallen over. No worry, we have a daily backup, they said.

              So I implemented Plan MF. I will create your missing audit tables from the log files (which is quite complex). But first I need to remove the rogue 34.5 million records, as your system will go down if I touch it. Can I have last night's backup?

              Yes of course.........

              Oh........

              It hasn't backed up because there isn't enough disk space........ FOR A YEAR!!!!!!

              So the answer to SY's question is: see what you can derive from the log files, make sure your server has enough space to take the backup, oh, and write everything up in a problem/cause/solution document.
              MF is well out in front at the moment. Are you seriously going to stand for this people?
              Bonus points for showing me how to analyze / replay a MySQL transaction log.
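On the analyse/replay question, a rough sketch of one way in, assuming the binary log is enabled and the files are still on disk. `mysqlbinlog` is the stock MySQL client tool for dumping a binlog as text; the helper names, file path, dates and database name below are made up for illustration. Query events in the text dump normally carry a `thread_id=` field, which is a connection id rather than a user name, so tying it to a person would need other evidence (general log, application logs).

```python
import re
import shlex
from typing import List, Optional


def build_binlog_command(binlog_path: str,
                         start: Optional[str] = None,
                         stop: Optional[str] = None,
                         database: Optional[str] = None) -> List[str]:
    """Build a mysqlbinlog argument list that dumps events as readable text."""
    # --verbose with --base64-output=DECODE-ROWS shows row events as
    # commented pseudo-SQL; good for reading, not for replaying.
    cmd = ["mysqlbinlog", "--verbose", "--base64-output=DECODE-ROWS"]
    if start:
        cmd.append(f"--start-datetime={start}")
    if stop:
        cmd.append(f"--stop-datetime={stop}")
    if database:
        cmd.append(f"--database={database}")
    cmd.append(binlog_path)
    return cmd


def thread_ids_seen(dump_text: str) -> List[int]:
    """Collect the distinct thread_id values found in a mysqlbinlog text dump."""
    return sorted({int(m) for m in re.findall(r"thread_id=(\d+)", dump_text)})


if __name__ == "__main__":
    # Hypothetical file, time window and schema name.
    cmd = build_binlog_command("/var/lib/mysql/binlog.000042",
                               start="2012-01-01 00:00:00",
                               stop="2012-01-02 00:00:00",
                               database="crm")
    print(shlex.join(cmd))  # paste into a shell and redirect to a file
```

For an actual replay the usual route is to pipe plain `mysqlbinlog` output (without the decode options) into the `mysql` client against a restored copy, never straight into live.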

              PS Absolutely love the tags on this thread. Cried with laughter. Thanks "whoever"
              Knock first as I might be balancing my chakras.



                #8
                Originally posted by Lightship
                Then, you have a record of exactly what he did when he "fixed [the structural problems] on the spot".

                So, as NAT said.....
                Oooo. Good spot. Did he change the table structure or columns? Check dates/times for indexes, last modified, modified by, and new columns. You can compare the pre-change backup with live. You can then pinpoint when said changes were made.......... Then you will have your murderer!

                BTW, I've taken down the odd Enterprise system with a mistimed change, and I've wriggled my arse out of it by pinning it on IT Support a few times. Be wary it's not me you're trying to investigate.
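A minimal sketch of that backup-versus-live comparison, assuming you have already pulled table/column/type listings out of each copy (for example from `information_schema.columns`) into plain dictionaries. The `diff_schemas` name and the sample tables are hypothetical.

```python
from typing import Dict, List

# table name -> {column name: column type}
Schema = Dict[str, Dict[str, str]]


def diff_schemas(backup: Schema, live: Schema) -> List[str]:
    """List structural differences between a backup snapshot and live."""
    changes: List[str] = []
    for table in sorted(set(backup) | set(live)):
        if table not in live:
            changes.append(f"table dropped: {table}")
            continue
        if table not in backup:
            changes.append(f"table added: {table}")
            continue
        old, new = backup[table], live[table]
        for col in sorted(set(old) | set(new)):
            if col not in new:
                changes.append(f"{table}.{col} dropped (was {old[col]})")
            elif col not in old:
                changes.append(f"{table}.{col} added ({new[col]})")
            elif old[col] != new[col]:
                changes.append(f"{table}.{col} changed: {old[col]} -> {new[col]}")
    return changes


if __name__ == "__main__":
    # Made-up snapshots purely to show the shape of the input.
    backup = {"tickets": {"id": "int(11)", "status": "varchar(10)"}}
    live = {"tickets": {"id": "int(11)", "status": "varchar(20)", "notes": "text"}}
    for line in diff_schemas(backup, live):
        print(line)
```

This only says what changed between the two snapshots, not when or by whom; the dates on the backups bracket the window, and the logs have to do the rest.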
                What happens in General, stays in General.
                You know what they say about assumptions!



                  #9
                  In the good old days we just blamed the hardware guys.
                  bloggoth

                  If everything isn't black and white, I say, 'Why the hell not?'
                  John Wayne (My guru, not to be confused with my beloved prophet Jeremy Clarkson)



                    #10
                    Originally posted by suityou01
                    Care to elaborate? There's a packet of bourbons riding on this for the winner.
                    I'll need more than a packet of bourbons.
                    +50 Xeno Geek Points
                    Come back Toolpusher, scotspine, Voodooflux. Pogle
                    As for the rest of you - DILLIGAF

                    Purveyor of fine quality smut since 2005

                    CUK Olympic University Challenge Champions 2010/2012
