PERFORMANCE TUNING

Performance tuning isn’t done in a vacuum - OS, hardware, and process all interact. You debug the bottlenecks of the system as a whole until you narrow down the cause.

System performance bottlenecks can be categorized as follows:

I/O Issues

  • Disk I/O
    • If it takes us a long time to read, everything suffers, including the page faults I reference below

    • Read Disk Queue Length

    • Read time

    • If it takes us a long time to write data, we normally hold write locks while the write completes, which ties up OTHER processes as well. (A sketch for timing writes from application code follows this Disk I/O section.)

    • Write Disk Queue Length

    • Write time

    • Factors
      • disk speed - i.e. SSD vs. legacy spinning drive.

      • If spinning drive, speed of drive
        • 7200 RPM vs 10k RPM

        • amount of on-disk cache

      • SCSI controller caching - write-through vs. write-back (true) caching.

      • If NAS or SAN
        • utilization of fabric connection
          • is your fiber switch bandwidth saturated?

          • do you have clean optics?

        • What RAID strategy is in use?
          • Some are less effective for heavy writes (like RAID 5, which pays a parity read-modify-write penalty on each write)

          • Some are more effective for fast reads (striping, as in RAID 0/10)
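
Counters like disk queue length come from OS tools (perfmon, iostat), but you can also bracket your own I/O with timers to see the latency the application actually experiences. Below is a minimal Java sketch; the file name and sizes are arbitrary choices for illustration, and force(true) is what pushes the write past the OS cache to the device:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class WriteTimer {
        public static void main(String[] args) throws IOException {
            // Hypothetical scratch file; point this at the disk you want to test.
            Path scratch = Path.of("scratch.bin");
            ByteBuffer block = ByteBuffer.allocate(64 * 1024); // one 64 KB block

            try (FileChannel ch = FileChannel.open(scratch,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                long start = System.nanoTime();
                for (int i = 0; i < 1024; i++) {   // 64 MB total
                    block.rewind();
                    ch.write(block);
                }
                ch.force(true);   // flush past the OS cache to the device
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("64 MB written + synced in %d ms%n", elapsedMs);
            }
        }
    }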

  • Memory I/O
    • Usually not really an issue except on older systems or when working with larger datasets

    • Usually this is tied to memory speed and memory bus bandwidth, which are closely coupled with the CPU’s memory bandwidth since they work together. (A rough way to measure it is sketched at the end of this section.)

    • Some architectures handle this better - mostly, newer is better

    • Big Iron, a.k.a. mainframes, have MASSIVE I/O capabilities.
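
To get a feel for memory I/O on a given box, a crude micro-benchmark is to copy a buffer much larger than the CPU caches and compute the throughput. A rough Java sketch (results vary with JIT warm-up, NUMA placement, and cache effects, so treat the number as a ballpark):

    import java.util.Arrays;

    public class MemBandwidth {
        public static void main(String[] args) {
            // 256 MB of longs; large enough to blow out the CPU caches
            long[] src = new long[32 * 1024 * 1024];
            Arrays.fill(src, 42L);
            long[] dst = new long[src.length];

            // Warm up so the JIT compiles the copy path before we time it
            System.arraycopy(src, 0, dst, 0, src.length);

            long start = System.nanoTime();
            int passes = 10;
            for (int p = 0; p < passes; p++) {
                System.arraycopy(src, 0, dst, 0, src.length);
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            // Each pass reads 256 MB and writes 256 MB
            double gb = passes * 2 * (src.length * 8L) / 1e9;
            System.out.printf("Effective memory bandwidth: %.1f GB/s%n", gb / seconds);
        }
    }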

Capacity Issues

  • Memory Capacity
    • Without enough memory, you have a lot of page faults, which hurts performance
      • a page fault is when the page a process needs is not in physical memory and has to be pulled in from swap/disk-backed virtual memory.

    • page faults/sec
      • how often we find that what we need is not currently in physical memory
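
On Linux you can watch the fault rate yourself by sampling the kernel’s counters in /proc/vmstat (on Windows, perfmon’s Memory\Pages/sec plays the same role). A minimal Java sketch, assuming a Linux /proc filesystem:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class FaultRate {
        // Reads a counter like "pgmajfault 12345" out of /proc/vmstat (Linux only)
        static long vmstat(String key) throws IOException {
            for (String line : Files.readAllLines(Path.of("/proc/vmstat"))) {
                if (line.startsWith(key + " ")) {
                    return Long.parseLong(line.substring(key.length() + 1).trim());
                }
            }
            throw new IllegalStateException(key + " not found");
        }

        public static void main(String[] args) throws Exception {
            long before = vmstat("pgmajfault");   // major faults: had to go to disk
            Thread.sleep(1000);
            long after = vmstat("pgmajfault");
            System.out.println("major page faults/sec: " + (after - before));
        }
    }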

  • Processor Capacity
    • If you have limited processor capacity, you will see a lot of context switches as the scheduler round-robins between processes to give each a CPU time slice (see the counter sketch after this list)

    • Process-level context switches are expensive from a performance standpoint.

    • If you have a lot of things that need to run concurrently, it’s better to have a multiprocessor system (which is pretty much the norm now)

    • CPU Utilization
      • if every clock cycle is packed with instructions and you don’t have any more clock cycles to give, you’re done.

      • If your system runs at 88% CPU utilization under a normal load, you don’t have much headroom left when you need to push it hard for a big job.

      • You probably need more cores.
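
On Linux, vmstat 1 reports the context-switch rate directly; the same number can be sampled from the ctxt line of /proc/stat. A minimal Java sketch, again assuming a Linux /proc filesystem:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class CtxSwitchRate {
        // Reads the system-wide context switch counter from /proc/stat (Linux only)
        static long contextSwitches() throws IOException {
            for (String line : Files.readAllLines(Path.of("/proc/stat"))) {
                if (line.startsWith("ctxt ")) {
                    return Long.parseLong(line.substring(5).trim());
                }
            }
            throw new IllegalStateException("ctxt line not found");
        }

        public static void main(String[] args) throws Exception {
            long before = contextSwitches();
            Thread.sleep(1000);
            long after = contextSwitches();
            System.out.println("context switches/sec: " + (after - before));
        }
    }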

Programming/Process Issues

  • Use of blocking I/O
    • It’s better to run I/O in a separate thread or thread pool rather than tie it to the main program’s execution thread, since most I/O is blocking (see the thread-pool sketch after this list).
      • by blocking I mean the program basically halts until the I/O finishes before it continues.

    • Waiting on Network I/O
      • is the network slow? Then so is your I/O - and if it’s blocking I/O, even worse.

    • Thread Contention/locks
      • Concurrent programming requires thread synchronization and lock/unlock/notify mechanisms to access shared objects. Done poorly, this causes thread contention: your threads sit waiting in line to access an object for longer than needed, or in the worst case indefinitely - which is called thread deadlock (see the lock-ordering sketch below).
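
As promised above, here is a minimal sketch of moving blocking I/O off the main thread with a Java ExecutorService. The file name and pool size are arbitrary; the point is that the main thread only blocks when it actually needs the result:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class AsyncRead {
        public static void main(String[] args) throws Exception {
            ExecutorService ioPool = Executors.newFixedThreadPool(4);

            // The slow read happens on a pool thread; the main thread keeps going.
            // "data.txt" is a stand-in file name for illustration.
            Future<String> pending =
                    ioPool.submit(() -> Files.readString(Path.of("data.txt")));

            doOtherWork();   // main thread is NOT blocked on the read

            String contents = pending.get();  // block only when we need the result
            System.out.println("read " + contents.length() + " chars");
            ioPool.shutdown();
        }

        static void doOtherWork() {
            System.out.println("working while I/O is in flight...");
        }
    }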
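
And a minimal illustration of one classic deadlock cause, inconsistent lock ordering. If two threads take the same pair of locks in opposite orders, each can end up holding one lock and waiting forever on the other; acquiring locks in a single global order prevents the cycle:

    public class LockOrdering {
        private final Object lockA = new Object();
        private final Object lockB = new Object();

        // Deadlock-prone: a thread in here holds lockB while another thread
        // holding lockA (via transferGood) waits on lockB - and vice versa.
        void transferBad() {
            synchronized (lockB) {
                synchronized (lockA) { /* ... */ }
            }
        }

        // Safer: every thread acquires the locks in the same global order,
        // so a cycle of waiters can never form.
        void transferGood() {
            synchronized (lockA) {
                synchronized (lockB) { /* ... */ }
            }
        }
    }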

  • Programming paradigm
    • parallel/concurrent programming with multiple threads lets your process utilize multiple CPU cores and take advantage of the horsepower that is available.

    • legacy apps often aren’t multithreaded well, or at all. That means on a 24-core system, the OS can schedule your thread on any one of those cores, but only one: all processing is done in a single thread, so execution is linear, not concurrent.
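
A small Java sketch of the difference: the same computation run on one thread versus fanned out across every available core with a parallel stream. The workload here is a made-up sum, chosen only because it splits cleanly:

    import java.util.stream.LongStream;

    public class ParallelSum {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            System.out.println("available cores: " + cores);

            // Single-threaded: one core does all the work, the rest sit idle
            long serial = LongStream.rangeClosed(1, 1_000_000_000L).sum();

            // Parallel: the common fork/join pool fans the range out across cores
            long parallel = LongStream.rangeClosed(1, 1_000_000_000L).parallel().sum();

            System.out.println(serial == parallel ? "same result, more cores used"
                                                  : "unreachable");
        }
    }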

  • Use of on-disk files as caching mechanisms
    • bad idea from an I/O-time perspective (an in-memory alternative is sketched after this list)

    • bad idea from a disk life expectancy standpoint.
      • SSD life expectancy, for instance, is bounded by a finite number of write/erase cycles.

      • The less you write, the longer its life.
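
A minimal in-memory alternative to an on-disk cache file: a small LRU cache built on Java’s LinkedHashMap, which evicts the least recently used entry once a size cap is hit. The capacity of 2 is just to make the eviction visible:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A tiny in-memory LRU cache: LinkedHashMap in access order evicts the
    // least recently used entry once maxEntries is exceeded. No disk writes.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public LruCache(int maxEntries) {
            super(16, 0.75f, true);   // true = access order, needed for LRU
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }

        public static void main(String[] args) {
            LruCache<String, String> cache = new LruCache<>(2);
            cache.put("a", "1");
            cache.put("b", "2");
            cache.get("a");          // touch "a" so "b" becomes the eldest
            cache.put("c", "3");     // evicts "b"
            System.out.println(cache.keySet());   // [a, c]
        }
    }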

  • The use of ill-suited data structures
    • for example, treating a flat file parsed line by line as your data structure works fine when it’s a 200 kB file, not so much with a 4.5 GB file
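
A sketch of that data-structure point: pay the parse cost once and build a map, rather than re-scanning the file for every lookup. The users.txt file and its id,name format are made up for illustration; for data too big to fit in memory, the same idea points you at an indexed store or database rather than a HashMap:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    public class IndexedLookup {
        public static void main(String[] args) throws IOException {
            Map<String, String> index = new HashMap<>();

            // Pay the O(n) parse cost once, up front...
            for (String line : Files.readAllLines(Path.of("users.txt"))) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    index.put(parts[0], parts[1]);
                }
            }

            // ...then every lookup is O(1) instead of another full-file scan.
            System.out.println(index.get("42"));
        }
    }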