Job Failure Detection can be used to prevent problematic jobs from wasting render time on the farm. There are two types of failure detection, job failure detection and slave failure detection, both explained below. By default, jobs will fail after they have accumulated 100 errors, but this can be changed in the Job Settings in the Repository Options.
Job Failure Detection
A job enters the Failed state when it has accumulated the maximum number of errors permitted. Once in the Failed state, the job will no longer be picked up by slaves for rendering without manual intervention. This ensures that problematic jobs are flagged and stop consuming render time. In the Repository Options, you can set up failure settings for an entire job or for individual tasks.
If you've resolved the problem that was causing the job to fail, you can right-click on it in the Monitor and select Resume Failed Job. You will be prompted with the option to ignore failed job and task detection going forward.
If you choose not to ignore failure detection, make sure to clear the job's errors. Otherwise, the next error will cause the job to fail again, because its error count is still at or above the maximum. To clear a job's errors, right-click on it and select Job Reports -> Clear Error Reports.
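The interaction between the error limit, resuming, and clearing errors can be sketched with a small model. This is only an illustration of the behavior described above, not the product's implementation or scripting API: the class name JobModel and all of its attributes and methods are hypothetical, and the only value taken from the text is the default limit of 100 errors.

```python
from dataclasses import dataclass


@dataclass
class JobModel:
    """Hypothetical model of job failure detection (not a real API)."""
    error_limit: int = 100                  # default limit stated in the text
    error_count: int = 0
    failed: bool = False
    ignore_failure_detection: bool = False

    def report_error(self) -> None:
        # Every error is counted; the job fails once the limit is reached,
        # unless failure detection is being ignored for this job.
        self.error_count += 1
        if not self.ignore_failure_detection and self.error_count >= self.error_limit:
            self.failed = True

    def resume(self, ignore_failure_detection: bool = False) -> None:
        # Models "Resume Failed Job": the accumulated errors are NOT cleared.
        self.failed = False
        self.ignore_failure_detection = ignore_failure_detection

    def clear_errors(self) -> None:
        # Models "Job Reports -> Clear Error Reports".
        self.error_count = 0


job = JobModel(error_limit=3)
for _ in range(3):
    job.report_error()
failed_at_limit = job.failed

job.resume()                 # resumed without clearing errors
job.report_error()
failed_again = job.failed    # one new error is enough to fail again

job.resume()
job.clear_errors()           # resumed and cleared
job.report_error()
failed_after_clear = job.failed
```

In this model, resuming without clearing leaves the count at the limit, so a single new error fails the job again, while clearing the errors gives the job a fresh allowance.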
Slave Failure Detection
Slave failure detection works a little differently from job failure detection. If a particular slave reports consecutive errors for a job, it adds itself to that job's bad slave list.
When a slave is on a job's bad list, it won't try to render that job again until all good jobs in the queue have finished. This helps ensure that one or two problematic slaves won't continue reporting errors for jobs they are likely unable to render. Once the problem has been resolved, you can remove a slave from a job's bad list by right-clicking on the job in the Monitor and selecting View Bad Slaves. You can also choose to have your job ignore bad slave detection if you wish.
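The bad slave behavior can be sketched in the same way. Again, this is a simplified model under stated assumptions rather than the actual implementation: TrackedJob, pick_job, and bad_slave_threshold are hypothetical names, the threshold value is arbitrary because the text does not say how many consecutive errors are required, and resetting the counter on a successful task is an assumption drawn from the word "consecutive".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class TrackedJob:
    """Hypothetical model of a job's bad slave list (not a real API)."""
    name: str
    bad_slave_threshold: int = 5            # assumed value, not from the text
    ignore_bad_slave_detection: bool = False
    consecutive_errors: Dict[str, int] = field(default_factory=dict)
    bad_slaves: Set[str] = field(default_factory=set)

    def report_error(self, slave: str) -> None:
        count = self.consecutive_errors.get(slave, 0) + 1
        self.consecutive_errors[slave] = count
        if not self.ignore_bad_slave_detection and count >= self.bad_slave_threshold:
            self.bad_slaves.add(slave)

    def report_success(self, slave: str) -> None:
        # Assumption: a successful task breaks the run of consecutive errors.
        self.consecutive_errors[slave] = 0

    def remove_bad_slave(self, slave: str) -> None:
        # Models removing a slave through "View Bad Slaves".
        self.bad_slaves.discard(slave)
        self.consecutive_errors[slave] = 0


def pick_job(slave: str, queue: List[TrackedJob]) -> Optional[TrackedJob]:
    """Prefer jobs the slave is not bad for; fall back only when none remain."""
    good = [job for job in queue if slave not in job.bad_slaves]
    if good:
        return good[0]
    return queue[0] if queue else None
```

With this model, a slave that has marked itself bad for one job keeps picking up other jobs first and only returns to the problematic job when nothing else is left in the queue.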